Planet Haskell

August 24, 2025

Abhinav Sarkar

A Fast Bytecode VM for Arithmetic: The Compiler

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell.

This post was originally published on abhinavsarkar.net.

This is the second post in a series of posts:

  1. A Fast Bytecode VM for Arithmetic: The Parser
  2. A Fast Bytecode VM for Arithmetic: The Compiler
  3. A Fast Bytecode VM for Arithmetic: The Virtual Machine

In this post, we write the compiler for our AST to bytecode, and a decompiler for the bytecode.

Introduction

AST interpreters are well known to be slow because of how AST nodes are represented in the computer’s memory. The AST nodes contain pointers to other nodes, which may be anywhere in the memory. So while interpreting an AST, the interpreter jumps all over the memory, causing a slowdown. One solution to this is to convert the AST into a more compact and optimized representation known as Bytecode.

Bytecode is a flattened and compact representation of a program, usually manifested as a byte array. Bytecode is essentially an Instruction Set (IS), but custom-made to be executed by a Virtual Machine (VM) instead of a physical machine. Each instruction’s opcode is one byte in size (that’s where bytecode gets its name), though some instructions carry additional operand bytes. A bytecode and its VM are created in synergy so that the execution is as efficient as possible1. Compiling source code to bytecode and executing it in a VM also allows the program to be run on all platforms that the VM supports, without the developer caring much about portability concerns. The most popular combo of bytecode and VM is probably the Java bytecode and the Java virtual machine.

VMs can be stack-based or register-based. In a stack-based VM, all values created during the execution of a program are stored only in a stack data structure residing in memory. In a register-based VM, there is also a fixed set of registers that are used to store values in preference to the stack2. Register-based VMs are usually faster, but stack-based VMs are simpler to implement. For our purpose, we choose to implement a stack-based VM.

We are going to write a compiler that compiles our expression AST to bytecode. But first, let’s design the bytecode for our stack-based VM.

The Bytecode

Here is our expression AST as a reminder:

data Expr
  = Num !Int16
  | Var !Ident
  | BinOp !Op Expr Expr
  | Let !Ident Expr Expr
  deriving (Eq, Generic)

newtype Ident = Ident {unIdent :: BS.ByteString}
  deriving (Eq, Ord, Generic, Hashable)

data Op = Add | Sub | Mul | Div deriving (Eq, Enum, Generic)
ArithVMLib.hs

Let’s figure out the right bytecode for each case. First, we create Opcodes, which are mnemonics for the actual bytecodes. Think of them as what Assembly is to Machine Code.

Num

For a number literal, we need to put it directly in the bytecode so that we can use it later during the execution. We also need an opcode to push it on the stack. Let’s call it OPush with an Int16 parameter.

BinOp

Binary operations recursively use Expr for their operands. To evaluate a binary operation, we need its operands to be evaluated before, so we compile them first to bytecode. After that, all we need is an opcode per operator. Let’s call them OAdd, OSub, OMul, and ODiv for Add, Sub, Mul, and Div operators respectively.

Var and Let

Variables and Let expressions are more complex3. In the AST interpreter we chucked the variables in a map, but we cannot do that in a VM. There is no environment map in a VM, and all values must reside in the stack. How do we have variables at all then? Let’s think for a bit.

Each expression, after being evaluated in the VM, must push exactly one value on the stack: its result. Num expressions are a trivial case. When a binary operation is evaluated, first its left operand is evaluated. That pushes one value on the stack. Then its right operand is evaluated, and that pushes another value on the stack. Finally, the operation pops the two values from the top of the stack, does its thing, and pushes the resultant value back on the stack—again one value for the entire BinOp expression.

A Let expression binds a variable’s value to its name, and then the variable can be referred to from the body of the expression. But how can we refer to a variable when the stack contains only values, not names? Let’s imagine that we are in the middle of evaluating a large expression, wherein we encounter a Let expression. First we evaluate its assignment expression, and that leaves a value on the top of the stack. Let’s say that the stack has n values at this point. After this we get to evaluate the body expression. At all times while we are doing that, the value from the assignment stays at the same place in the stack, because evaluating sub-expressions, no matter how complicated, only adds new values to the stack, without popping existing values from before. Therefore, we can use the stack index of the assignment value (n−1) to refer to it within the body expression. So, we encode Var as an opcode and an integer index into the stack.

We choose to use a Word8 to index the stack, limiting us to a stack depth of 256. We encode the variable references with an opcode OGet, which when executed gets the value from the stack at the given index and pushes it on the stack.

For a Let expression, after we compile its assignment and body expressions, we need to make sure that the exactly-one-value invariant holds. Evaluating the assignment and body pushes two values on the stack, but we can have only one! So we overwrite the assignment value with the body value, and pop the stack to remove the body value. We invent a new opcode OSwapPop to do this, called so because its effect is equivalent to swapping the topmost two values on the stack, and then popping the new top value4.
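To make this concrete, here is how the expression let x = 4 in x + 1 compiles (as we’ll verify in the tests later), with the stack contents during the eventual execution shown after each instruction:

OPush 4    -- stack: [4]        assignment value lands at index 0
OGet 0     -- stack: [4, 4]     x refers to stack index 0
OPush 1    -- stack: [4, 4, 1]
OAdd       -- stack: [4, 5]     the body's value
OSwapPop   -- stack: [5]        exactly one value remains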

Putting all the opcodes together, we have the Opcode ADT:

data Opcode
  = OPush !Int16        -- 0
  | OSwapPop            -- 1
  | OGet !Word8         -- 2
  | OAdd                -- 3
  | OSub                -- 4
  | OMul                -- 5
  | ODiv                -- 6
  deriving (Show, Read, Eq, Generic)

instance NFData Opcode
ArithVMLib.hs

Notice that we also assigned bytecodes—that is, a unique byte value—to each Opcode above, which are just their ordinals. Now we are ready to write the compiler.

The Compiler

The compiler takes an expression along with its bytecode size, and compiles it to a strict ByteString of that size. Recall that in the previous post, we wrote our parser such that the bytecode size for each AST node was calculated while parsing it. This allows us to pre-allocate a bytestring of the required size before compiling the AST. We compile to actual bytes here, and don’t use the opcodes.

type Bytecode = BS.ByteString

compile :: SizedExpr -> Result Bytecode
compile = compile' defaultStackSize

compile' :: Int -> SizedExpr -> Result Bytecode
compile' stackSize (expr, bytecodeSize) =
  uncurry (fmap . const) . BSI.unsafeCreateUptoN' bytecodeSize $ \fp -> do
    (bytecodeSize,)
      <$> fmap Right (go Map.empty 0 fp fp expr >>= checkSize fp . TS.fst)
        `catch` (pure . Left)
  where
    go env !sp fp !ip = \case
      Num _ | sp + 1 > stackSize -> throwCompileError "Stack overflow"
      Num n -> do
        let !lb = fromIntegral $ n .&. 0xff
            !mb = fromIntegral $ ((fromIntegral n :: Word16) .&. 0xff00) `shiftR` 8
        writeByte ip 0 -- OPush
        writeByte (ip `plusPtr` 1) lb
        writeByte (ip `plusPtr` 2) mb
        pure (ip `plusPtr` 3 :!: sp + 1)
      BinOp op a b -> do
        (ip' :!: sp') <- go env sp fp ip a
        (ip'' :!: sp'') <- go env sp' fp ip' b
        writeByte ip'' $ translateOp op
        pure (ip'' `plusPtr` 1 :!: sp'' - 1)
      Let x assign body -> do
        (ip' :!: sp') <- go env sp fp ip assign
        (ip'' :!: sp'') <- go (Map.insert x sp env) sp' fp ip' body
        writeByte ip'' 1 -- OSwapPop
        pure (ip'' `plusPtr` 1 :!: sp'' - 1)
      Var _ | sp + 1 > stackSize -> throwCompileError "Stack overflow"
      Var x -> case Map.lookup x env of
        Nothing ->
          throwCompileError $ "Unknown variable: " <> BSC.unpack (unIdent x)
        Just varScope
          | varScope < stackSize && varScope < fromIntegral (maxBound @Word8) -> do
              writeByte ip 2 -- OGet
              writeByte (ip `plusPtr` 1) $ fromIntegral varScope
              pure (ip `plusPtr` 2 :!: sp + 1)
        Just _ -> throwCompileError "Stack overflow"
      where
        ep = fp `plusPtr` bytecodeSize

        writeByte :: Ptr Word8 -> Word8 -> IO ()
        writeByte !ip !val
          | ip < ep = poke ip val
          | otherwise = throwCompileError $
              "Instruction index " <> show (ip `minusPtr` fp) <>
              " out of bound " <> show (bytecodeSize - 1)

    translateOp = \case
      Add -> 3 -- OAdd
      Sub -> 4 -- OSub
      Mul -> 5 -- OMul
      Div -> 6 -- ODiv

    checkSize fp ip = do
      let actualBytecodeSize = ip `minusPtr` fp
      unless (actualBytecodeSize == bytecodeSize) $
        throwCompileError $
          ("Compiled bytecode size " <> show actualBytecodeSize)
            <> (" is not same as expected size: " <> show bytecodeSize)

    throwCompileError = throwIO . Error Compile

defaultStackSize :: Int
defaultStackSize = 256
ArithVMLib.hs

We use the unsafeCreateUptoN' function from the Data.ByteString.Internal module that allocates enough memory for the provided bytecode size, and gives us a pointer to the allocated memory. We call this pointer fp for frame pointer. Then we traverse the AST recursively, writing bytes for opcodes and arguments for each case. We use pointer arithmetic and the poke function to write the bytes. Int16 numbers are encoded as two bytes in little endian fashion.
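For example, OPush 258 is encoded as the three bytes 00 02 01: the opcode 0, followed by the low byte 0x02 and then the high byte 0x01 of 258 = 0x0102.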

In the recursive traversal function go, we pass and return the current stack pointer sp and instruction pointer ip, along with the frame pointer fp. We update these correctly for each case5. We also take care of checking that the pointers stay in the right bounds, failing which we throw appropriate errors.

We also pass an env parameter that is similar to the variable names to values environment we use in the AST interpreter, but this one tracks variable names to stack indices at which they reside. We update this information before compiling the body of a Let expression to capture the stack index of its assignment value. When compiling a Var expression, we use the env map to lookup the variable’s stack index, and encode it in the bytecode.

At the end of compilation, we check that the entire bytestring is filled with bytes till the very end, failing which we throw an error. This check is required because otherwise the bytestring may have garbage bytes, and may fail inexplicably during execution.

All the errors are thrown in the IO monad using the throwIO function, and are caught after compilation using the catch function. The final result or error is returned wrapped into Result.

Let’s see it in action:

$ echo -n "1 + 2 - 3 * 4" | arith-vm compile | hexdump -C
00000000  00 01 00 00 02 00 03 00  03 00 00 04 00 05 04     |...............|
0000000f
$ echo -n "let x = 4 in let y = 5 in x + y" | arith-vm compile | hexdump -C
00000000  00 04 00 00 05 00 02 00  02 01 03 01 01           |.............|
0000000d

You can verify that the resultant bytes are indeed correct. Reading raw bytes is difficult, though; we’ll fix this in a minute. Meanwhile, let’s ponder some performance characteristics of our compiler.

Compiling, Fast and Slow

You may be wondering why I chose to write the compiler in this somewhat convoluted way of pre-allocating a bytestring and using pointers. The answer is: performance. I didn’t actually start with pointers. I iterated through many different data and control structures to find the fastest one.

The table below shows the compilation times for a benchmark expression file when using different data structures to implement the compile' function:

| Data structure | Time (ms) | Incremental speedup | Overall speedup |
| --- | --- | --- | --- |
| List | 4345 | 1x | 1x |
| Seq | 523 | 8.31x | 8.31x |
| DList | 486 | 1.08x | 8.94x |
| BS Builder | 370 | 1.31x | 11.74x |
| Pre-allocated BS | 54 | 6.85x | 80.46x |
| Bytearray | 52 | 1.02x | 83.55x |

I started with the bread-and-butter data structure of Haskellers, the humble and known to be slow List, which was indeed quite slow. Next, I moved on to Seq and thereafter DList, which are known to be faster at concatenation/consing. Then I abandoned the use of intermediate data structures completely, choosing to use a bytestring Builder to create the bytestring. Finally, I had the epiphany that the bytestring size was known at compile time, and rewrote the function to pre-allocate the bytestring, thereby reaching the fastest solution.
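For the curious, here is a minimal sketch of what the Builder-based variant might look like. This is my reconstruction, not the code from the benchmark: the stack-depth and bounds checks are elided, and translateOp is the same opcode-to-byte mapping used in compile' below.

import qualified Data.ByteString.Builder as BSB
import qualified Data.Map.Strict as Map

-- Accumulate bytes in a Builder instead of writing through pointers.
compileBuilder :: Expr -> BSB.Builder
compileBuilder = go Map.empty 0
  where
    go env sp = \case
      Num n -> BSB.word8 0 <> BSB.int16LE n -- OPush, little-endian operand
      BinOp op a b ->
        go env sp a <> go env (sp + 1) b <> BSB.word8 (translateOp op)
      Let x assign body ->
        go env sp assign
          <> go (Map.insert x sp env) (sp + 1) body
          <> BSB.word8 1 -- OSwapPop
      Var x -> case Map.lookup x env of
        Just i -> BSB.word8 2 <> BSB.word8 (fromIntegral i) -- OGet
        Nothing -> error "unknown variable" -- the real compiler returns a Result

    translateOp = \case
      Add -> 3
      Sub -> 4
      Mul -> 5
      Div -> 6

Materializing the result with BSB.toLazyByteString builds the bytes in chunks; the pre-allocated version avoids that intermediate machinery entirely.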

I also tried using Bytearray, which has more-or-less same performance of bytestring, but it is inconvenient to use because there are no functions to do IO with bytearrays. So I’d anyway need to use bytestrings for reading from STDIN or writing to STDOUT, and converting to-and-fro between bytearray and bytestring is a performance killer. Thus, I decided to stick to bytestrings.

The pre-allocated bytestring approach is 80 times faster than using lists, and almost 10 times faster than using Seq. For such a gain, I’m okay with the complications it brings to the code. Here are the numbers in a chart (smaller is better):

[Chart: compilation time using different data structures]

Another choice I had to make was how to write the go function. I ended up passing and returning pointers and environment map, and throwing errors in IO, but a number of solutions are possible. I tried out some of them, and noted the compilation times for the benchmark expression file:

| Control structure | Time (ms) | Slowdown |
| --- | --- | --- |
| IO | 57.4 | 1.00x |
| IO + IORef | 65.0 | 1.13x |
| IO + ReaderT | 60.9 | 1.06x |
| IO + StateT | 65.6 | 1.14x |
| IO + ExceptT | 65.9 | 1.15x |
| IO + ReaderT + ExceptT | 107.1 | 1.87x |
| IO + StateT + ExceptT | 383.9 | 6.69x |
| IO + StateT + ReaderT | 687.5 | 11.98x |
| IO + StateT + ReaderT + ExceptT | 702.0 | 12.23x |
| IO + CPS | 78.2 | 1.36x |
| IO + DCPS | 78.4 | 1.37x |
| IO + ContT | 76.5 | 1.33x |

I tried putting the pointers in IORefs and StateT state instead of passing them back and forth. I tried putting the environment in a ReaderT config. I tried using ExceptT for throwing errors instead of using IO errors. Then I tried various combinations of these monad transformers.

Finally, I also tried converting the go function to be tail-recursive using Continuation-passing style (CPS), then defunctionalizing the continuations, as well as using the ContT monad transformer. All of these approaches resulted in slower code. The times are interesting to compare (smaller is better):

[Chart: compilation time using different control structures]

There is no reason to use IORefs here because they result in slower and uglier code. Using one monad transformer at a time results in slight slowdowns, which may be worth the improvement in the code. But using more than one of them degrades performance by a lot. Also, there is no improvement from the CPS conversion, because GHC is smart enough to optimize the non-tail-recursive code to be faster than the handwritten tail-recursive one that allocates a lot of closures (or objects, in the case of defunctionalization).

Moving on …

The Decompiler

It is a hassle to read raw bytes in the compiler output. Let’s write a decompiler to aid us in debugging and testing the compiler. First, a disassembler that converts bytes to opcodes:

type Program = Seq Opcode

disassemble :: Bytecode -> Result Program
disassemble bytecode = go 0 Seq.empty
  where
    !size = BS.length bytecode

    go !ip !program
      | ip == size = pure program
      | otherwise = case readInstr bytecode ip of
          0 | ip + 2 < size ->
            go (ip + 3) $ program |> OPush (readInstrArgInt16 bytecode ip)
          0 -> throwIPOOBError $ ip + 2
          1 -> go (ip + 1) $ program |> OSwapPop
          2 | ip + 1 < size ->
            go (ip + 2) $ program |> OGet (readInstrArgWord8 bytecode ip)
          2 -> throwIPOOBError $ ip + 1
          3 -> go (ip + 1) $ program |> OAdd
          4 -> go (ip + 1) $ program |> OSub
          5 -> go (ip + 1) $ program |> OMul
          6 -> go (ip + 1) $ program |> ODiv
          n -> throwDisassembleError $
            "Invalid bytecode: " <> show n <> " at: " <> show ip

    throwIPOOBError ip = throwDisassembleError $
      "Instruction index " <> show ip <> " out of bound " <> show (size - 1)

    throwDisassembleError = throwError . Error Disassemble
ArithVMLib.hs

A disassembled program is a sequence of opcodes. We simply go over each byte of the bytecode, and append the right opcode for it to the program, along with any parameters it may have. Note that we do not verify that the disassembled program is correct.
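For example, disassembling the thirteen bytes we compiled above for let x = 4 in let y = 5 in x + y yields:

fromList [OPush 4, OPush 5, OGet 0, OGet 1, OAdd, OSwapPop, OSwapPop]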

Here are the helpers that read instruction bytes and their arguments from a bytestring:

readInstr :: BS.ByteString -> Int -> Word8
readInstr = BS.unsafeIndex
{-# INLINE readInstr #-}

readInstrArgWord8 :: BS.ByteString -> Int -> Word8
readInstrArgWord8 bytecode ip = readInstr bytecode (ip + 1)
{-# INLINE readInstrArgWord8 #-}

readInstrArgInt16 :: BS.ByteString -> Int -> Int16
readInstrArgInt16 bytecode ip =
  let lb = readInstr bytecode (ip + 1)
      mb = readInstr bytecode (ip + 2)
      b1 :: Word16 = fromIntegral lb
      b2 = fromIntegral mb `shiftL` 8
   in fromIntegral (b1 .|. b2)
{-# INLINE readInstrArgInt16 #-}
ArithVMLib.hs
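A quick sanity check of these helpers in GHCi (a hypothetical session), using the OPush 258 encoding mentioned earlier:

λ> readInstrArgInt16 (BS.pack [0, 2, 1]) 0
258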

Next, we decompile the opcodes to an expression:

decompile :: Program -> Result Expr
decompile program = do
  stack <- go Seq.empty program
  checkStack Decompile maxBound $ length stack
  let ast :<| _ = stack
  pure ast
  where
    go stack = \case
      Seq.Empty -> pure stack
      opcode :<| rest -> case opcode of
        OPush n -> go (stack |> Num n) rest
        OAdd -> decompileBinOp Add >>= flip go rest
        OSub -> decompileBinOp Sub >>= flip go rest
        OMul -> decompileBinOp Mul >>= flip go rest
        ODiv -> decompileBinOp Div >>= flip go rest
        OGet i -> go (stack |> Var (mkIdent $ mkName $ fromIntegral i)) rest
        OSwapPop -> decompileLet >>= flip go rest
      where
        decompileBinOp op = case stack of
          stack' :|> a :|> b -> pure $ stack' |> BinOp op a b
          _ -> throwDecompileError $
            "Not enough elements to decompile binary operation: " <> show op

        decompileLet = case stack of
          stack' :|> a :|> b ->
            pure $ stack' |> Let (mkIdent $ mkName $ length stack - 2) a b
          _ -> throwDecompileError "Not enough elements to decompile let"

    mkName i = names `Seq.index` i
    names = Seq.fromList $ tail $ combinations 2

    combinations = \case
      0 -> [""]
      n -> let prev = combinations (n - 1)
        in prev <> [x : xs | x <- ['a' .. 'z'], xs <- prev]

    throwDecompileError = throwError . Error Decompile

checkStack :: (MonadError Error m) => Pass -> Int -> Int -> m ()
checkStack pass stackSize = \case
  1 -> pure ()
  0 -> throwError $ Error pass "Final stack has no elements"
  n | n > stackSize -> throwError . Error pass $ "Stack overflow"
  n | n > 1 -> throwError . Error pass $ "Final stack has more than one element"
  _ -> throwError . Error pass $ "Stack underflow"
ArithVMLib.hs

Decompilation is the opposite of compilation. While compiling, there is an implicit stack of expressions that are yet to be compiled. We make that stack explicit here, capturing expressions as they are decompiled from opcodes. For compound expressions, we inspect the stack and use the already decompiled expressions as the operands of the expression being decompiled. This way we build up larger expressions from smaller ones, culminating in the single top-level expression at the end. Finally, we check the stack to make sure that there is only one expression left in it. Note that, like the disassembler, we do not verify that the decompiled expression is correct.

There is one tricky thing in decompilation: we lose the names of the variables when compiling, and are left with only stack indices. So while decompiling, we generate variable names from their stack indices by indexing a list of unique names.
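For example, decompiling the bytecode for let x = 4 in x + 1 recovers the same structure but with a generated name, as something like let a = 4 in a + 1.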

That’s all for compilation and decompilation. Now, we use them together to make sure that everything works.

Testing the Compiler

We write some unit tests for the compiler, targeting both success and failure cases:

compilerSpec :: Spec
compilerSpec = describe "Compiler" $ do
  forM_ compilerSuccessTests $ \(input, result) ->
    it ("compiles: \"" <> BSC.unpack input <> "\"") $ do
      parseCompile input `shouldBe` Right (Seq.fromList result)

  forM_ compilerErrorTests $ \(input, err) ->
    it ("fails for: \"" <> BSC.unpack input <> "\"") $ do
      parseCompile input `shouldSatisfy` \case
        Left (Error Compile msg) | err == msg -> True
        _ -> False

  it "fails for greater sized expr" $ do
    compile (Num 1, 4) `shouldSatisfy` \case
      Left
        ( Error Compile "Compiled bytecode size 3 is not same as expected size: 4"
        ) -> True
      _ -> False

  it "fails for lesser sized expr" $ do
    compile (Num 1, 2) `shouldSatisfy` \case
      Left (Error Compile "Instruction index 2 out of bound 1") -> True
      _ -> False
  where
    parseCompile = parseSized >=> compile' 4 >=> disassemble

compilerSuccessTests :: [(BSC.ByteString, [Opcode])]
compilerSuccessTests =
  [ ( "1",
      [OPush 1]
    ),
    ( "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1",
      [ OPush 1, OPush 2, OAdd, OPush 3, OPush 4, OMul, OSub, OPush 5, OPush 6,
        ODiv, OPush 1, ODiv, OAdd, OPush 1, OAdd ]
    ),
    ( "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)",
      [ OPush 1, OPush 2, OPush 3, OSub, OPush 4, OMul, OAdd, OPush 5, OPush 6,
        ODiv, OPush 1, OPush 1, OAdd, ODiv, OAdd ]
    ),
    ( "let x = 4 in x + 1",
      [OPush 4, OGet 0, OPush 1, OAdd, OSwapPop]
    ),
    ( "let x = 4 in let y = 5 in x + y",
      [OPush 4, OPush 5, OGet 0, OGet 1, OAdd, OSwapPop, OSwapPop]
    ),
    ( "let x = 4 in let x = x + 1 in x + 2",
      [OPush 4, OGet 0, OPush 1, OAdd, OGet 1, OPush 2, OAdd, OSwapPop, OSwapPop]
    ),
    ( "let x = let y = 3 in y + y in x * 3",
      [ OPush 3, OGet 0, OGet 0, OAdd, OSwapPop, OGet 0, OPush 3, OMul, OSwapPop ]
    ),
    ( "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3",
      [ OPush 1, OPush 2, OGet 1, OGet 1, OMul, OSwapPop, OAdd, OGet 0, OPush 1,
        OAdd, OSwapPop, OGet 0, OPush 3, OMul, OSwapPop ]
    ),
    ("1/0", [OPush 1, OPush 0, ODiv]),
    ("-32768 / -1", [OPush (-32768), OPush (-1), ODiv])
  ]

compilerErrorTests :: [(BSC.ByteString, String)]
compilerErrorTests =
  [ ("x", "Unknown variable: x"),
    ("let x = 4 in y + 1", "Unknown variable: y"),
    ("let x = y + 1 in x", "Unknown variable: y"),
    ("let x = x + 1 in x", "Unknown variable: x"),
    ("let x = 4 in let y = 1 in let z = 2 in y + x", "Stack overflow"),
    ("let x = 4 in let y = 5 in x + let z = y in z * z", "Stack overflow"),
    ("let a = 0 in let b = 0 in let c = 0 in let d = 0 in d", "Stack overflow")
  ]
ArithVMSpec.hs

In each test, we parse and compile an expression, and then disassemble the compiled bytes, which we match against the expected list of opcodes, or the expected error message.

Let’s put these tests with the parser tests, and run them:

main :: IO ()
main = hspec $ do
  parserSpec
  astInterpreterSpec
  compilerSpec
ArithVMSpec.hs
Output of the test run
$ cabal test -O2
Running 1 test suites...
Test suite specs: RUNNING...

Parser
  parses: "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1" [✔]
  parses: "1+2-3*4+5/6/0+1" [✔]
  parses: "1 + -1" [✔]
  parses: "let x = 4 in x + 1" [✔]
  parses: "let x=4in x+1" [✔]
  parses: "let x = 4 in let y = 5 in x + y" [✔]
  parses: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
  parses: "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z" [✔]
  parses: "let x=4in 2+let y=x-5in x+let z=y+1in z/2" [✔]
  parses: "let x = (let y = 3 in y + y) in x * 3" [✔]
  parses: "let x = let y = 3 in y + y in x * 3" [✔]
  parses: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
  fails for: "" [✔]
  fails for: "1 +" [✔]
  fails for: "1 & 1" [✔]
  fails for: "1 + 1 & 1" [✔]
  fails for: "1 & 1 + 1" [✔]
  fails for: "(" [✔]
  fails for: "(1" [✔]
  fails for: "(1 + " [✔]
  fails for: "(1 + 2" [✔]
  fails for: "(1 + 2}" [✔]
  fails for: "66666" [✔]
  fails for: "-x" [✔]
  fails for: "let 1" [✔]
  fails for: "let x = 1 in " [✔]
  fails for: "let let = 1 in 1" [✔]
  fails for: "let x = 1 in in" [✔]
  fails for: "let x=1 inx" [✔]
  fails for: "letx = 1 in x" [✔]
  fails for: "let x ~ 1 in x" [✔]
  fails for: "let x = 1 & 2 in x" [✔]
  fails for: "let x = 1 inx" [✔]
  fails for: "let x = 1 in x +" [✔]
  fails for: "let x = 1 in x in" [✔]
  fails for: "let x = let x = 1 in x" [✔]
AST interpreter
  interprets: "1" [✔]
  interprets: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
  interprets: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
  interprets: "1 + -1" [✔]
  interprets: "1 * -1" [✔]
  interprets: "let x = 4 in x + 1" [✔]
  interprets: "let x = 4 in let x = x + 1 in x + 2" [✔]
  interprets: "let x = 4 in let y = 5 in x + y" [✔]
  interprets: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
  interprets: "let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z" [✔]
  interprets: "let x = let y = 3 in y + y in x * 3" [✔]
  interprets: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
  fails for: "x" [✔]
  fails for: "let x = 4 in y + 1" [✔]
  fails for: "let x = y + 1 in x" [✔]
  fails for: "let x = x + 1 in x" [✔]
  fails for: "1/0" [✔]
  fails for: "-32768 / -1" [✔]
Compiler
  compiles: "1" [✔]
  compiles: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
  compiles: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
  compiles: "let x = 4 in x + 1" [✔]
  compiles: "let x = 4 in let y = 5 in x + y" [✔]
  compiles: "let x = 4 in let x = x + 1 in x + 2" [✔]
  compiles: "let x = let y = 3 in y + y in x * 3" [✔]
  compiles: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
  compiles: "1/0" [✔]
  compiles: "-32768 / -1" [✔]
  fails for: "x" [✔]
  fails for: "let x = 4 in y + 1" [✔]
  fails for: "let x = y + 1 in x" [✔]
  fails for: "let x = x + 1 in x" [✔]
  fails for: "let x = 4 in let y = 1 in let z = 2 in y + x" [✔]
  fails for: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
  fails for: "let a = 0 in let b = 0 in let c = 0 in let d = 0 in d" [✔]
  fails for greater sized expr [✔]
  fails for lesser sized expr [✔]

Finished in 0.0147 seconds
73 examples, 0 failures
Test suite specs: PASS

Awesome, it works! That’s it for this post.

In the next part, we write a virtual machine that runs our compiled bytecode, and do some benchmarking.


  1. There are VMs that execute hardware ISs instead of bytecode. Such VMs are also called Emulators because they emulate actual CPU hardware. Some examples are QEMU and video game console emulators.↩︎

  2. VMs use virtual registers instead of actual CPU registers, which are often represented as a fixed size array of 1, 2, 4 or 8 byte elements.↩︎

  3. I call them variables here but they do not actually vary. A better name is let bindings.↩︎

  4. We could have used two separate opcodes here: OSwap and OPop. That would produce the same final result when evaluating an expression, but we’d have to execute two instructions instead of one for Let expressions. Using a single OSwapPop instruction speeds up the execution, not only because we reduce the number of instructions, but also because we don’t need to do a full swap; a half swap is enough since we pop the stack anyway after the swap. This also shows how we can improve the performance of our VMs by inventing specific opcodes for particular operations.↩︎

  5. Note the use of strict Pairs here, for performance reasons.↩︎

If you liked this post, please leave a comment.

by Abhinav Sarkar (abhinav@abhinavsarkar.net) at August 24, 2025 12:00 AM

August 23, 2025

Manuel M T Chakravarty

Functional data structures in Swift

One of the intriguing features of Swift is its distinction between value types and reference types. Conceptually, value types are always copied in assignments and passed-by-value in function calls — i.e., they are semantically immutable. In contrast, for reference types, Swift only copies a pointer to an object on an assignment, and objects are passed-by-reference to functions. If such an object gets mutated, it changes for all references. While most languages feature both value and reference types, Swift is unique in that (1) it makes it easy to define and use both flavours of types and (2) it supports fine-grained mutability control.

For large values, such as arrays, frequent copying carries a significant performance penalty. Hence, the Swift compiler goes to great lengths to avoid copying whenever it is safe. For large values, this effectively boils down to a copy-on-write strategy, where a large value is only copied when it actually is being mutated (on one code path). Swift also makes it possible for user-defined value types to adopt this copy-on-write strategy.

In this talk, I will explain the semantic difference between value and reference types, and I will illustrate how this facilitates safe and robust coding practices in Swift. Moreover, I will explain how the copy-on-write strategy for large values works and how it interacts with Swift’s memory management system. Finally, I will demonstrate how you can define your own copy-on-write large value types.

August 23, 2025 04:19 PM

August 22, 2025

Derek Elkins

Arithmetic Functions

Introduction

I want to talk about one of the many pretty areas of number theory. This involves the notion of an arithmetic function and related concepts. A few relatively simple concepts will allow us to produce a variety of useful functions and theorems. This provides only a glimpse of the start of the field of analytic number theory, though many of these techniques are used in other places as we’ll also start to see.

(See the end for a summary of identities and results.)

Prelude

As some notation, I’ll write |\mathbb N_+| for the set of positive naturals, and |\mathbb P| for the set of primes. |\mathbb N| will contain |0|. Slightly atypically, I’ll write |[n]| for the set of numbers from |1| to |n| inclusive, i.e. |a \in [n]| if and only if |1 \leq a \leq n|.

I find that the easiest way to see results in number theory is to view a positive natural number as a multiset of primes which is uniquely given by factorization. Coprime numbers are ones where these multisets are disjoint. Multiplication unions the multisets. The greatest common divisor is multiset intersection. |n| divides |m| if and only if |n| corresponds to a sub-multiset of |m|, in which case |m/n| corresponds to the multiset difference. The multiplicity of an element of a multiset is the number of occurrences. For a multiset |P|, |\mathrm{dom}(P)| is the set of elements of the multiset |P|, i.e. those with multiplicity greater than |0|. For a finite multiset |P|, |\vert P\vert| will be the sum of the multiplicities of the distinct elements, i.e. the number of elements (with duplicates) in the multiset.

We can represent a multiset of primes as a function |\mathbb P \to \mathbb N| which maps an element to its multiplicity. A finite multiset would then be such a function that is |0| at all but finitely many primes. Alternatively, we can represent the multiset as a partial function |\mathbb P \rightharpoonup \mathbb N_+|. It will be finite when it is defined for only finitely many primes. Equivalently, when it is a finite subset of |\mathbb P\times\mathbb N_+| (which is also a functional relation).

Unique factorization provides a bijection between finite multisets of primes and positive natural numbers. Given a finite multiset |P|, the corresponding positive natural number is |n_P = \prod_{(p, k) \in P} p^k|.

I will refer to this view often in the following.

Arithmetic Functions

An arithmetic function is just a function defined on the positive naturals. Usually, they’ll land in (not necessarily positive) natural numbers, but that isn’t required.

In most cases, we’ll be interested in the specific subclass of multiplicative arithmetic functions. An arithmetic function, |f|, is multiplicative if |f(1) = 1| and |f(ab) = f(a)f(b)| whenever |a| and |b| are coprime. We also have the notion of a completely multiplicative arithmetic function for which |f(ab) = f(a)f(b)| always. Obviously, completely multiplicative functions are multiplicative. Analogously, we also have a notion of (completely) additive where |f(ab) = f(a) + f(b)|. Warning: In other mathematical contexts, “additive” means |f(a+b)=f(a)+f(b)|. An obvious example of a completely additive function is the logarithm. Exponentiating an additive function produces a multiplicative function.

For an additive function, |f|, we automatically get |f(1) = 0| since |f(1) = f(1\cdot 1) = f(1) + f(1)|.

Lemma: The product of two multiplicative functions |f| and |g| is multiplicative.
Proof: For |a| and |b| coprime, |f(ab)g(ab) = f(a)f(b)g(a)g(b) = f(a)g(a)f(b)g(b)|. |\square|

A parallel statement holds for completely multiplicative functions.

It’s also clear that a completely multiplicative function is entirely determined by its action on prime numbers. Since |p^n| is coprime to |q^n| whenever |p| and |q| are coprime, we see that a multiplicative function is entirely determined by its action on powers of primes. To this end, I’ll often define multiplicative/additive functions by their action on prime powers and completely multiplicative/additive functions by their action on primes.

Multiplicative functions aren’t closed under composition, but we do have that if |f| is completely multiplicative and |g| is multiplicative, then |f \circ g| is multiplicative when that composite makes sense.

Here are some examples. Not all of these will be used in the sequel.

  • The power function |({-})^z| for any |z|, not necessarily an integer, is completely multiplicative.
  • Choosing |z=0| in the previous, we see the constantly one function |\bar 1(n) = 1| is completely multiplicative.
  • The identity function is clearly completely multiplicative and is also the |z=1| case of the above.
  • The Kronecker delta function |\delta(n) = \begin{cases}1, & n = 1 \\ 0, & n \neq 1\end{cases}| is completely multiplicative. Often written |\varepsilon| in this context.
  • Define a multiplicative function via |\mu(p^n) = \begin{cases} -1, & n = 1 \\ 0, & n > 1\end{cases}| where |p| is prime. This is the Möbius function. More holistically, |\mu(n)| is |0| if |n| has any square factors, otherwise |\mu(n) = (-1)^k| where |k| is the number of (distinct) prime factors.
  • Define a completely multiplicative function via |\lambda(p) = -1|. |\lambda(n) = \pm 1| depending on whether there is an even or odd number of prime factors (including duplicates). This function is known as the Liouville function.
  • |\lambda(n) = (-1)^{\Omega(n)}| where |\Omega(n)| is the completely additive function which counts the number of prime factors of |n| including duplicates. |\Omega(n_P) = \vert P\vert|.
  • Define a multiplicative function via |\gamma(p^n) = -1|. |\gamma(n) = \pm 1| depending on whether there is an even or odd number of distinct prime factors.
  • |\gamma(n) = (-1)^{\omega(n)}| where |\omega(n)| is the additive function which counts the number of distinct prime factors of |n|. See Prime omega function. We also see that |\omega(n_P) = \vert\mathrm{dom}(P)\vert|.
  • The completely additive function, for |q\in\mathbb P|, |\nu_q(p) = \begin{cases}1,&p=q\\0,&p\neq q\end{cases}| is the |q|-adic valuation.
  • It follows that the |p|-adic absolute value |\vert r\vert_p = p^{-\nu_p(r)}| is completely multiplicative. It can be characterized on naturals by |\vert p\vert_q = \begin{cases}p^{-1},&p=q\\1,&p\neq q\end{cases}|.
  • |\gcd({-}, k)| for a fixed |k| is multiplicative. Given any multiplicative function |f|, |f \circ \gcd({-},k)| is multiplicative. This essentially “restricts” |f| to only see the prime powers that divide |k|. Viewing the finite multiset of primes |P| as a function |\mathbb P\to\mathbb N|, |f(\gcd(p^n,n_P)) = \begin{cases}f(p^n),&n\leq P(p)\\f(p^{P(p)}),&n>P(p)\end{cases}|.
  • The multiplicative function characterized by |a(p^n) = p(n)| where |p(n)| is the partition function counts the number of abelian groups of the given order. That this function is multiplicative is a consequence of the fundamental theorem of finite abelian groups.
  • The Jacobi symbol |\left(\frac{a}{n}\right)| where |a\in\mathbb Z| and |n| is an odd positive integer is a completely multiplicative function with either |a| or |n| fixed. When |n| is an odd prime, it reduces to the Legendre symbol. For |p| an odd prime, we have |(\frac{a}{p}) = a^{\frac{p-1}{2}} \pmod p|. This will always be in |\{-1, 0, 1\}| and can be alternately defined as |\left(\frac{a}{p}\right) = \begin{cases}0,&p\mid a\\1,&p\nmid a\text{ and }\exists x.x^2\equiv a\pmod p\\-1,&\not\exists x.x^2\equiv a\pmod p\end{cases}|. Therefore, |\left(\frac{a}{p}\right)=1| (|=0|) when |a| is a (trivial) quadratic residue mod |p|.
  • An interesting example which is not multiplicative nor additive is the arithmetic derivative. Let |p\in\mathbb P|. Define |\frac{\partial}{\partial p}(n)| via |\frac{\partial}{\partial p}(p) = 1|, |\frac{\partial}{\partial p}(q) = 0| for |q\neq p| and |q\in\mathbb P|, and |\frac{\partial}{\partial p}(nm) = \frac{\partial}{\partial p}(n)m + n\frac{\partial}{\partial p}(m)|. We then have |D_S = \sum_{p\in S}\frac{\partial}{\partial p}| for non-empty |S\subseteq\mathbb P| which satisfies the same product rule identity. This perspective views a natural number (or, more generally, a rational number) as a monomial in infinitely many variables labeled by prime numbers.
  • A Dirichlet character of modulus |m| is, by definition, a completely multiplicative function |\chi| satisfying |\chi(n + m) = \chi(n)| and |\chi(n)| is non-zero if and only if |n| is coprime to |m|. The Jacobi symbol |\left(\frac{({-})}{m}\right)| is a Dirichlet character of modulus |m|. |\bar 1| is the Dirichlet character of modulus |1|.

Dirichlet Series

Given an arithmetic function |f|, we define the Dirichlet series:

\[\mathcal D[f](s) = \sum_{n=1}^\infty \frac{f(n)}{n^s} = \sum_{n=1}^\infty f(n)n^{-s}\]

When |f| is a Dirichlet character, |\chi|, this is referred to as the (Dirichlet) |L|-series of the character, and the analytic continuation is the (Dirichlet) |L|-function and is written |L(s, \chi)|.

We’ll not focus much on when such a series converges. See this section of the above Wikipedia article for more details. Alternatively, we could talk about formal Dirichlet series. We can clearly see that if |s = 0|, then we get the sum |\sum_{n=1}^\infty f(n)| which clearly won’t converge for, say, |f = \bar 1|. We can say that if |f| is asymptotically bounded by |n^k| for some |k|, i.e. |f \in O(n^k)|, then the series will converge absolutely when the real part of |s| is greater than |k+1|. For |\bar 1|, it follows that |\mathcal D[\bar 1](x + iy)| is defined when |x > 1|. We can use analytic continuation to go beyond these limits.

See A Catalog of Interesting Dirichlet Series for a more reference-like listing. Beware differences in notation.

Dirichlet Convolution

Why is this interesting in this context? Let’s consider two arithmetic functions |f| and |g| and multiply their corresponding Dirichlet series. We’ll get:

\[\mathcal D[f](s)\mathcal D[g](s) = \sum_{n=1}^\infty h(n)n^{-s} = \mathcal D[h](s)\]

where now we need to figure out what |h(n)| is. But |h(n)| is going to be the sum of all the terms of the form |f(a)a^{-s}g(b)b^{-s} = f(a)g(b)(ab)^{-s}| where |ab = n|. We can thus write: \[h(n) = \sum_{ab=n} f(a)g(b) = \sum_{d\mid n} f(d)g(n/d)\] We’ll write this more compactly as |h = f \star g| which we’ll call Dirichlet convolution. We have thus shown a convolution theorem of the form \[\mathcal D[f]\mathcal D[g] = \mathcal D[f \star g]\]

The Kronecker delta serves as a unit to this operation which is reflected by |\mathcal D[\delta](s) = 1|.

Since we will primarily be interested in multiplicative functions, we should check that |f \star g| is a multiplicative function when |f| and |g| are.

Lemma: Assume |a| and |b| are coprime, and |f| and |g| are multiplicative. Then |(f \star g)(ab) = (f \star g)(a)(f \star g)(b)|.

Proof: Since |a| and |b| are coprime, they share no divisors besides |1|. This means every |d| such that |d \mid ab| factors as |d = d_a d_b| where |d_a \mid a| and |d_b \mid b|. More strongly, write |D_n = \{ d \in \mathbb N_+ \mid d \mid n\}|, then for any coprime pair of numbers |i| and |j|, we have |D_{ij} \cong D_i \times D_j| and that every pair |(d_i, d_j) \in D_i \times D_j| are coprime1. Thus,

\[\begin{flalign} (f \star g)(ab) & = \sum_{d \in D_{ab}} f(d)g((ab)/d) \tag{by definition} \\ & = \sum_{(d_a, d_b) \in D_a \times D_b} f(d_a d_b)g((ab)/(d_a d_b)) \tag{via the bijection} \\ & = \sum_{(d_a, d_b) \in D_a \times D_b} f(d_a)f(d_b)g(a/d_a)g(b/d_b) \tag{f and g are multiplicative} \\ & = \sum_{d_a \in D_a} \sum_{d_b \in D_b} f(d_a)f(d_b)g(a/d_a)g(b/d_b) \tag{sum over a Cartesian product} \\ & = \sum_{d_a \in D_a} f(d_a)g(a/d_a) \sum_{d_b \in D_b} f(d_b)g(b/d_b) \tag{undistributing} \\ & = \sum_{d_a \in D_a} f(d_a)g(a/d_a) (f \star g)(b) \tag{by definition} \\ & = (f \star g)(b) \sum_{d_a \in D_a} f(d_a)g(a/d_a) \tag{undistributing} \\ & = (f \star g)(b) (f \star g)(a) \tag{by definition} \\ & = (f \star g)(a) (f \star g)(b) \tag{commutativity of multiplication} \end{flalign}\] |\square|

It is not the case that the Dirichlet convolution of two completely multiplicative functions is completely multiplicative.

We can already start to do some interesting things with this. First, we see that |\mathcal D[\bar 1] = \zeta|, the Riemann zeta function. Now consider |(\bar 1 \star \bar 1)(n) = \sum_{k \mid n} 1 = d(n)|. |d(n)| is the divisor function which counts the number of divisors of |n|. We see that |\mathcal D[d](s) = \zeta(s)^2|. A simple but useful fact is |\zeta(s - z) = \mathcal D[({-})^z](s)|. This directly generalizes the result for |\mathcal D[\bar 1]| and also implies |\mathcal D[\operatorname{id}](s) = \zeta(s - 1)|.

Generalizing in a different way, we get the family of functions |\sigma_k = ({-})^k \star \bar 1|. |\sigma_k(n) = \sum_{d \mid n} d^k|. From the above, we see |\mathcal D[\sigma_k](s) = \zeta(s - k)\zeta(s)|.
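For concreteness, evaluating these at |n = 12|, whose divisors are |1, 2, 3, 4, 6, 12|: \[ d(12) = \sum_{d \mid 12} 1 = 6, \qquad \sigma_1(12) = 1 + 2 + 3 + 4 + 6 + 12 = 28 \]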

Lemma: Given a completely multiplicative function |f|, we get |f(n)(g \star h)(n) = (fg \star fh)(n)|.
Proof: \[\begin{flalign} (fg \star fh)(n) & = \sum_{d \mid n} f(d)g(d)f(n/d)h(n/d) \\ & = \sum_{d \mid n} f(d)f(n/d)g(d)h(n/d) \\ & = \sum_{d \mid n} f(n)g(d)h(n/d) \\ & = f(n)\sum_{d \mid n} g(d)h(n/d) \\ & = f(n)(g \star h)(n) \end{flalign}\] |\square|

As a simple corollary, for a completely multiplicative |f|, |f \star f = f(\bar 1 \star \bar 1) = fd|.

Euler Product Formula

However, the true power of this is unlocked by the following theorem:

Theorem (Euler product formula): Given a multiplicative function |f| which doesn’t grow too fast, e.g. is |O(n^k)| for some |k > 0|, \[\mathcal D[f](s) = \sum_{n=1}^\infty f(n)n^{-s} = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f(p^n)p^{-ns} = \prod_{p \in \mathbb P}\left(1 + \sum_{n=1}^\infty f(p^n)p^{-ns}\right) \] where the series converges.

Proof: The last equality is simply using the fact that |f(p^0)p^0 = f(1) = 1| because |f| is multiplicative. The idea for the main part is similar to how we derived Dirichlet convolution. When we start to distribute out the infinite product, each term will correspond to the product of selections of a term from each series. When all but finitely many of those selections select the |1| term, we get |\prod_{(p, k) \in P}f(p^k)(p^k)^{-s}| where |P| is some finite multiset of primes induced by those selections. Therefore, |\prod_{(p, k) \in P}f(p^k)(p^k)^{-s} = f(n_P)n_P^{-s}|. Thus, by unique factorization, |f(n)n^{-s}| for every positive natural occurs in the sum produced by distributing the right-hand side exactly once.

In the case where |P| is not a finite multiset, we’ll have \[ \frac{\prod_{(p, k) \in P}f(p^k)}{\left(\prod_{(p, k) \in P}p^k\right)^s}\]

The denominator of this expression goes to infinity when the real part of |s| is greater than |0|. As long as the numerator doesn’t grow faster than the denominator (perhaps after restricting the real part of |s| to be greater than some bound), this product goes to |0|. Therefore, the only terms that remain are those corresponding to the Dirichlet series on the left-hand side. |\square|

If we assume |f| is completely multiplicative, we can further simplify Euler’s product formula via the usual sum of a geometric series, |\sum_{n=0}^\infty x^n = (1-x)^{-1}|, to:

\[ \sum_{n=1}^\infty f(n)n^{-s} = \prod_{p \in \mathbb P}\sum_{n=0}^\infty (f(p)p^{-s})^n = \prod_{p \in \mathbb P}(1 - f(p)p^{-s})^{-1} \]

Now let’s put this to work. The first thing we can see is |\zeta(s) = \mathcal D[\bar 1](s) = \prod_{p\in\mathbb P}(1 - p^{-s})^{-1}|. But this lets us write |1/\zeta(s) = \prod_{p\in\mathbb P}(1 - p^{-s})|. If we look for a multiplicative function that would produce the right-hand side, we see that it must send a prime |p| to |-1| and |p^n| for |n > 1| to |0|. In other words, it’s the Möbius function |\mu| we defined before. So |\mathcal D[\mu](s) = 1/\zeta(s)|.

Using |\mathcal D[d](s) = \zeta(s)^2| and the series |(1-x)^{-2} = \sum_{n=0}^\infty (n+1)x^n|, we see that \[\begin{flalign} \zeta(s)^2 & = \prod_{p\in\mathbb P}\left(1 - p^{-s}\right)^{-2} \\ & = \prod_{p\in\mathbb P}\left(\sum_{n=0}^\infty (n+1)p^{-ns}\right) \\ & = \prod_{p\in\mathbb P}\left(\sum_{n=0}^\infty d(p^n)p^{-ns}\right) \\ & = \mathcal D[d](s) \end{flalign}\] Therefore, |d(p^n) = n + 1|. This intuitively makes sense because the only divisors of |p^n| are |p^k| for |k = 0, \dots, n|, and for |a| and |b| coprime |d(ab) = \vert D_{ab} \vert = \vert D_a \times D_b\vert = \vert D_a\vert\vert D_b\vert = d(a)d(b)|.

Another result leveraging the theorem is given any multiplicative function |f|, we can define a new multiplicative function via |f^{[k]}(p^n) = \begin{cases}f(p^m), & km = n\textrm{ for }m\in\mathbb N \\ 0, & k \nmid n\end{cases}|.

Lemma: The operation just defined has the property that |\mathcal D[f^{[k]}](s) = \mathcal D[f](ks)|.
Proof: \[\begin{flalign} \mathcal D[f^{[k]}](s) & = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f^{[k]}(p^n)p^{-ns} \\ & = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f^{[k]}(p^{kn})p^{-nks} \\ & = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f(p^n)p^{-nks} \\ & = \mathcal D[f](ks) \end{flalign}\] |\square|

Möbius Inversion

We can write a sum over some function, |f|, of the divisors of a given natural |n| as |(f \star \bar 1)(n) = \sum_{d \mid n} f(d)|. Call this |g(n)|. But then we have |\mathcal D[f \star \bar 1] = \mathcal D[f]\mathcal D[\bar 1] = \mathcal D[f]\zeta| and thus |\mathcal D[f] = \mathcal D[f]\zeta/\zeta = \mathcal D[(f \star \bar 1) \star \mu]|. Therefore, if we only have the sums |g(n) = \sum_{d \mid n} f(d)| for some unknown |f|, we can recover |f| via |f(n) = (g \star \mu)(n) = \sum_{d\mid n}g(d)\mu(n/d)|. This is Möbius inversion.

As a simple example, we clearly have |\zeta(s)/\zeta(s) = 1 = \mathcal D[\delta](s)| so |\bar 1 \star \mu = \delta| or |\sum_{d \mid n}\mu(d) = 0| for |n > 1| and |1| when |n = 1|.
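Checking this at |n = 12|: \[\sum_{d\mid 12}\mu(d) = \mu(1)+\mu(2)+\mu(3)+\mu(4)+\mu(6)+\mu(12) = 1 - 1 - 1 + 0 + 1 + 0 = 0\]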

We also get generalized Möbius inversion via |\delta(n) = \delta(n)n^k = (\mu\star\bar 1)(n)n^k = (({-})^k\mu\star({-})^k)(n)|. Which is to say if |g(n) = \sum_{d\mid n}d^k f(n/d)| then |f(n) = \sum_{d\mid n} \mu(d)d^kg(n/d)|.

By considering logarithms, we also get a multiplicative form of (generalized) Möbius inversion: \[g(n) = \prod_{d\mid n}f(n/d)^{d^k} \iff f(n) = \prod_{d\mid n}g(n/d)^{\mu(d)d^k}\]

Theorem: As another guise of Möbius inversion, given any completely multiplicative function |h|, let |g(m) = \sum_{n=1}^\infty f(mh(n))|. Assuming these sums make sense, we can recover |f(k)| via |f(k) = \sum_{m=1}^\infty \mu(m)g(kh(m))|.

Proof: \[\begin{align} \sum_{m=1}^\infty \mu(m)g(kh(m)) & = \sum_{m=1}^\infty \mu(m)\sum_{n=1}^\infty f(kh(m)h(n)) \\ & = \sum_{N=1}^\infty \sum_{N=mn} \mu(m)f(kh(N)) \\ & = \sum_{N=1}^\infty f(kh(N)) \sum_{N=nm} \mu(m) \\ & = \sum_{N=1}^\infty f(kh(N)) (\mu\star\bar 1)(N) \\ & = \sum_{N=1}^\infty f(kh(N)) \delta(N) \\ & = f(k) \end{align}\] |\square|

This will often show up in the form of |r(x^{1/n})| or |r(x^{1/n})/n|, i.e. with |h(n)=n^{-1}| and |f_x(k) = r(x^k)| or |f_x(k) = kr(x^k)|. Typically, we’ll then be computing |f_x(1) = r(x)|.

Lambert Series

As a brief aside, it’s worth mentioning Lambert Series.

Given an arithmetic function |a|, these are series of the form: \[ \sum_{n=1}^\infty a(n) \frac{x^n}{1-x^n} = \sum_{n=1}^\infty a(n) \sum_{k=1}^\infty x^{kn} = \sum_{n=1}^\infty (a \star \bar 1)(n) x^n \]

This leads to: \[\sum_{n=1}^\infty \mu(n) \frac{x^n}{1-x^n} = x\] and: \[\sum_{n=1}^\infty \varphi(n) \frac{x^n}{1-x^n} = \frac{x}{(1-x)^2}\]

Inclusion-Exclusion

The Möbius and |\zeta| functions can be generalized to incidence algebras where this form is from the incidence algebra induced by the divisibility order2. A notable and relevant example of a Möbius function for another, closely related, incidence algebra is when we consider the incidence algebra induced by finite multisets with the inclusion ordering. Let |T| be a finite multiset, we get |\mu(T) = \begin{cases}0,&T\text{ has repeated elements}\\(-1)^{\vert T\vert},&T\text{ is a set}\end{cases}|. Since we can view a natural number as a finite multiset of primes, and we can always relabel the elements of a finite multiset with distinct primes, this is equivalent to the Möbius function we’ve been using.

This leads to a nice and compact way of describing the principle of inclusion-exclusion. Let |A| and |S| be (finite) multisets with |S \subseteq A| and assume we have |f| and |g| defined on the set of sub-multisets of |A|. If \[g(A) = \sum_{S\subseteq A} f(S)\] then \[f(A) = \sum_{S\subseteq A}\mu(A\setminus S)g(S)\] and this is Möbius inversion for this notion of Möbius function. We can thus take a different perspective on Möbius inversion. If |P| is a finite multiset of primes, then \[g(n_P) = \sum_{Q\subseteq P}f(n_Q) \iff f(n_P) = \sum_{Q\subseteq P}\mu(P\setminus Q)g(n_Q)\] recalling that |Q\subseteq P \iff n_Q \mid n_P| and |n_{P\setminus Q} = n_P/n_Q| when |Q\subseteq P|.

We get traditional inclusion-exclusion by noting that |\mu(T)=(-1)^{\vert T\vert}| when |T| is a set, i.e. all elements have multiplicity at most |1|. Let |I| be a finite set and assume we have a family of finite sets, |\{T_i\}_{i\in I}|. Write |T = \bigcup_{i\in I}T_i| and define |\bigcap_{i\in\varnothing}T_i = T|.

Define \[f(J) = \left\vert\bigcap_{i\in I\setminus J}T_i\setminus\bigcup_{i \in J}T_i\right\vert\] for |J\subseteq I|. In particular, |f(I) = 0|. |f(J)| is then the number of elements shared by all |T_i| for |i\notin J| and no |T_j| for |j\in J|. Every |x \in \bigcup_{i\in I}T_i| is thus associated to exactly one such subset of |I|, namely |\{j\in I\mid x\notin T_j\}|. Formally, |x \in \bigcap_{i\in I\setminus J}T_i\setminus\bigcup_{i \in J}T_i \iff J = \{j\in I\mid x\notin T_j\}| so each |\bigcap_{i\in I\setminus J}T_i\setminus\bigcup_{i \in J}T_i| is disjoint and \[g(J) = \sum_{S\subseteq J}f(S) = \left\vert\bigcup_{S\subseteq J}\left(\bigcap_{i\in I\setminus S}T_i\setminus\bigcup_{i \in S}T_i\right)\right\vert = \left\vert\bigcap_{i\in I\setminus J}T_i\right\vert \] for |J \subseteq I|. In particular, |g(I) = \vert\bigcup_{i\in I}T_i\vert|.

By the Möbius inversion formula for finite sets, we thus have: \[f(J) = \sum_{S\subseteq J}(-1)^{\vert J\vert - \vert S\vert}g(S)\] which for |J = I| gives: \[ 0 = \sum_{J\subseteq I}(-1)^{\vert I\vert - \vert J\vert}\left\vert\bigcap_{i\in I\setminus J}T_i\right\vert = \left\vert\bigcup_{i\in I}T_i\right\vert + \sum_{J\subsetneq I}(-1)^{\vert I\vert - \vert J\vert}\left\vert\bigcap_{i\in I\setminus J}T_i\right\vert \] which is equivalent to the more usual form: \[\left\vert\bigcup_{i\in I}T_i\right\vert = \sum_{J\subsetneq I}(-1)^{\vert I\vert - \vert J\vert - 1}\left\vert\bigcap_{i\in I\setminus J}T_i\right\vert = \sum_{\varnothing\neq J\subseteq I}(-1)^{\vert J\vert + 1}\left\vert\bigcap_{i\in J}T_i\right\vert \]

|\varphi|

An obvious thing to explore is to apply Möbius inversion to various arithmetic functions. A fairly natural first start is applying Möbius inversion to the identity function. From the above results, we know that this unknown function |\varphi| will satisfy |\mathcal D[\varphi](s) = \zeta(s-1)/\zeta(s) = \mathcal D[\operatorname{id}\star\mu](s)|. We also immediately have the property that |n = \sum_{d \mid n}\varphi(d)|. Using Euler’s product formula we have: \[\begin{flalign} \zeta(s-1)/\zeta(s) & = \prod_{p \in \mathbb P} \frac{1 - p^{-s}}{1 - p^{-s+1}} \\ & = \prod_{p \in \mathbb P} \frac{1 - p^{-s}}{1 - pp^{-s}} \\ & = \prod_{p \in \mathbb P} (1 - p^{-s})\sum_{n=0}^\infty p^n p^{-ns} \\ & = \prod_{p \in \mathbb P} \left(\sum_{n=0}^\infty p^n p^{-ns}\right) - \left(\sum_{n=0}^\infty p^n p^{-s} p^{-ns}\right) \\ & = \prod_{p \in \mathbb P} \left(\sum_{n=0}^\infty p^n p^{-ns}\right) - \left(\sum_{n=0}^\infty p^n p^{-(n + 1)s}\right) \\ & = \prod_{p \in \mathbb P} \left(1 + \sum_{n=1}^\infty p^n p^{-ns}\right) - \left(\sum_{n=1}^\infty p^{n-1} p^{-ns}\right) \\ & = \prod_{p \in \mathbb P} \left(1 + \sum_{n=1}^\infty (p^n - p^{n-1}) p^{-ns}\right) \\ & = \prod_{p \in \mathbb P} \left(1 + \sum_{n=1}^\infty \varphi(p^n) p^{-ns}\right) \\ & = \mathcal D[\varphi](s) \end{flalign}\]

So |\varphi| is the multiplicative function defined by |\varphi(p^n) = p^n - p^{n-1}|. For |p^n|, we can see that this counts the number of positive integers less than or equal to |p^n| which are coprime to |p^n|. There are |p^n| positive integers less than or equal to |p^n|, and every |p|th one is a multiple of |p|, so |p^n/p = p^{n-1}| of them are not coprime to |p^n|. All the remainder are coprime to |p^n| since they don’t have |p| in their prime factorizations and |p^n| only has |p| in its. We need to verify that this interpretation is multiplicative. To be clear, we know that |\varphi| is multiplicative and that this interpretation works for |p^n|. The question is whether |\varphi(n)| for general |n| meets the above description, i.e. whether the count of numbers up to |n| that are coprime to |n| is multiplicative.

Theorem: The number of positive integers less than or equal to |n| that are coprime to |n| is a multiplicative function of |n| and is equal to |\varphi(n)|.

Proof: |\varphi = \mu\star\operatorname{id}|. We have:

\[\begin{flalign} \varphi(n_P) & = \sum_{d\mid n_P}\mu(d)\frac{n_P}{d} \\ & = \sum_{Q\subseteq P}\mu(Q)\frac{n_P}{n_Q} \\ & = \sum_{Q\subseteq \mathrm{dom}(P)}(-1)^{\vert Q\vert}\frac{n_P}{n_Q} \end{flalign}\]

We can see an inclusion-exclusion pattern. Specifically, let |C_k = \{ c \in [k] \mid \gcd(c, k) = 1\}| be the numbers less than or equal to |k| and coprime to |k|. Let |S_{k,m} = \{ c \in [k] \mid m \mid c\}|. We have |S_{k,a} \cap S_{k,b} = S_{k,\operatorname{lcm}(a,b)}|. Also, when |c \mid k|, then |\vert S_{k,c}\vert = k/c|. |C_{n_P} = [n_P] \setminus \bigcup_{p \in \mathrm{dom}(P)} S_{n_P,p}| because every number not coprime to |n_P| shares some prime factor with it. Applying inclusion-exclusion to the union yields \[\begin{align} \vert C_{n_P}\vert & = n_P - \sum_{\varnothing\neq Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert+1}\left\vert \bigcap_{p\in Q}S_{n_P,p}\right\vert \\ & = n_P + \sum_{\varnothing\neq Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert}\frac{n_P}{\prod_{p\in Q}p} \\ & = \sum_{Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert}\frac{n_P}{n_Q} \end{align}\] |\square|

Many of you will already have recognized that this is Euler’s totient function.
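For example: \[\varphi(12) = \varphi(2^2)\varphi(3) = (2^2 - 2)(3 - 1) = 4\] matching the four numbers |1, 5, 7, 11| in |[12]| that are coprime to |12|.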

Combinatorial Species

The book Combinatorial Species and Tree-Like Structures has many examples where Dirichlet convolutions and Möbius inversion come up. A combinatorial species is a functor |\operatorname{Core}(\mathbf{FinSet})\to\mathbf{FinSet}|. Any permutation on a finite set can be decomposed into a collection of cyclic permutations. Let |U| be a finite set of cardinality |n| and |\pi : U \cong U| a permutation of |U|. For any |u\in U|, there is a smallest |k\in\mathbb N_+| such that |\pi^k(u) = u| where |\pi^{k+1} = \pi \circ \pi^k| and |\pi^0 = \operatorname{id}|. The |k| elements |\mathcal O(u)=\{\pi^{i-1}(u)\mid i\in[k]\}| make up a cycle of length |k|, and |\pi| restricted to |U\setminus \mathcal O(u)| is a permutation on this smaller set. We can just inductively pull out another cycle until we run out of elements. Write |\pi_k| for the number of cycles of length |k| in the permutation |\pi|. We clearly have |n = \sum_{k=1}^\infty k\pi_k| as every cycle has |k| elements in it.

Write |\operatorname{fix}\pi| for the number of fixed points of |\pi|, i.e. the cardinality of the set |\{u\in U\mid \pi(u) = u\}|. Clearly, every element that is fixed by |\pi^k| needs to be in a cycle whose length divides |k|. This leads to the equation:

\[ \operatorname{fix}\pi^k = \sum_{d\mid k} d\pi_d = ((d \mapsto d\pi_d) \star \bar 1)(k)\]
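As a quick sanity check of this identity, here is a Haskell sketch with a toy permutation (all helper names are mine, everything brute force):

perm :: [Int]
perm = [1, 2, 0, 4, 3, 5]            -- cycles: (0 1 2)(3 4)(5)

apply :: Int -> Int -> Int           -- apply perm^k to an element
apply k x = iterate (perm !!) x !! k

fixes :: Int -> Int                  -- fix(perm^k)
fixes k = length [x | x <- [0 .. 5], apply k x == x]

period :: Int -> Int                 -- length of the cycle containing x
period x = head [k | k <- [1 ..], apply k x == x]

piCount :: Int -> Int                -- pi_d: number of cycles of length d
piCount d = length [x | x <- [0 .. 5], period x == d] `div` d

checkFix :: Bool
checkFix = and [ fixes k == sum [d * piCount d | d <- [1 .. 6], k `mod` d == 0]
               | k <- [1 .. 12] ]
-- ghci> checkFix
-- True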

Since |F(\pi^k) = F(\pi)^k| for a combinatorial species |F|, Möbius inversion, as explicitly stated in Proposition 2.2.3 of Combinatorial Species and Tree-Like Structures, leads to:

\[k(F(\pi))_k = \sum_{d\mid k}\mu\left(\frac{k}{d}\right)\operatorname{fix}F(\pi^d) = (\mu\star(d\mapsto \operatorname{fix}F(\pi^d)))(k) \]

If we Dirichlet convolve both sides of this with |\operatorname{id}|, replacing |F(\pi)| with |\beta| as it doesn’t matter that this permutation comes from an action of a species, we get:

\[\sum_{d\mid m} \frac{m}{d}\,d\beta_d = m\sum_{d\mid m} \beta_d = (\varphi\star(d\mapsto \operatorname{fix}\beta^d))(m)\]

This is just using |\varphi = \operatorname{id}\star\mu|. If we choose |m| such that |\beta^m = \operatorname{id}|, then we get |\sum_{d\mid m} \beta_d = \sum_{k=1}^\infty \beta_k| because |\beta_k| will be |0| for all the |k| which don’t divide |m|. This makes the previous equation into equation 2.2 (34) in the book.

Since we know |n = \sum_{k=1}^\infty k\pi_k| for any permutation |\pi|, we also get: \[\vert F([n])\vert = \sum_{k=1}^\infty\sum_{d\mid k}\mu\left(\frac{k}{d}\right)\operatorname{fix}F(\pi^d) = \sum_{k=1}^\infty(\mu\star(d\mapsto\operatorname{fix}F(\pi^d)))(k)\]

These equations give us a way to compute some of these divisor sums by looking at the number of fixed points and cycles of the action of species and vice versa. For example, 2.3 (49) is a series of Dirichlet convolutions connected to weighted species.

Derivative of Dirichlet series

We can easily compute the derivative of a Dirichlet series (assuming sufficiently strong convergence so we can push the differentiation into the sum):

\[\begin{flalign} \mathcal D[f]’(s) & = \frac{d}{ds}\sum_{n=1}^\infty f(n)n^{-s} \\ & = \sum_{n=1}^\infty f(n)\frac{d}{ds}n^{-s} \\ & = \sum_{n=1}^\infty f(n)\frac{d}{ds}e^{-s\ln n} \\ & = \sum_{n=1}^\infty -f(n)\ln n e^{-s\ln n} \\ & = -\sum_{n=1}^\infty f(n)\ln n n^{-s} \\ & = -\mathcal D[f\ln](s) \end{flalign}\]

This leads to the identity |\frac{d}{ds}\ln\mathcal D[f](s) = \mathcal D[f]’ (s)/\mathcal D[f](s) = -\mathcal D[f\ln \star f^{-1}](s)| where |f^{-1}| denotes the Dirichlet inverse of |f|, discussed below. For example, with |f = \bar 1| we have |-\zeta’(s)/\zeta(s) = \mathcal D[\ln \star \mu](s)|. Using the Euler product formula, we have |\ln\zeta(s) = -\sum_{p\in\mathbb P}\ln(1-p^{-s})|. Differentiating this gives \[\begin{flalign} \frac{d}{ds}\ln\zeta(s) & = -\sum_{p\in\mathbb P} p^{-s}\ln p/(1 - p^{-s}) \\ & = -\sum_{p\in\mathbb P} \sum_{k=1}^\infty \ln p (p^k)^{-s} \\ & = -\sum_{n=1}^\infty \Lambda(n) n^{-s} \\ & = -\mathcal D[\Lambda](s) \end{flalign}\] where |\Lambda(n) = \begin{cases}\ln p,&p\in\mathbb P\land\exists k\in\mathbb N_+.n=p^k \\ 0, & \text{otherwise}\end{cases}|. |\Lambda|, which is neither a multiplicative nor an additive function, is known as the von Mangoldt function. Just to write it explicitly, the above implies |\Lambda = \ln \star \mu|, i.e. |\Lambda| is the Möbius inversion of |\ln|. This can be generalized for arbitrary completely multiplicative functions besides |\bar 1| to get |\mathcal D[f]’/\mathcal D[f] = -\mathcal D[f\Lambda]|.
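Here is a small numeric sanity check in Haskell (Doubles and brute-force helpers of my own, a sketch rather than anything efficient) that |\Lambda = \ln \star \mu| on small inputs:

isPrime :: Int -> Bool
isPrime n = n > 1 && all (\d -> n `mod` d /= 0) [2 .. n - 1]

divisors :: Int -> [Int]
divisors n = [d | d <- [1 .. n], n `mod` d == 0]

mobius :: Int -> Double
mobius n
  | any (\p -> n `mod` (p * p) == 0) ps = 0
  | otherwise = (-1) ^ length ps
  where ps = [p | p <- [2 .. n], isPrime p, n `mod` p == 0]

vonMangoldt :: Int -> Double
vonMangoldt n
  | n > 1, [p] <- [q | q <- [2 .. n], isPrime q, n `mod` q == 0] =
      log (fromIntegral p)   -- exactly one distinct prime factor means n = p^k
  | otherwise = 0

checkLambda :: Bool
checkLambda = all close [1 .. 200]
  where close n = abs (vonMangoldt n
                       - sum [log (fromIntegral d) * mobius (n `div` d)
                             | d <- divisors n]) < 1e-9
-- ghci> checkLambda
-- True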

We now have multiple perspectives on |\Lambda| which is a kind of “indicator function” for prime powers.

Dirichlet Inverse

Let’s say we’re given an arithmetic function |f|, and we want to find an arithmetic function |g| such that |f \star g = \delta| which we’ll call the Dirichlet inverse of |f|. We immediately get |(f \star g)(1) = f(1)g(1) = 1 = \delta(1)|. So, supposing |f(1)\neq 0|, we can define |g(1) = 1/f(1)|. We then get a recurrence relation for all the remaining values of |g| via: \[0 = (f \star g)(n) = f(1)g(n) + \sum_{d \mid n, d\neq 1} f(d)g(n/d)\] for |n > 1|. Solving for |g(n)|, we have: \[g(n) = -f(1)^{-1}\sum_{d\mid n,d\neq 1}f(d)g(n/d)\] where the right-hand side only requires |g(k)| for |k < n|. If |f| is multiplicative, then |f(1) = 1| and the inverse of |f| exists.
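This recurrence translates directly into Haskell. A sketch (naively recursive, so only usable for small |n|):

divisors :: Int -> [Int]
divisors n = [d | d <- [1 .. n], n `mod` d == 0]

dirichletInverse :: (Int -> Double) -> (Int -> Double)
dirichletInverse f = g
  where
    g 1 = 1 / f 1
    g n = negate (sum [f d * g (n `div` d) | d <- divisors n, d /= 1]) / f 1

-- The inverse of the constant function 1-bar should be the Mobius function:
-- ghci> map (dirichletInverse (const 1)) [1 .. 10]
-- [1.0,-1.0,-1.0,0.0,-1.0,1.0,-1.0,0.0,0.0,1.0]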

If |f| is completely multiplicative, its Dirichlet inverse is |\mu f|. This follows easily from |f \star \mu f = (\bar 1 \star \mu)f = \delta f = \delta|. As an example, |({-})^z| is completely multiplicative so its inverse is |({-})^z\mu|. Since the inverse of a Dirichlet convolution is the convolution of the inverses, we get |\varphi^{-1}(n) = \sum_{d\mid n}d\mu(d)|. Not to be confused with |\varphi(n) = (\operatorname{id}\star\mu)(n) = \sum_{d\mid n} d\mu(n/d)|.

Less trivially, the inverse of a multiplicative function is also a multiplicative function. We can prove it by complete induction on |\mathbb N_+| using the formula for |g| from above.

Theorem: If |f\star g = \delta|, then |g| is multiplicative when |f| is.

Proof: Let |n = ab| where |a| and |b| are coprime. If |a| (or, symmetrically, |b|) is equal to |1|, then since |g(1) = 1/f(1) = 1|, we have |g(1n) = g(1)g(n) = g(n)|. Now assume neither |a| nor |b| are |1| and, as the induction hypothesis, assume that |g| is multiplicative on all numbers less than |n|. We have: \[\begin{flalign} g(ab) & = -\sum_{d\mid ab,d\neq 1}f(d)g(ab/d) \\ & = -\sum_{d_a \mid a}\sum_{d_b \mid b,d_a d_b \neq 1}f(d_ad_b)g(ab/(d_ad_b)) \\ & = -\sum_{d_a \mid a}\sum_{d_b \mid b,d_a d_b \neq 1}f(d_a)f(d_b)g(a/d_a)g(b/d_b) \\ & = -\sum_{d_b \mid b,d_b \neq 1}f(d_b)g(a)g(b/d_b) - \sum_{d_a \mid a,d_a \neq 1}\sum_{d_b \mid b}f(d_a)f(d_b)g(a/d_a)g(b/d_b) \\ & = -g(a)\sum_{d \mid b,d \neq 1}f(d)g(b/d) - \sum_{d_a \mid a,d_a \neq 1}f(d_a)g(a/d_a)\sum_{d_b \mid b}f(d_b)g(b/d_b) \\ & = g(a)g(b) - \sum_{d_a \mid a,d_a \neq 1}f(d_a)g(a/d_a) (f \star g)(b) \\ & = g(a)g(b) - \delta(b)\sum_{d_a \mid a,d_a \neq 1}f(d_a)g(a/d_a) \\ & = g(a)g(b) \end{flalign}\] |\square|

Assuming |f| has a Dirichlet inverse, we also have: \[\mathcal D[f^{-1}](s) = \mathcal D[f](s)^{-1}\] immediately from the convolution theorem.

More Examples

Given a multiplicative function |f|:

\[\begin{align} \mathcal D[f(\gcd({-},n_P))](s) & = \zeta(s)\prod_{(p,k)\in P}(1 - p^{-s})\left(\sum_{n=0}^\infty f(p^{\min(k,n)})p^{-ns}\right) \\ & = \zeta(s)\prod_{(p,k)\in P}(1 - p^{-s})\left(\frac{f(p^k)p^{-(k+1)s}}{1 - p^{-s}} + \sum_{n=0}^k f(p^n)p^{-ns}\right) \end{align}\]

As an example, |\eta(s) = (1 - 2^{1-s})\zeta(s) = \mathcal D[f](s)| where |f(n) = \begin{cases}-1,&2\mid n\\1,&2\nmid n\end{cases}|.

Alternatively, |f(n) = \mu(\gcd(n, 2))| and we can apply the above formula to see: \[\begin{flalign} \mathcal D[\mu(\gcd({-},2))] & = \zeta(s)(1-2^{-s})\left(\frac{\mu(2)2^{-2s}}{1 - 2^{-s}} + \sum_{n=0}^1 \mu(2^n)2^{-ns}\right) \\ & = \zeta(s)(1-2^{-s})\left(\frac{-2^{-2s}}{1 - 2^{-s}} + 1 - 2^{-s}\right) \\ & = \zeta(s)(-2^{-2s} + (1 - 2^{-s})^2) \\ & = \zeta(s)(1 - 2^{1-s}) \end{flalign}\]
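As a numeric spot check (a truncated-sum sketch in Haskell), the Dirichlet series of |\mu(\gcd({-},2))| evaluated at |s=2| should be |\eta(2) = (1 - 2^{-1})\zeta(2) = \pi^2/12|:

etaApprox :: Double
etaApprox = sum [ (if even n then -1 else 1) / fromIntegral (n * n)
                | n <- [1 .. 100000 :: Int] ]
-- ghci> (etaApprox, pi * pi / 12)
-- (0.8224670...,0.8224670...)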

|\lambda| and |\gamma|

Recall that |\lambda| is completely multiplicative and is characterized by |\lambda(p) = -1|.

We can show that |\mathcal D[\lambda](s) = \zeta(2s)/\zeta(s)| which is equivalent to saying |\bar 1^{(2)} \star \mu = \lambda| or |\lambda\star\bar 1 = \bar 1^{(2)}|.

\[\begin{flalign} \zeta(2s)/\zeta(s) & = \prod_{p\in\mathbb P} \frac{1-p^{-s}}{1-(p^{-s})^2} \\ & = \prod_{p\in\mathbb P} \frac{1-p^{-s}}{(1-p^{-s})(1+p^{-s})} \\ & = \prod_{p\in\mathbb P} (1 + p^{-s})^{-1} \\ & = \prod_{p\in\mathbb P} (1 - \lambda(p)p^{-s})^{-1} \\ & = \mathcal D[\lambda](s) \end{flalign}\]

We have |\lambda\mu = \vert\mu\vert = \mu\mu|, which is the inverse of |\lambda|, so |\mathcal D[\vert\mu\vert](s) = \zeta(s)/\zeta(2s)|.
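A quick Haskell check of |\lambda\star\bar 1 = \bar 1^{(2)}| (a sketch with brute-force helpers of my own): the divisor sum of |\lambda| should be |1| on squares and |0| elsewhere.

bigOmega :: Int -> Int            -- prime factors counted with multiplicity
bigOmega n = go n 2
  where
    go 1 _ = 0
    go m p
      | m `mod` p == 0 = 1 + go (m `div` p) p
      | otherwise      = go m (p + 1)

liouville :: Int -> Int
liouville n = (-1) ^ bigOmega n

checkLiouville :: Bool
checkLiouville = and
  [ sum [liouville d | d <- [1 .. n], n `mod` d == 0]
      == (if isSquare n then 1 else 0)
  | n <- [1 .. 200] ]
  where
    isSquare m = let r = floor (sqrt (fromIntegral m :: Double)) :: Int
                 in r * r == m || (r + 1) * (r + 1) == m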

Recall that |\gamma| is multiplicative and is characterized by |\gamma(p^n) = -1|.

\[\begin{flalign} \mathcal D[\gamma](s) & = \prod_{p \in \mathbb P}\left(1 + \sum_{n=1}^\infty \gamma(p^n)p^{-ns}\right) \\ & = \prod_{p \in \mathbb P}\left(1 - \sum_{n=1}^\infty p^{-ns}\right) \\ & = \prod_{p \in \mathbb P}\left(1 - \left(\sum_{n=0}^\infty p^{-ns} - 1\right)\right) \\ & = \prod_{p \in \mathbb P}\frac{2(1 - p^{-s}) - 1}{1 - p^{-s}} \\ & = \prod_{p \in \mathbb P}\frac{1 - 2p^{-s}}{1 - p^{-s}} \end{flalign}\]

This implies that |(\gamma\star\mu)(p^n) = \begin{cases}-2, & n=1 \\ 0, & n > 1 \end{cases}|.

Indicator Functions

Let |1_{\mathbb P}| be the indicator function for the primes. We have |\omega = 1_{\mathbb P}\star\bar 1| or |1_{\mathbb P} = \omega\star\mu|. Directly, |\mathcal D[1_{\mathbb P}](s) = \sum_{p\in\mathbb P}p^{-s}| so we have |\mathcal D[\omega](s)/\zeta(s) = \sum_{p\in\mathbb P} p^{-s}|.

Lemma: |\mathcal D[1_{\mathbb P}](s)=\sum_{n=1}^\infty \frac{\mu(n)}{n}\ln\zeta(ns)|
Proof: Recalling the expansion |\ln(1-x) = -\sum_{n=1}^\infty x^n/n|, we proceed as follows: \[\begin{align} \sum_{n=1}^\infty \frac{\mu(n)}{n}\ln\zeta(ns) & = \sum_{n=1}^\infty \frac{\mu(n)}{n}\ln\left(\prod_{p\in\mathbb P}(1 - p^{-ns})^{-1}\right) \\ & = -\sum_{n=1}^\infty \frac{\mu(n)}{n}\sum_{p\in\mathbb P}\ln(1 - p^{-ns}) \\ & = \sum_{p\in\mathbb P}\sum_{n=1}^\infty \frac{\mu(n)}{n}\sum_{k=1}^\infty p^{-kns}/k \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \sum_{N=kn} \frac{\mu(n)}{N}p^{-Ns} \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}\sum_{N=kn}\mu(n) \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}(\mu\star\bar 1)(N) \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}\delta(N) \\ & = \sum_{p\in\mathbb P} p^{-s} \\ & = \mathcal D[1_{\mathbb P}](s) \end{align}\] |\square|

Let |1_{\mathcal P}| be the indicator function for prime powers. |\Omega = 1_{\mathcal P}\star\bar 1| or |1_{\mathcal P} = \Omega\star\mu|. |\mathcal D[1_{\mathcal P}](s) = \sum_{p\in\mathbb P}\sum_{k=1}^\infty p^{-ks} = \sum_{p\in\mathbb P}(p^s - 1)^{-1}| so we have |\mathcal D[\Omega](s)/\zeta(s) = \sum_{p\in\mathbb P}(p^s - 1)^{-1}|.

Lemma: |\mathcal D[1_{\mathcal P}](s)=\sum_{n=1}^\infty \frac{\varphi(n)}{n}\ln\zeta(ns)|
Proof: This is quite similar to the previous proof. \[\begin{align} \sum_{n=1}^\infty \frac{\varphi(n)}{n}\ln\zeta(ns) & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}\sum_{N=kn}\varphi(n) \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}(\varphi\star\bar 1)(N) \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N} N \\ & = \sum_{p\in\mathbb P}\sum_{N=1}^\infty p^{-Ns} \\ & = \mathcal D[1_{\mathcal P}](s) \end{align}\] |\square|

Summatory Functions

One thing we’ve occasionally been taking for granted is that the operator |\mathcal D| is injective. That is, |\mathcal D[f] = \mathcal D[g]| if and only if |f = g|. To show this, we’ll use the fact that we can (usually) invert the Mellin transform which can be viewed roughly as a version of |\mathcal D| that operates on continuous functions.

Before talking about the Mellin transform, we’ll talk about summatory functions as this will ease our later discussion.

We will turn a sum into a continuous function via a zero-order hold, i.e. we will take the floor of the input. Thus |\sum_{n\leq x} f(n)| is constant on any interval of the form |[k,k+1)|. It then (potentially) has jump discontinuities at integer values. The beginning of the sum is at |n=1| so for all |x<1|, the sum up to |x| is |0|. We will need a slight tweak to better deal with these discontinuities. This will be indicated by a prime on the summation sign.

For non-integer values of |x|, we have: \[\sum_{n \leq x}’ f(n) = \sum_{n \leq x} f(n)\]

For |m| an integer, we have: \[ \sum_{n \leq m}’ f(n) = \frac{1}{2}\left(\sum_{n<m} f(n) + \sum_{n \leq m} f(n)\right) = \sum_{n\leq m} f(n) - f(m)/2 \]
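As a Haskell sketch (taking an arithmetic function as |Int \to Double|; the names are mine), the primed sum might look like:

summatory :: (Int -> Double) -> Double -> Double
summatory f x
  | isWhole x = plain - f m / 2    -- midpoint value at a jump
  | otherwise = plain
  where
    m = floor x
    plain = sum [f n | n <- [1 .. m]]
    isWhole y = fromIntegral (floor y :: Int) == y

-- ghci> summatory (const 1) 3.5   ==>  3.0
-- ghci> summatory (const 1) 3     ==>  2.5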

This kind of thing should be familiar to those who’ve worked with things like Laplace transforms of discontinuous functions. (Not for no reason…)

One reason for introducing these summatory functions is that they are a little easier to work with. Arguably, we want something like |\frac{d}{dx}\sum_{n\leq x}f(n) = \sum_{n=1}^\infty f(n)\delta(n-x)|, but that means we end up with a bunch of distribution nonsense and even more improper integrals. The summatory function may be discontinuous, but it at least has a finite value everywhere. Of course, another reason for introducing these functions is that they often are values we’re interested in.

Several important functions are continuous “sums” of arithmetic functions of this form:

  • Mertens function: |M(x) = \sum_{n\leq x}’ \mu(n)|
  • Chebyshev function: |\vartheta(x) = \sum_{p\leq x, p\in\mathbb P}’ \ln p = \sum_{n\leq x} 1_{\mathbb P}(n)\ln n|
  • Second Chebyshev function: |\psi(x) = \sum_{n\leq x}’ \Lambda(n) = \sum_{n=1}^\infty \vartheta(x^{1/n})|
  • The prime-counting function: |\pi(x) = \sum_{n\leq x}’ 1_{\mathbb P}(n)|
  • Riemann’s prime-power counting function: |\Pi_0(x) = \sum_{n\leq x}’ \frac{\Lambda(n)}{\ln n} = \sum_{n=1}^\infty \sum_{p^n\leq x,p\in\mathbb P}’ n^{-1} = \sum_{n=1}^\infty\pi(x^{1/n})n^{-1}|
  • |D(x) = \sum_{n\leq x}d(n)|

These are interesting in how they relate to the prime-counting function.

Let’s consider the arithmetic function |\Lambda/\ln| whose Dirichlet series is |\ln\zeta|.

We have the summation function |\sum_{n\leq x}’ \Lambda(n)/\ln(n)|, but |\Lambda(n)| is |0| except when |n=p^k| for some |p\in\mathbb P| and |k\in\mathbb N_+|. Therefore, we have \[\begin{align} \sum_{n\leq x}’ \frac{\Lambda(n)}{\ln(n)} & = \sum_{k=1}^\infty\sum_{p^k\leq x, p\in\mathbb P}’ \frac{\Lambda(p^k)}{\ln(p^k)} \\ & = \sum_{k=1}^\infty\sum_{p^k\leq x, p\in\mathbb P}’ \frac{\ln(p)}{k\ln(p)} \\ & = \sum_{k=1}^\infty\sum_{p^k\leq x, p\in\mathbb P}’ \frac{1}{k} \\ & = \sum_{k=1}^\infty \frac{1}{k} \sum_{p^k\leq x, p\in\mathbb P}’ 1 \\ & = \sum_{k=1}^\infty \frac{1}{k} \sum_{p\leq x^{1/k}, p\in\mathbb P}’ 1 \\ & = \sum_{k=1}^\infty \frac{\pi(x^{1/k})}{k} \\ \end{align}\]
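Here is a numeric Haskell sketch (brute-force prime counting, helper names mine) comparing the two sides at a non-integer |x|, where the primed sums are just plain sums:

isPrime :: Int -> Bool
isPrime n = n > 1 && all (\d -> n `mod` d /= 0) [2 .. n - 1]

primePi :: Double -> Int                   -- pi(x)
primePi x = length [p | p <- [2 .. floor x], isPrime p]

lambdaOverLog :: Int -> Double             -- Lambda(n)/ln n
lambdaOverLog n
  | [p] <- [q | q <- [2 .. n], isPrime q, n `mod` q == 0] =
      log (fromIntegral p) / log (fromIntegral n)
  | otherwise = 0

lhs, rhs :: Double -> Double
lhs x = sum [lambdaOverLog n | n <- [2 .. floor x]]
rhs x = sum [ fromIntegral (primePi (x ** (1 / fromIntegral k))) / fromIntegral k
            | k <- [1 .. floor (logBase 2 x) :: Int] ]
-- ghci> (lhs 100.5, rhs 100.5)   -- both are 28.5333...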

|\ln\zeta(s) = s\mathcal M[\Pi_0](-s)=\mathcal D[\Lambda/\ln](s)| where |\mathcal M| is the Mellin transform, and the connection to Dirichlet series is described in the following section.

Mellin Transform

The definition of the Mellin transform and its inverse are:

\[\mathcal M[f](s) = \int_0^\infty x^s\frac{f(x)}{x}dx\] \[\mathcal M^{-1}[\varphi](x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} x^{-s}\varphi(s)ds\]

The contour integral is intended to mean the vertical line with real part |c| traversed from negative to positive imaginary values. Modulo the opposite sign of |s| and the extra factor of |x|, this is quite similar to a continuous version of a Dirichlet series.

The Mellin transform is closely related to the two-sided Laplace transform.

\[\mathcal D[f](s) = s\mathcal M\left[x\mapsto \sum_{n\leq x}’ f(n)\right](-s)\]

Using Mellin transform properties, particularly the one for transforming the derivative, we can write the following.

\[\begin{align} \mathcal D[f](s) = s\mathcal M\left[x\mapsto \sum_{n\leq x}’ f(n)\right](-s) & \iff \mathcal D[f](1-s) = -(s-1)\mathcal M\left[x\mapsto \sum_{n\leq x}’ f(n)\right](s-1) \\ & \iff \mathcal D[f](1-s) = \mathcal M\left[x\mapsto \frac{d}{dx}\sum_{n\leq x}’ f(n)\right](s) \\ & \iff \mathcal D[f](1-s) = \int_0^\infty x^{s-1}\frac{d}{dx}\sum_{n\leq x}’ f(n)dx \\ & \iff \mathcal D[f](1-s) = \int_0^\infty x^{s-1}\sum_{n=1}^\infty f(n)\delta(x-n)dx \\ & \iff \mathcal D[f](1-s) = \sum_{n=1}^\infty f(n)n^{s-1} \\ & \iff \mathcal D[f](s) = \sum_{n=1}^\infty f(n)n^{-s} \end{align}\]

This leads to Perron’s formula:

\[\begin{align} \sum_{n\leq x}’ f(n) & = \mathcal M^{-1}[s\mapsto -\mathcal D[f](-s)/s](x) \\ & = \frac{1}{2\pi i}\int_{-c-i\infty}^{-c+i\infty}\frac{\mathcal D[f](-s)}{-s} x^{-s} ds \\ & = -\frac{1}{2\pi i}\int_{c+i\infty}^{c-i\infty}\frac{\mathcal D[f](s)}{s} x^s ds \\ & = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\frac{\mathcal D[f](s)}{s} x^s ds \end{align}\]

for which we need to take the Cauchy principal value to get something defined. (See also Abel summation.)

There are side conditions on the convergence of |\mathcal D[f]| for these formulas to be justified. See the links.

Many of the operations we’ve described on Dirichlet series follow from Mellin transform properties. For example, we have |\mathcal M[f]’(s) = \mathcal M[f\ln](s)| generally.

Summary

Properties

Dirichlet Convolution

Dirichlet convolution is |(f\star g)(n) = \sum_{d\mid n} f(d)g(n/d) = \sum_{mk=n} f(m)g(k)|.

Arithmetic functions form a commutative ring with Dirichlet convolution as the multiplication, |\delta| as the multiplicative unit, and pointwise addition as the additive structure. This is to say that Dirichlet convolution is commutative, associative, unital, and bilinear.

For |f| completely multiplicative, |f(g\star h) = fg \star fh|.

Dirichlet Inverse

For any |f| such that |f(1)\neq 0|, there is a |g| such that |f\star g = \delta|. In particular, the set of multiplicative functions forms a subgroup of this multiplicative group, i.e. the Dirichlet convolution of multiplicative functions is multiplicative.

If |f(1) \neq 0|, then |f \star g = \delta| where |g| is defined by the following recurrence:

\[\begin{flalign} g(1) & = 1/f(1) \\ g(n) & = -f(1)^{-1}\sum_{d\mid n,d\neq 1}f(d)g(n/d) \end{flalign}\]

For a completely multiplicative |f|, its Dirichlet inverse is |\mu f|.

Convolution Theorem

\[\mathcal D[f](s)\mathcal D[g](s) = \mathcal D[f\star g](s)\]

Möbius Inversion

\[\delta = \bar 1 \star \mu\]

This means that from a divisor sum |g(n) = \sum_{d\mid n}f(d) = (f\star\bar 1)(n)| for each |n|, we can recover |f| via |g\star\mu = f\star\bar 1\star\mu = f|. Which is to say |f(n)=\sum_{d\mid n}g(d)\mu(n/d)|.
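As a round-trip sanity check in Haskell (a sketch with brute-force helpers of my own): build |g = f\star\bar 1| from an arbitrary |f| and recover |f = g\star\mu|.

divisors :: Int -> [Int]
divisors n = [d | d <- [1 .. n], n `mod` d == 0]

mu :: Int -> Int
mu n
  | any (\p -> n `mod` (p * p) == 0) ps = 0
  | otherwise = (-1) ^ length ps
  where ps = [p | p <- [2 .. n], n `mod` p == 0,
                  all (\d -> p `mod` d /= 0) [2 .. p - 1]]

f, g, f' :: Int -> Int
f n  = n * n + 1                                     -- an arbitrary choice
g n  = sum [f d | d <- divisors n]                   -- g = f * 1
f' n = sum [g d * mu (n `div` d) | d <- divisors n]  -- recovered f = g * mu
-- ghci> all (\n -> f n == f' n) [1 .. 100]
-- True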

This can be generalized via |({-})^k\mu\star({-})^k = \delta|. In sums, this means when |g(n)=\sum_{d\mid n}d^k f(n/d)|, then |f(n)=\sum_{d\mid n}\mu(d)d^k g(n/d)|.

Let |h| be a completely multiplicative function. Given |g(m) = \sum_{n=1}^\infty f(mh(n))|, then |f(n) = \sum_{m=1}^\infty \mu(m)g(nh(m))|.

Using the Möbius function for finite multisets and their inclusion ordering, we can recast Möbius inversion of naturals as Möbius inversion of finite multisets (of primes) a la: \[n_P = \sum_{Q\subseteq P}\mu(P\setminus Q)n_Q = \sum_{Q\subseteq P}\mu(n_P/n_Q)n_Q = \sum_{d\mid n_P}\mu(n_P/d)d \]

Dirichlet Series

\[\mathcal D[f](s) = \sum_{n=1}^\infty f(n)n^{-s}\]

\[\mathcal D[n\mapsto f(n)n^k](s) = \mathcal D[f](s - k)\]

\[\mathcal D[f^{-1}](s) = \mathcal D[f](s)^{-1}\] where the inverse on the left is the Dirichlet inverse.

\[\mathcal D[f]’(s) = -\mathcal D[f\ln](s)\]

For a completely multiplicative |f|, \[\mathcal D[f]’(s)/\mathcal D[f](s) = -\mathcal D[f\Lambda](s)\] and: \[\ln\mathcal D[f](s) = \mathcal D[f\Lambda/\ln](s)\]

Dirichlet series as a Mellin transform:

\[\mathcal D[f](s) = s\mathcal M\left[x\mapsto \sum_{n\leq x}’ f(n)\right](-s)\]

The corresponding inverse Mellin transform statement is called Perron’s Formula:

\[\sum_{n\leq x}’ f(n) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\frac{\mathcal D[f](s)}{s} x^s ds\]

Euler Product Formula

Assuming |f| is multiplicative, we have:

\[\mathcal D[f](s) = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f(p^n)p^{-ns} = \prod_{p \in \mathbb P}\left(1 + \sum_{n=1}^\infty f(p^n)p^{-ns}\right) \]

When |f| is completely multiplicative, this can be simplified to:

\[\mathcal D[f](s) = \prod_{p \in \mathbb P}(1 - f(p)p^{-s})^{-1} \]

Lambert Series

Given an arithmetic function |a|, these are series of the form: \[ \sum_{n=1}^\infty a(n) \frac{x^n}{1-x^n} = \sum_{n=1}^\infty (a \star \bar 1)(n) x^n \]

\[\sum_{n=1}^\infty \mu(n) \frac{x^n}{1-x^n} = x\]

\[\sum_{n=1}^\infty \varphi(n) \frac{x^n}{1-x^n} = \frac{x}{(1-x)^2}\]
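A coefficientwise Haskell check of the second example (a sketch with my own helpers): |x^k/(1-x^k)| contributes |a(k)| to the coefficient of |x^{jk}| for every |j|, so the |n|-th coefficient of the left-hand side is the divisor sum, which should equal |n| here since the |n|-th coefficient of |x/(1-x)^2| is |n|.

phi :: Int -> Int
phi n = length [k | k <- [1 .. n], gcd k n == 1]

lambertCoeff :: (Int -> Int) -> Int -> Int    -- coefficient of x^n
lambertCoeff a n = sum [a k | k <- [1 .. n], n `mod` k == 0]

checkLambert :: Bool
checkLambert = and [lambertCoeff phi n == n | n <- [1 .. 200]]
-- ghci> checkLambert
-- True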

Arithmetic function definitions

|f(p^n)=\cdots| implies a multiplicative/additive function, while |f(p)=\cdots| implies a completely multiplicative/additive function.

|p^z| for |z\in\mathbb C| is completely multiplicative. This includes the identity function (|z=1|) and |\bar 1| (|z=0|). For any multiplicative |f|, |f\circ \gcd({-},k)| is multiplicative.

|\ln| is completely additive.

Important but neither additive nor multiplicative are the indicator functions for primes |1_{\mathbb P}| and prime powers |1_{\mathcal P}|.

The following functions are (completely) multiplicative unless otherwise specified.

\[\begin{align} \delta(p) & = 0 \tag{Kronecker delta} \\ \bar 1(p) & = 1 = p^0 \\ \mu(p^n) & = \begin{cases}-1, & n = 1 \\ 0, & n > 1\end{cases} \tag{Möbius function} \\ \Omega(p) & = 1 \tag{additive} \\ \lambda(p) & = -1 = (-1)^{\Omega(p)} \tag{Liouville function} \\ \omega(p^n) & = 1 \tag{additive} \\ \gamma(p^n) & = -1 = (-1)^{\omega(p^n)} \\ a(p^n) & = p(n) \tag{p(n) is the partition function} \\ \varphi(p^n) & = p^n - p^{n-1} = p^n(1 - 1/p) = J_1(p^n) \tag{Euler totient function} \\ \sigma_k(p^n) & = \sum_{m=0}^n p^{km} = \sum_{d\mid p^n} d^k = \frac{p^{k(n+1)}-1}{p^k - 1} \tag{last only works for k>0} \\ d(p^n) & = n + 1 = \sigma_0 \\ f^{[k]}(p^n) & = \begin{cases}f(p^m),& km=n\\0,& k\nmid n\end{cases} \tag{f multiplicative} \\ \Lambda(n) & = \begin{cases}\ln p,&p\in\mathbb P\land\exists k\in\mathbb N_+.n=p^k \\ 0, & \text{otherwise}\end{cases} \tag{not multiplicative} \\ J_k(p^n) & = p^{kn} - p^{k(n-1)} = p^{kn}(1 - p^{-k}) \tag{Jordan totient function} \\ \psi_k(p^n) & = p^{kn} + p^{k(n-1)} = p^{kn}(1 + p^{-k}) = J_{2k}(p^n)/J_k(p^n) \tag{Dedekind psi function} \\ \end{align}\]

Dirichlet convolutions

\[\begin{align} \delta & = \bar 1 \star \mu \\ \varphi & = \operatorname{id}\star\mu \\ \sigma_z & = ({-})^z \star \bar 1 = \psi_z \star \bar 1^{(2)} \\ \sigma_1 & = \varphi \star d \\ d & = \sigma_0 = \bar 1 \star \bar 1 \\ f \star f & = fd \tag{f completely multiplicative} \\ f\Lambda & = f\ln \star f\mu = f\ln \star f^{-1} \tag{f completely multiplicative, Dirichlet inverse} \\ \lambda & = \bar 1^{(2)} \star \mu \\ \vert\mu\vert & = \lambda^{-1} = \mu\lambda \tag{Dirichlet inverse} \\ 2^\omega & = \vert\mu\vert \star \bar 1 \\ \psi_z & = ({-})^z \star \vert\mu\vert \\ \operatorname{fix} \pi^{(-)} & = \bar 1 \star (k \mapsto k\pi_k) \tag{for a permutation} \\ ({-})^k & = J_k \star \bar 1 \end{align}\]

More Dirichlet convolution identities are here, though many are trivial consequences of the earlier properties.

Dirichlet series

\[\begin{array}{l|ll} f(n) & \mathcal D[f](s) & \\ \hline \delta(n) & 1 & \\ \bar 1(n) & \zeta(s) & \\ n & \zeta(s-1) & \\ n^z & \zeta(s-z) & \\ \sigma_z(n) & \zeta(s-z)\zeta(s) & \\ \mu(n) & \zeta(s)^{-1} & \\ \vert\mu(n)\vert & \zeta(s)/\zeta(2s) & \\ \varphi(n) & \zeta(s-1)/\zeta(s) & \\ d(n) & \zeta(s)^2 & \\ \mu(\gcd(n, 2)) & \eta(s) = (1-2^{1-s})\zeta(s) & \\ \lambda(n) & \zeta(2s)/\zeta(s) & \\ \gamma(n) & \prod_{p \in \mathbb P}\frac{1-2p^{-s}}{1-p^{-s}} & \\ f^{[k]}(n) & \mathcal D[f](ks) & \\ f(n)\ln n & -\mathcal D[f]’(s) & \\ \Lambda(n) & -\zeta’(s)/\zeta(s) & \\ \Lambda(n)/\ln(n) & \ln\zeta(s) & \\ 1_{\mathbb P}(n) & \sum_{n=1}^\infty \frac{\mu(n)}{n}\ln\zeta(ns) & \\ 1_{\mathcal P}(n) & \sum_{n=1}^\infty \frac{\varphi(n)}{n}\ln\zeta(ns) & \\ \psi_k(n) & \zeta(s)\zeta(s - k)/\zeta(2s) & \\ J_k(n) & \zeta(s - k)/\zeta(s) & \end{array}\]


  1. Viewing natural numbers as multisets, |D_n| is the set of all sub-multisets of |n|. The isomorphism described is then simply the fact that given any sub-multiset of the union of two disjoint multisets, we can sort the elements into their original multisets producing two sub-multisets of the disjoint multisets.↩︎

  2. Incidence algebras are a decategorification of the notion of a category algebra.↩︎

August 22, 2025 11:25 PM

Edward Z. Yang

You could have invented CuTe hierarchical layout (but maybe not the rest of it?)

CuTe is a C++ library that aims to make dealing with complicated indexing easier. A key part of how it does this is by defining a Layout type, which specifies how to map from logical coordinates to physical locations (CuTe likes to say layouts are "functions from integers to integers.") In fact, CuTe layouts are a generalization of PyTorch strides, which say you always do this mapping by multiplying each coordinate with its respective stride and summing them together, e.g., i0 * s0 + i1 * s1 + .... Although NVIDIA's docs don't spell it out, CuTe's generalization here is actually very natural, and in this blog post I'd like to explain how you could have invented it (on a good day).

First, a brief recap about strides. PyTorch views allow us to reinterpret the physical layout of a tensor in different ways, changing how we map logical coordinates into physical locations. For example, consider this 2-D tensor:

>>> torch.arange(4).view(2, 2)
tensor([[0, 1],
        [2, 3]])
>>> torch.arange(4).view(2, 2).stride()
(2, 1)

The physical memory reads 0, 1, 2, 3, and if I want to know what the value at coordinate (0, 1) is (row 0, col 1), I compute 0 * 2 + 1 * 1, which tells me I should read out the value at index 1 in physical memory. If I change the strides, I can change the order I read out the physical locations. For example, if I transpose I have:

>>> torch.arange(4).view(2, 2).T
tensor([[0, 2],
        [1, 3]])
>>> torch.arange(4).view(2, 2).T.stride()
(1, 2)

The physical memory hasn't changed, but now when we read out coordinate (0, 1), we compute 0 * 1 + 1 * 2, which tells me I should read the value at index 2 (which is indeed what I see at this coordinate!)

PyTorch also allows us to "flatten" dimensions of a tensor, treating them as a 1D tensor. Intuitively, a 2-D tensor flattened into a 1-D one involves just concatenating all the rows together into one line:

>>> torch.arange(4).view(2, 2).view(-1)
tensor([0, 1, 2, 3])

We should be able to do this for the transpose too, getting tensor([0, 2, 1, 3]), but instead, this is what you get:

>>> torch.arange(4).view(2, 2).T.view(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The dreaded "use reshape instead" error! The error is unavoidable under PyTorch striding: there is no stride we can select that will cause us to read the elements in this order (0, 2, 1, 3); after all, i0 * s0 is a pretty simple equation, we can't simultaneously have 1 * s0 == 2 and 2 * s0 == 1.

Upon learning this, an understandable reaction is to just shrug, assume that this is impossible to fix, and move on with your life. But today, you are especially annoyed by this problem, because you were only trying to flatten N batch dimensions into a single batch dimension so that you could pass it through a function that only works with one batch dimension, with the plan of unflattening it when you're done. It doesn't matter that this particular layout is inexpressible with strides; you aren't going to rely on the layout in any nontrivial way, you just care that you can flatten and then unflatten back to the original layout.

Imagine we're dealing with a tensor of size (2, 2, 2) where the strides for dim 0 and dim 1 were transposed as (2, 4, 1). It should be OK to flatten this into a tensor (4, 2) and then unflatten it back to (2, 2, 2). Intuitively, I'd like to "remember" what the original sizes and strides are, so that I can go back to them. Here's an idea: let's just store the original size/stride as a nested entry in our size tuple. So instead of the size (4, 2), we have ((2, 2), 2); and now analogously the stride can simply be ((2, 4), 1). When I write (2, 2) as the "size" of a dimension, I really just mean the product 4, but there is some internal structure that affects how I should index its inside, namely, the strides (2, 4). If I ask for the row at index 2, I first have to translate this 1D coordinate into a 2D coordinate (1, 0), and then apply the strides to it like before.

Well, it turns out, this is exactly how CuTe layouts work! In CuTe, sizes/strides are hierarchical: a size is actually a tree of ints, where the hierarchy denotes internal structure of a dimension that you can address linearly (in fact, everything by default can be addressed in a 1-D linear way, even if it's an N-D object.) The documentation of Layout does say this... but I actually suffered a lot extracting out the high level intuition of this blog post, because CuTe uses co-lexicographic ordering when linearizing (it iterates over coordinates (0,0), (1,0), (2,0), etc. rather than in the more normal lexicographic order (0,0), (0,1), (0,2)). This leads to some truly deranged example code where they print a 2D matrix in conventional lexicographic ordering, and then turn around and say, "But wait, if I have the layout take care of translating the 1D coordinate into an ND coordinate, it is colexicographic!!":

> print2D(s2xh4)
  0    2    1    3
  4    6    5    7
# sure, why not?

> print1D(s2xh4)
  0    4    2    6    1    5    3    7
# wtf???

In any case, if you want to engage with the documentation, s2xh4 is the important example to pay attention to for understanding the nested semantics. However, note the example is smeared across like five sections and also you need to know about the co-lexicographic thing to understand why the examples print the way they do.
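For what it's worth, the nested size/stride idea is easy to play with outside of C++. Here is a toy Haskell sketch (my own types and names, not CuTe's API) of a hierarchical layout with colexicographic linearization:

-- A layout is a tree of (size, stride) modes; 1-D coordinates within a
-- nested mode are linearized colexicographically (first mode varies fastest).
data Layout = Mode Int Int        -- size, stride
            | Nest [Layout]       -- a dimension with internal structure

size :: Layout -> Int
size (Mode n _) = n
size (Nest ls)  = product (map size ls)

index :: Layout -> Int -> Int     -- 1-D coordinate -> physical offset
index (Mode _ s) i = i * s
index (Nest ls)  i = go ls i
  where
    go []         _ = 0
    go (l : rest) j = index l (j `mod` size l) + go rest (j `div` size l)

-- The ((2, 2), 2) example from above, with strides ((2, 4), 1): the
-- flattened first dimension "remembers" its internal (2, 2)/(2, 4) structure.
example :: Layout
example = Nest [Nest [Mode 2 2, Mode 2 4], Mode 2 1]

-- ghci> map (index example) [0 .. 7]
-- [0,2,4,6,1,3,5,7]

With a lexicographic convention instead, row index 2 of the nested first dimension would decompose as (1, 0) and land at offset 2, matching the earlier prose; the colexicographic convention is what makes CuTe's 1-D printouts look shuffled.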

by Edward Z. Yang at August 22, 2025 06:48 AM

Brent Yorgey

Decidable equality for indexed data types, take 2

Decidable equality for indexed data types, take 2

Posted on August 22, 2025
Tagged , , , , ,

In a post from a year ago, I explored how to prove decidable equality in Agda of a particular indexed data type. Recently, I discovered a different way to accomplish the same thing, without resorting to embedded sigma types.

This post is literate Agda; you can download it here if you want to play along. I tested everything here with Agda version 2.6.4.3 and version 2.0 of the standard library. (I assume it would also work with more recent versions, but haven’t tested it.)

Background

This section is repeated from my previous post, which I assume no one remembers.

First, some imports and a module declaration. Note that the entire development is parameterized by some abstract set B of base types, which must have decidable equality.

open import Data.Product using (Σ ; _×_ ; _,_ ; -,_ ; proj₁ ; proj₂)
open import Data.Product.Properties using (≡-dec)
open import Function using (_∘_)
open import Relation.Binary using (DecidableEquality)
open import Relation.Binary.PropositionalEquality using (_≡_ ; refl)
open import Relation.Nullary.Decidable using (yes; no; Dec)

module OneLevelTypesIndexed2 (B : Set) (≟B : DecidableEquality B) where

We’ll work with a simple type system containing base types, function types, and some distinguished type constructor □. So far, this is just to give some context; it is not the final version of the code we will end up with, so we stick it in a local module so it won’t end up in the top-level namespace.

module Unindexed where
  data Ty : Set where
    base : B → Ty
    _⇒_ : Ty → Ty → Ty
    □_ : Ty → Ty

For example, if \(X\) and \(Y\) are base types, then we could write down a type like \(\square ((\square \square X \to Y) \to \square Y)\):

  infixr 2 _⇒_
  infix 30 □_

  postulate
    BX BY : B

  X : Ty
  X = base BX
  Y : Ty
  Y = base BY

  example : Ty
  example = □ ((□ □ X ⇒ Y) ⇒ □ Y)

However, for reasons that would take us too far afield in this blog post, I don’t want to allow immediately nested boxes, like \(\square \square X\). We can still have multiple boxes in a type, and even boxes nested inside of other boxes, as long as there is at least one arrow in between. In other words, I only want to rule out boxes immediately applied to another type with an outermost box. So we don’t want to allow the example type given above (since it contains \(\square \square X\)), but, for example, \(\square ((\square X \to Y) \to \square Y)\) would be OK.

Two encodings

In my previous blog post, I ended up with the following encoding of types indexed by a Boxity, which records the number of top-level boxes. Since the boxity of the arguments to an arrow type do not matter, we make them sigma types that package up a boxity with a type having that boxity. I was then able to define decidable equality for ΣTy and Ty by mutual recursion.

data Boxity : Set where
  ₀ : Boxity
  ₁ : Boxity

variable b b₁ b₂ b₃ b₄ : Boxity

module WithSigma where
  ΣTy : Set
  data Ty : Boxity → Set

  ΣTy = Σ Boxity Ty

  data Ty where
    □_ : Ty ₀ → Ty ₁
    base : B → Ty ₀
    _⇒_ : ΣTy → ΣTy → Ty ₀

The problem is that working with this definition of Ty is really annoying! Every time we construct or pattern-match on an arrow type, we have to package up each argument type into a dependent pair with its Boxity; this introduces syntactic clutter, and in many cases we know exactly what the Boxity has to be, so it’s not even informative. The version we really want looks more like this:

data Ty : Boxity → Set where
  base : B → Ty ₀
  _⇒_ : {b₁ b₂ : Boxity} → Ty b₁ → Ty b₂ → Ty ₀
  □_ : Ty ₀ → Ty ₁

infixr 2 _⇒_
infix 30 □_

In this version, the boxities of the arguments to the arrow constructor are just implicit parameters of the arrow constructor itself. Previously, I was unable to get decidable equality to go through for this version… but just the other day, I finally realized how to make it work!

Path-dependent equality

The key trick that makes everything work is to define a path-dependent equality type. I learned this from Martín Escardó. The idea is that we can express equality between two indexed things with different indices, as long as we also have an equality between the indices.

_≡⟦_⟧_ : {A : Set} {B : A → Set} {a₀ a₁ : A} → B a₀ → a₀ ≡ a₁ → B a₁ → Set
b₀ ≡⟦ refl ⟧ b₁   =   b₀ ≡ b₁

That’s exactly what we need here: the ability to express equality between Ty values, which may be indexed by different boxities—as long as we know that the boxities are equal.

Decidable equality for Ty

We can now use this to directly encode decidable equality for Ty. First, we can easily define decidable equality for Boxity.


Boxity-≟ : DecidableEquality Boxity
Boxity-≟ ₀ ₀ = yes refl
Boxity-≟ ₀ ₁ = no λ ()
Boxity-≟ ₁ ₀ = no λ ()
Boxity-≟ ₁ ₁ = yes refl

Here is the type of the decision procedure: given two Ty values which may have different boxities, we decide whether or not we can produce a witness to their equality. Such a witness consists of a pair of (1) a proof that the boxities are equal, and (2) a proof that the types are equal, depending on (1). We would really like to write this as Σ (b₁ ≡ b₂) λ p → σ ≡⟦ p ⟧ τ, but for some reason Agda requires us to fill in some extra implicit arguments before it is happy that everything is unambiguous, requiring some ugly syntax.

Ty-≟′ : (σ : Ty b₁) → (τ : Ty b₂) → Dec (Σ (b₁ ≡ b₂) λ p → _≡⟦_⟧_ {_} {Ty} σ p τ)

Before showing the definition of Ty-≟′, let’s see that we can use it to easily define both a boxity-homogeneous version of decidable equality for Ty, as well as decidable equality for Σ Boxity Ty:

Ty-≟ : DecidableEquality (Ty b)
Ty-≟ {b} σ τ with Ty-≟′ σ τ
... | no σ≢τ = no (λ σ≡τ → σ≢τ (refl , σ≡τ))
... | yes (refl , σ≡τ) = yes σ≡τ

ΣTy-≟ : DecidableEquality (Σ Boxity Ty)
ΣTy-≟ (_ , σ) (_ , τ) with Ty-≟′ σ τ
... | no σ≢τ = no λ { refl → σ≢τ (refl , refl) }
... | yes (refl , refl) = yes refl

A lot of pattern matching on refl and everything falls out quite easily.

And now the definition of Ty-≟′. It looks complicated, but it is actually not very difficult. The most interesting case is when comparing two arrow types for equality: we must first compare the boxities of the arguments, then consider the arguments themselves once we know the boxities are equal.

Ty-≟′ (□ σ) (□ τ) with Ty-≟′ σ τ
... | yes (refl , refl) = yes (refl , refl)
... | no σ≢τ = no λ { (refl , refl) → σ≢τ (refl , refl) }
Ty-≟′ (base S) (base T) with ≟B S T
... | yes refl = yes (refl , refl)
... | no S≢T = no λ { (refl , refl) → S≢T refl }
Ty-≟′ (_⇒_ {b₁} {b₂} σ₁ σ₂) (_⇒_ {b₃} {b₄} τ₁ τ₂) with Boxity-≟ b₁ b₃ | Boxity-≟ b₂ b₄ | Ty-≟′ σ₁ τ₁ | Ty-≟′ σ₂ τ₂
... | no b₁≢b₃ | _ | _ | _ = no λ { (refl , refl) → b₁≢b₃ refl }
... | yes _ | no b₂≢b₄ | _ | _ = no λ { (refl , refl) → b₂≢b₄ refl }
... | yes _ | yes _ | no σ₁≢τ₁ | _ = no λ { (refl , refl) → σ₁≢τ₁ (refl , refl) }
... | yes _ | yes _ | yes _ | no σ₂≢τ₂ = no λ { (refl , refl) → σ₂≢τ₂ (refl , refl) }
... | yes _ | yes _ | yes (refl , refl) | yes (refl , refl) = yes (refl , refl)
Ty-≟′ (□ _) (base _) = no λ ()
Ty-≟′ (□ _) (_ ⇒ _) = no λ ()
Ty-≟′ (base _) (□ _) = no λ ()
Ty-≟′ (base _) (_ ⇒ _) = no λ { (refl , ()) }
Ty-≟′ (_ ⇒ _) (□ _) = no λ ()
Ty-≟′ (_ ⇒ _) (base _) = no λ { (refl , ()) }

by Brent Yorgey at August 22, 2025 12:00 AM

August 21, 2025

in Code

The Baby Paradox in Haskell

Everybody Loves My Baby is a Jazz Standard from 1924 with the famous lyric:

Everybody loves my baby, but my baby don’t love nobody but me.

Which is often formalized as:

\[ \begin{align} \text{Axiom}_1. & \quad \forall x.\ \text{Loves}(x, \text{Baby}) \\ \text{Axiom}_2. & \quad \forall x.\ \text{Loves}(\text{Baby}, x) \implies x = \text{me} \end{align} \]

Let’s prove in Haskell (in one line) that these two statements, taken together, imply that I am my own baby.

The normal proof

The normal proof using propositional logic goes as follows:

  1. If everyone loves Baby, Baby must love Baby. (instantiate axiom 1 with \(x = \text{Baby}\))
  2. If Baby loves someone, that someone must be me. (axiom 2)
  3. Therefore, because Baby loves Baby, Baby must be me. (apply axiom 2 to axiom 1 instantiated with \(x = \text{Baby}\))

Haskell as a Theorem Prover

First, some background: when using Haskell as a theorem prover, you represent the theorem as a type, and proving it involves constructing a value of that type — you create an inhabitant of that type.

Using the Curry-Howard correspondence (often also called the Curry-Howard isomorphism), we can pair some simple logical connectives with types:

  1. Logical “and” corresponds to tupling (or records of values). If (a, b) is inhabited, it means that both a and b are inhabited.
  2. Logical “or” corresponds to sums, Either a b being inhabited implies that either a or b are inhabited. They might both be inhabited, but Either a b requires the “proof” of only one.
  3. Constructivist logical implication is a function: If a -> b is inhabited, it means that an inhabitant of a can be used to create an inhabitant of b.
  4. Any type with a constructor is “true”: (), Bool, String, etc.; any type with no constructor (data Void) is “false” because it has no inhabitants.
  5. Introducing type variables (forall a.) corresponds to…well, for all. For example, forall a. Either a () means that Either a () is “true” (inhabited) for all possible a. This one is represented logically as \(\forall x. x \lor \text{True}\).

You can see that, by chaining together those primitives, you can translate a lot of simple proofs. For example, the proof of “If x and y together imply z, then x implies that y implies z”:

\[ \forall x y z. ((x \wedge y) \implies z) \implies (x \implies (y \implies z)) \]

can be expressed as:

curry :: forall a b c. ((a, b) -> c) -> a -> b -> c
curry f x y = f (x, y)

Or maybe, “If either x or y imply z, then x implies z and y implies z, independently:”

\[ \forall x y z. ((x \lor y) \implies z) \implies ((x \implies z) \land (y \implies z)) \]

In haskell:

unEither :: (Either a b -> c) -> (a -> c, b -> c)
unEither f = (f . Left, f . Right)

And, we have a version of negation: if a -> Void is inhabited, then a must be uninhabited (the principle of explosion). Let’s prove that “‘x or y’ being false implies both x and y are false”: \(\forall x y. \neg(x \lor y) \implies (\neg x \wedge \neg y)\)

deMorgan :: (Either a b -> Void) -> (a -> Void, b -> Void)
deMorgan f = (f . Left, f . Right)

(Maybe surprisingly, that’s the same proof as unEither!)

We can also think of “type functions” (type constructors that take arguments) as “parameterized propositions”:

data Maybe a = Nothing | Just a

Maybe a (like \(\text{Maybe}(x)\)) is the proposition that \(\text{True} \lor x\): Maybe a is always inhabited, because “True or X” is always True. Even Maybe Void is inhabited, as Nothing :: Maybe Void.

The sky is the limit if we use GADTs. We can create arbitrary propositions by restricting what types constructors can be called with. For example, we can create a proposition that x is an element of a list:

data Elem :: k -> [k] -> Type where
    Here :: Elem x (x : xs)
    There :: !(Elem x ys) -> Elem x (y : ys)

Read this as “Elem x xs is true (inhabited) if either x is the first item, or if x is an elem of the tail of the list”. So for example, Elem 5 [1,5,6] is inhabited but Elem 7 [1,5,6] is not:1

itsTrue :: Elem 5 [1,5,6]
itsTrue = There Here

itsNotTrue :: Elem 7 [1,5,6] -> Void
itsNotTrue = \case {}     -- GHC is smart enough to know both cases are invalid

We can create a two-argument proposition that two types are equal, a :~: b:

data (:~:) :: k -> k -> Type where
    Refl :: a :~: a

The proposition a :~: b is only inhabited if a is equal to b, since Refl is its only constructor.

Of course, this whole correspondence assumes we aren’t ever touching bottom (things like undefined or let x = x in x). For this exercise, we are working in a total subset of Haskell.

The Baby Paradox

Now we have enough. Let’s parameterize it over a proposition loves, where loves a b being inhabited means that a loves b.

We can express our axiom as a record of propositions in terms of the atoms loves, me, and baby:

data BabyAxioms loves me baby = BabyAxioms
    { everybodyLovesMyBaby :: forall x. loves x baby
    , myBabyOnlyLovesMe :: forall x. loves baby x -> x :~: me
    }

The first axiom everybodyLovesMyBaby means that for any x, loves x baby must be “true” (inhabited). The second axiom myBabyOnlyLovesMe means that if we have a loves baby x (if my baby loves someone), then it must be that x ~ me: we must be able to derive that the person the baby loves is indeed me.

The expression of the baby paradox then relies on writing the function

babyParadox :: BabyAxioms loves me baby -> me :~: baby

And indeed if we play around with GHC enough, we’ll get this typechecking implementation:

babyParadox :: BabyAxioms loves me baby -> me :~: baby
babyParadox BabyAxioms{everybodyLovesMyBaby, myBabyOnlyLovesMe} =
    myBabyOnlyLovesMe everybodyLovesMyBaby

Using x & f = f x from Data.Function, this becomes a bit smoother to read:

babyParadox :: BabyAxioms loves me baby -> me :~: baby
babyParadox BabyAxioms{everybodyLovesMyBaby, myBabyOnlyLovesMe} =
    everybodyLovesMyBaby & myBabyOnlyLovesMe

And we have just proved it! It ended up being a one-liner. So, given the BabyAxioms loves me baby, it is possible to prove that me must be equal to baby. That is, it is impossible to create any BabyAxioms without me and baby being the same type.

The actual structure of the proof goes like this:

  1. First, we instantiated everybodyLovesMyBaby with x ~ baby, to get loves baby baby.
  2. Then, we used myBabyOnlyLovesMe, which normally takes loves baby x and returns x :~: me. Because we give it loves baby baby, we get a baby :~: me!

And that’s exactly the same structure of the original symbolic proof.

What is Love?

We made BabyAxioms parametric over loves, me, and baby, which means that these apply in any universe where love, me, and baby follow the rules of the song lyrics.

Essentially this means that for any binary relationship Loves x y, if that relationship follows these axioms, it must be true that me is baby. No matter what that relationship actually is, concretely.

That being said, it might be fun to play around with what this might look like in concrete realizations of love, me, and my baby.

First, we could imagine that Love is completely mundane, and can be created between any two operands without any extra required data or constraints — essentially, a proxy between two phantoms:

data Love a b = Love

In this case, it’s impossible to create a BabyAxioms where me and baby are different:

data Love a b = Love

-- | me ~ baby is a constraint required by GHC
proxyLove :: (me ~ baby) => BabyAxioms Love me baby
proxyLove = BabyAxioms
    { everybodyLovesMyBaby = Love
    , myBabyOnlyLovesMe = \_ -> Refl
    }

The me ~ baby constraint being required by GHC is actually an interesting manifestation of the paradox itself, without an explicit proof required on our part. Alternatively, and more traditionally, we can write proxyLove :: BabyAxioms Love baby baby or proxyLove :: BabyAxioms Love me me to mean the same thing.

We can imagine another concrete universe where it is only possible to love my baby, and my baby is the singular recipient of love in this entire universe:

data LoveOnly :: k -> k -> k -> Type where
    LoveMyBaby :: LoveOnly baby x baby

onlyBaby :: BabyAxioms (LoveOnly baby) me baby
onlyBaby = BabyAxioms
    { everybodyLovesMyBaby = LoveMyBaby
    , myBabyOnlyLovesMe = \case LoveMyBaby -> Refl
    }

Now we get both axioms fulfilled for free! Basically if we ever have a LoveOnly baby x me, the only possible constructor is LoveMyBaby :: LoveOnly baby x baby, so me must be baby!

Finally, we could imagine that love has no possible construction, with no way to construct or realize. In this case, love is the uninhabited Void:

data Love a b

In this universe, we can finally fulfil myBabyOnlyLovesMe without me being baby, because “my baby don’t love nobody but me” is vacuously true if there is no possible love. However, we cannot fulfil everybodyLovesMyBaby because no love is possible, except in the case that the universe of people (k) is also empty. But GHC doesn’t have any way to encode empty kinds, I believe (I would love to hear of any techniques if you know of any), so we cannot realize these axioms even if forall (x :: k) is truly empty.

Note that we cannot fully encode the axioms purely as a GADT in Haskell — our LoveOnly was close, but it is too restrictive: in a fully general interpretation of the song, we want to be able to allow other recipients of love besides baby. Basically, Haskell GADTs cannot express the eliminators necessary to encode myBabyOnlyLovesMe purely structurally, as far as I am aware. But I could be wrong.

Why

Nobody who listens to this song seriously believes that the speaker is intending to convey that they are their own baby, or attempting to tantalize the listener with an unintuitive tautology. However, this is indeed a common homework assignment in predicate logic classes, and I wasn’t able to find anyone covering this yet in Haskell, so I thought I might as well be the first.

Sorry, teachers of courses that teach logic through Haskell.

I’ve also been using this paradox as one of my go-to LLM stumpers, and it’s actually only recently (with GPT 5) that it’s been able to get this right. Yay the future? Before this, it would get stuck on trying to define a Loves GADT, which is a dead end as previously discussed.


  1. I’m pretty sure nobody has ever used it for anything useful, but I wrote the entire decidable library around manipulating propositions like this.↩︎

by Justin Le at August 21, 2025 03:36 PM

Philip Wadler

Why are we funding this?

 

In the face of swinging funding cuts in the US, David Samuel Shiffman defends the value of scientific curiosity in American Scientist. Spotted via Boing Boing.

by Philip Wadler (noreply@blogger.com) at August 21, 2025 11:00 AM

August 19, 2025

GHC Developer Blog

GHC 9.14.1-alpha1 is now available

GHC 9.14.1-alpha1 is now available

bgamari - 2025-08-19

The GHC developers are very pleased to announce the availability of the first alpha prerelease of GHC 9.14.1. Binary distributions, source distributions, and documentation are available at downloads.haskell.org.

GHC 9.14 will bring a number of new features and improvements, including:

  • Significant improvements in specialisation:

    • The SPECIALISE pragma now allows use of type application syntax

    • The SPECIALISE pragma can be used to specialise for expression arguments as well as type arguments.

    • Specialisation is now considerably more reliable in the presence of newtypes

    • The specialiser is now able to produce specialisations with polymorphic typeclass constraints, considerably broadening its scope.

  • Significant improvements in the GHCi debugger

  • Record fields can be defined to be non-linear when LinearTypes is enabled.

  • RequiredTypeArguments can now be used in more contexts

  • SSE/AVX support in the x86 native code generator backend

  • A major update of the Windows toolchain

  • … and many more

A full accounting of changes can be found in the release notes. Given the many specialisation improvements and their potential for regression, we would very much appreciate testing and performance characterisation on downstream workloads.

Due to unexpected complications, this initial prerelease comes a bit later than expected. Consequently, we expect to have two condensed alphas prior to the release candidate, in contrast to the scheduled three. We expect the next alpha will come the week of 9 Sept. 2025, while the third will come 23 Sept. 2025, with the release candidate coming 7 Oct. 2025.

We would like to thank the Zw3rk stake pool, Well-Typed, Mercury, Channable, Tweag I/O, Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors whose on-going financial and in-kind support has facilitated GHC maintenance and release management over the years. Finally, this release would not have been possible without the hundreds of open-source contributors whose work comprise this release.

As always, do give this release a try and open a ticket if you see anything amiss.

by ghc-devs at August 19, 2025 12:00 AM

August 18, 2025

Monday Morning Haskell

Binary Tree BFS: Zigzag Order

In our last article, we explored how to perform an in-order traversal of a binary search tree. Today we’ll do one final binary tree problem to solidify our understanding of some common tree patterns, as well as the tricky syntax for dealing with a binary tree in Rust.

If you want some interesting challenge problems using Haskell data structures, you should take our Solve.hs course. In particular, you’ll learn how to write a self-balancing binary tree to use for an ordered set!

The Problem

Today we will solve Zigzag Level Order Traversal. For any binary tree, we can think about it in terms of “levels” based on the number of steps from the root. So given this tree:

     45
    /  \
   32  50
  /  \   \
 5   40   100
    /  \
  37   43

We can visually see that there are 4 levels. So a normal level order traversal would return a list of 4 lists, where each list is a single level, ordered from left to right, visually speaking:

[45]
[32, 50]
[5, 40, 100]
[37, 43]

However, with a zigzag level order traversal, every other level is reversed. So we should get the following result for the input tree:

[45]
[50, 32]
[5, 40, 100]
[43, 37]

So we can imagine that we do the first level from left to right and then zigzag back to get the second level from right to left. Then we do left to right again for the third level, and so on.

The Algorithm

For our in-order traversal, we used a kind of depth-first search (DFS), and this approach is more common for tree-based problems. However, for a level-order problem, we want more of a breadth-first search (BFS). In a BFS, we explore states in order of their distance to the root. Since all nodes in a level have the same distance to the root, this makes sense.

Our general idea is that we’ll store a list of all the nodes from the prior level. Initially, this will just contain the root node. We’ll loop through this list, and create a new list of the values from the nodes in this list. This gets appended to our final result list.

While we’re doing this loop, we’ll also compose the list for the next level. The only trick is knowing whether to add each node’s left or right child to the next-level list first. This flips each iteration, so we’ll need a boolean tracking it that flips each time.

Once we encounter a level that produces no numbers (i.e. it only contains Nil nodes), we can stop iterating and return our list of lists.

Rust Solution

Now that we’re a bit more familiar with manipulating Rc RefCells, we’ll start with the Rust solution, framing it according to the two-loop structure in our algorithm. We’ll define stack1, which is the iteration stack, and stack2, where we accumulate the new nodes for the next layer. We also define our final result vector, a list of lists.

pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
    let mut result: Vec<Vec<i32>> = Vec::new();
    let mut stack1: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
    stack1.push(root.clone());
    let mut stack2: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
    let mut leftToRight = true;

    ...
    return result;
}

Our initial loop will continue until stack1 no longer contains any elements. So our basic condition is while (!stack1.is_empty()). However, there’s another important element here.

After we accumulate the new nodes in stack2, we want to flip the meanings of our two stacks. We want our accumulated nodes referred to by stack1, and stack2 to be an empty list to accumulate. We accomplish this in Rust by clearing stack1 at the end of our loop, and then using std::mem::swap to flip their meanings:

pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
    let mut result: Vec<Vec<i32>> = Vec::new();
    let mut stack1: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
    stack1.push(root.clone());
    let mut stack2: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
    let mut leftToRight = true;

    while (!stack1.is_empty()) {
        let mut thisLayer = Vec::new(); // Values from this level
        ...
        leftToRight = !leftToRight;
        stack1.clear();
        mem::swap(&mut stack1, &mut stack2);
    }
    return result;
}

In C++ we could accomplish something like this using std::move, but only because we want stack2 to return to an empty state after the move.

stack1 = std::move(stack2);

Also, observe that we flip our boolean flag at the end of the iteration.

Now let’s get to work on the inner loop. This will actually go through stack1, add values to thisLayer, and accumulate the next layer of nodes for stack2. An interesting finding is that whether we’re going left to right or vice versa, we want to loop through stack1 in reverse. This means we’re treating it like a true stack instead of a vector, first accessing the last node to be added.

A left-to-right pass will add lefts and then rights. This means the right-most node in the next layer is on “top” of the stack, at the end of the vector. A right-to-left pass will first add the right child for a node before its left. This means the left-most node of the next layer is at the end of the vector.

Let’s frame up this loop, and also add the results of this layer to our final result vector.

pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
    ...

    while (!stack1.is_empty()) {
        let mut thisLayer = Vec::new();
        for node in stack1.iter().rev() {
            ...
        }

        if (!thisLayer.is_empty()) {
            result.push(thisLayer);
        }
        leftToRight = !leftToRight;
        stack1.clear();
        mem::swap(&mut stack1, &mut stack2);
    }
    return result;
}

Note that we do not add the values array if it is empty. We allow ourselves to accumulate None nodes in our stack. The final layer we encounter will actually consist of all None nodes, and we don’t want this layer to add an empty list.

Now all we need to do is populate the inner loop. We only take action if the node from stack1 is Some instead of None. Then we follow a few simple steps:

  1. Borrow the TreeNode from this RefCell
  2. Push its value onto thisLayer.
  3. Add its children (using clone) to stack2, in the right order.

Here’s the code:

pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
    ...

    while (!stack1.is_empty()) {
        let mut thisLayer = Vec::new();
        for node in stack1.iter().rev() {
            if let Some(current) = node {
                let currentTreeNode = current.borrow();
                thisLayer.push(currentTreeNode.val);
                if leftToRight {
                    stack2.push(currentTreeNode.left.clone());
                    stack2.push(currentTreeNode.right.clone());
                } else {
                    stack2.push(currentTreeNode.right.clone());
                    stack2.push(currentTreeNode.left.clone());
                } 
            }
        }

        ...
    }
    return result;
}

And now we’re done! Here’s the full solution:

use std::rc::Rc;
use std::cell::RefCell;
use std::mem;

pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
    let mut result: Vec<Vec<i32>> = Vec::new();
    let mut stack1: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
    stack1.push(root.clone());
    let mut stack2: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();

    let mut leftToRight = true;
    while !stack1.is_empty() {
        let mut thisLayer = Vec::new();
        for node in stack1.iter().rev() {
            if let Some(current) = node {
                let currentTreeNode = current.borrow();
                thisLayer.push(currentTreeNode.val);
                if leftToRight {
                    stack2.push(currentTreeNode.left.clone());
                    stack2.push(currentTreeNode.right.clone());
                } else {
                    stack2.push(currentTreeNode.right.clone());
                    stack2.push(currentTreeNode.left.clone());
                } 
            }
        }
            
        if !thisLayer.is_empty() {
            result.push(thisLayer);
        }
        leftToRight = !leftToRight;
        stack1.clear();
        mem::swap(&mut stack1, &mut stack2);
    }
    return result;
}

Haskell Solution

While our Rust solution was better described from the outside in, it’s easy to build the Haskell solution from the inside out. We have two loops, and we can start by defining the inner loop (we’ll call it the stack loop).

The goal of this loop is to take stack1 and turn it into stack2 (the next layer) and the numbers for this layer, while also tracking the direction of iteration. Both outputs are accumulated as lists, so we have inputs for them as well:

zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = ...
  where
    stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
    stackLoop isLeftToRight stack1 stack2 nums = ...

When stack1 is empty, we return our result from this loop. Because of list accumulation order, we reverse nums when giving the result. However, we don’t reverse stack2, because we want to iterate starting from the “top”. This seems like the opposite of what we did in Rust, because Rust uses a vector for its stack type, instead of a singly linked list!

zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = ...
  where
    stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
    stackLoop _ [] stack2 nums = (stack2, reverse nums)
    stackLoop isLeftToRight (Nil : rest) stack2 nums = stackLoop isLeftToRight rest stack2 nums
    stackLoop isLeftToRight (Node x left right : rest) stack2 nums = ...

Observe also a second edge case: for Nil nodes in stack1, we just recurse on the rest of the list. Now for the main case, we just define the new stack2, which adds the child nodes in the correct order. Then we recurse while also adding x to nums.

zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = ...
  where
    stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
    stackLoop _ [] stack2 nums = (stack2, reverse nums)
    stackLoop isLeftToRight (Nil : rest) stack2 nums = stackLoop isLeftToRight rest stack2 nums
    stackLoop isLeftToRight (Node x left right : rest) stack2 nums =
      let stack2' = if isLeftToRight then right : left : stack2 else left : right : stack2
      in  stackLoop isLeftToRight rest stack2' (x : nums)

    ...

Now we’ll define the outer loop, which we’ll call the layerLoop. This takes the direction flag and stack1, plus the accumulator list for the results. It also has a simple base case to reverse the results list once stack1 is empty.

zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = layerLoop True [root] []
  where
    stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
    stackLoop = ...

    layerLoop :: Bool -> [TreeNode] -> [[Int]] -> [[Int]]
    layerLoop _ [] allNums = reverse allNums
    layerLoop isLeftToRight stack1 allNums = ...

Now in the recursive case, we call the stackLoop to get our new numbers and the stack for the next layer (which we now think of as our new stack1). We then recurse, flipping the boolean flag and adding these new numbers to our results, but only if the list is not empty.

zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = layerLoop True [root] []
  where
    stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
    stackLoop = ...

    layerLoop :: Bool -> [TreeNode] -> [[Int]] -> [[Int]]
    layerLoop _ [] allNums = reverse allNums
    layerLoop isLeftToRight stack1 allNums =
      let (stack1', newNums) = stackLoop isLeftToRight stack1 [] []
      in  layerLoop (not isLeftToRight) stack1' (if null newNums then allNums else newNums : allNums)

The last step, as you’ve seen, is calling layerLoop from the start with root. We’re done! Here’s our final implementation:

zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = layerLoop True [root] []
  where
    stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
    stackLoop _ [] stack2 nums = (stack2, reverse nums)
    stackLoop isLeftToRight (Nil : rest) stack2 nums = stackLoop isLeftToRight rest stack2 nums
    stackLoop isLeftToRight (Node x left right : rest) stack2 nums =
      let stack2' = if isLeftToRight then right : left : stack2 else left : right : stack2
      in  stackLoop isLeftToRight rest stack2' (x : nums)
    
    layerLoop :: Bool -> [TreeNode] -> [[Int]] -> [[Int]]
    layerLoop _ [] allNums = reverse allNums
    layerLoop isLeftToRight stack1 allNums =
      let (stack1', newNums) = stackLoop isLeftToRight stack1 [] []
      in  layerLoop (not isLeftToRight) stack1' (if null newNums then allNums else newNums : allNums)
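
As a quick sanity check, here’s the traversal applied to a small complete tree. This assumes the TreeNode definition from earlier in this series, something like data TreeNode = Nil | Node Int TreeNode TreeNode:

example :: TreeNode
example =
  Node 1
    (Node 2 (Node 4 Nil Nil) (Node 5 Nil Nil))
    (Node 3 (Node 6 Nil Nil) (Node 7 Nil Nil))

-- >>> zigzagOrderTraversal example
-- [[1],[3,2],[4,5,6,7]]

The first layer reads left-to-right, the second right-to-left, and the third flips back to left-to-right, as desired.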

Conclusion

That’s all we’ll do for binary trees right now. In the coming articles we’ll continue to explore more data structures as well as some common algorithms. If you want to learn more about data structures and algorithms in Haskell, check out our course Solve.hs. Modules 2 & 3 are filled with this sort of content, including lots of practice problems.

by James Bowen at August 18, 2025 08:30 AM

GHC Developer Blog

GHC 9.10.3-rc4 is now available

wz1000 - 2025-08-18

The GHC developers are very pleased to announce the availability of the fourth release candidate for GHC 9.10.3. Binary distributions, source distributions, and documentation are available at downloads.haskell.org and via GHCup.

GHC 9.10.3 is a bug-fix release fixing over 50 issues of a variety of severities and scopes. A full accounting of these fixes can be found in the release notes. As always, GHC’s release status, including planned future releases, can be found on the GHC Wiki status page.

The changes from the first release candidate are:

  • A fix for a rare segfault with code involving STM (#26205)
  • A fix for naturalAndNot returning bogus results (#26205)
  • A fix for a crash in the renamer (#25056)

This release candidate will have a two-week testing period. If all goes well the final release will be available the week of 1 September 2025.

We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable, Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors whose on-going financial and in-kind support has facilitated GHC maintenance and release management over the years. Finally, this release would not have been possible without the hundreds of open-source contributors whose work comprise this release.

As always, do give this release a try and open a ticket if you see anything amiss.

by ghc-devs at August 18, 2025 12:00 AM

August 14, 2025

Gabriella Gonzalez

Type inference for plain data

Type inference for plain data using Monoids

The context behind this post is that my partner asked me how to implement type inference for plain data structures (e.g. JSON or YAML), which was awfully convenient because this is something I’ve done a couple of times already, and there is a pretty elegant trick for this that I wanted to share.

Now, normally type inference and unification are a bit tricky to implement in a programming language with functions, but they’re actually fairly simple to implement if all you have to work with is plain data. To illustrate this, I’ll implement and walk through a simple type inference algorithm for JSON-like expressions.

For this post I’ll use the Value type from Haskell’s aeson package, which represents a JSON value1:

data Value
    = Object (KeyMap Value)  -- { "key₀": value₀, "key₁": value₁, … }
    | Array (Vector Value)   -- [ element₀, element₁, … ]
    | String Text            -- e.g. "example string"
    | Number Scientific      -- e.g. 42.0
    | Bool Bool              -- true or false
    | Null                   -- null

I’ll also introduce a Type datatype to represent the type of a JSON value, which is partially inspired by TypeScript:

import Data.Aeson.KeyMap (KeyMap)

data Type
    = ObjectType (KeyMap Type)  -- { "key₀": type₀, "key₁": type₁, … }
    | ArrayType Type            -- type[]
    | StringType                -- string
    | NumberType                -- number
    | BoolType                  -- boolean
    | Optional Type             -- null | type
    | Never                     -- never, the subtype of all other types
    | Any                       -- any, the supertype of all other types
    deriving (Show)

… and the goal is that we want to implement an infer function that has this type:

import Data.Aeson (Value(..))

infer :: Value -> Type

I want to walk through a few test cases before diving into the implementation, otherwise it might not be clear what the Type constructors are supposed to represent:

>>> -- I'll use the usual `x : T` syntax to denote "`x` has type `T`"
>>> -- I'll also use TypeScript notation for the types

>>> -- "example string" : string
>>> infer (String "example string")
StringType

>>> -- true : boolean
>>> infer (Bool True)
BoolType

>>> -- false : boolean
>>> infer (Bool False)
BoolType

>>> -- 42 : number
>>> infer (Number 42)
NumberType

>>> -- [ 2, 3, 5 ] : number[]
>>> infer (Array [Number 2, Number 3, Number 5])
ArrayType NumberType

>>> -- [ 2, "hello" ] : any[]
>>> -- To keep things simple, we'll differ from TypeScript and not infer
>>> -- a type like (number | string)[].  That's an exercise for the reader.
>>> infer (Array [Number 2, String "hello"])
ArrayType Any

>>> -- [] : never[]
>>> infer (Array [])
ArrayType Never

>>> -- { "key₀": true, "key₁": 42 } : { "key₀": bool, "key₁": number }
>>> infer (Object [("key₀", Bool True), ("key₁", Number 42)])
ObjectType [("key₀", BoolType), ("key₁", NumberType)]

>>> -- [{ "key₀": true }, { "key₁": 42 }] : { "key₀": null | bool, "key₁": null | bool }[]
>>> infer (Array [Object [("key₀", Bool True)], Object [("key₁", Number 42)]]) 
ArrayType (ObjectType (fromList [("key₀",Optional BoolType),("key₀",Optional NumberType)]))

>>> -- null : null | never
>>> infer Null
Optional Never

>>> -- [ null, true ] : (null | boolean)[]
>>> infer (Array [Null, Bool True])
ArrayType (Optional BoolType)

Some of those test cases correspond almost 1-to-1 with the implementation of infer, which we can begin to implement:

infer :: Value -> Type
infer (String _) = StringType
infer (Bool _) = BoolType
infer (Number _) = NumberType
infer Null = Optional Never

The main two non-trivial cases are the implementation of infer for Objects and Arrays.

We’ll start with Objects since that’s the easier case to infer. To infer the type of an object we infer the type of each field and then collect those field types into the final object type:

infer (Object fields) = ObjectType (fmap infer fields)

The last tricky bit to implement is the case for Arrays. We might start with something like this:

infer (Array elements) = ArrayType ???

… but what goes in the result? This is NOT correct:

infer (Array elements) = ArrayType (fmap infer elements)

… because there can only be a single element type for the whole array. We can infer the type of each element, but if those element types don’t match then we need some way to unify those element types into a single element type representing the entire array. In other words, we need a function with this type:

unify :: Vector Type -> Type

… because if we had such a function then we could write:

infer (Array elements) = ArrayType (unify (fmap infer elements))

The trick to doing this is that we need to implement a Monoid instance and Semigroup instance for Type, which is the same as saying that we need to define two functions:

-- The default type `unify` returns if our list is empty
mempty :: Type

-- Unify two types into one
(<>) :: Type -> Type -> Type

… because if we implement those two functions then our unify function becomes … fold!

import Data.Foldable (fold)
import Data.Vector (Vector)

unify :: Vector Type -> Type
unify = fold

The documentation for fold explains how it works:

Given a structure with elements whose type is a Monoid, combine them via the monoid’s (<>) operator.
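
So, for example, once we define the instances below, we’d expect the following (these mirror the earlier test cases, and are just an illustration rather than part of the final program):

>>> unify (Data.Vector.fromList [NumberType, Optional Never])
Optional NumberType

>>> unify Data.Vector.empty
Never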

Laws

There are a few rules we need to be aware of when implementing mempty and (<>) which will help ensure that our implementation of unification is well-behaved.

First, mempty and (<>) must obey the “Monoid laws”, which require that:

-- Left identity
mempty <> x = x

-- Right identity
x <> mempty = x

-- Associativity
x <> (y <> z) = (x <> y) <> z

Second, mempty and (<>) must additionally obey the following unification laws:

  • mempty is a subtype of x, for all x
  • x <> y is a supertype of both x and y

Unification

mempty is easy to implement since according to the unification laws mempty must be the universal subtype, which is the Never type:

instance Monoid Type where
    mempty = Never

(<>) is the more interesting function to implement, and we’ll start with the easy cases:

instance Semigroup Type where
    StringType <> StringType = StringType
    NumberType <> NumberType = NumberType
    BoolType <> BoolType = BoolType

If we unify any scalar type with itself, we get back the same type. That’s pretty self-explanatory.

The next two cases are also pretty simple:

    Never <> other = other
    other <> Never = other

If we unify the Never type with any other type, then we get the other type because Never is a subtype of every other type.

The next case is slightly more interesting:

    ArrayType left <> ArrayType right = ArrayType (left <> right)

If we unify two array types, then we unify their element types. But what about Optional types?

    Optional left <> Optional right = Optional (left <> right)

    Optional left <> right = Optional (left <> right)
    left <> Optional right = Optional (left <> right)

If we unify two Optional types, then we unify their element types, and we also handle the cases where only one of the two types is Optional.

The last complex data type is objects, which has the most interesting implementation:

    ObjectType left <> ObjectType right =
        ObjectType (KeyMap.alignWith adapt left right)
      where
        adapt (This (Optional a)) = Optional a
        adapt (That (Optional b)) = Optional b
        adapt (This a) = Optional a
        adapt (That b) = Optional b
        adapt (These a b) = a <> b

You can read that as saying “to unify two objects, unify the types of their respective fields, and if either object has an extra field not present in the other object then wrap the field’s type in Optional”.

Finally, we have the case of last resort:

    _ <> _ = Any

If we try to unify two types that could not be unified via the previous rules, then we fall back to Any (the supertype of all other types).

This gives us our final program (which I’ll include in its entirety here):

import Data.Aeson (Value(..))
import Data.Aeson.KeyMap (KeyMap)
import Data.Foldable (fold)
import Data.These (These(..))
import Data.Vector (Vector)

import qualified Data.Aeson.KeyMap as KeyMap

data Type
    = ObjectType (KeyMap Type)  -- { "key₀": type₀, "key₁": type₁, … }
    | ArrayType Type            -- type[]
    | StringType                -- string
    | NumberType                -- number
    | BoolType                  -- boolean
    | Optional Type             -- null | type
    | Never                     -- never, the subtype of all other types
    | Any                       -- any, the supertype of all other types
    deriving (Show)

infer :: Value -> Type
infer (String _) = StringType
infer (Bool _) = BoolType
infer (Number _) = NumberType
infer Null = Optional Never
infer (Object fields) = ObjectType (fmap infer fields)
infer (Array elements) = ArrayType (unify (fmap infer elements))

unify :: Vector Type -> Type
unify = fold

instance Monoid Type where
    mempty = Never

instance Semigroup Type where
    StringType <> StringType = StringType
    NumberType <> NumberType = NumberType
    BoolType <> BoolType = BoolType

    Never <> other = other
    other <> Never = other

    ArrayType left <> ArrayType right = ArrayType (left <> right)

    Optional left <> Optional right = Optional (left <> right)

    Optional left <> right = Optional (left <> right)
    left <> Optional right = Optional (left <> right)

    ObjectType left <> ObjectType right =
        ObjectType (KeyMap.alignWith adapt left right)
      where
        adapt (This (Optional a)) = Optional a
        adapt (That (Optional b)) = Optional b
        adapt (This a) = Optional a
        adapt (That b) = Optional b
        adapt (These a b) = a <> b

    _ <> _ = Any

Pretty simple! That’s the complete implementation of type inference and unification.

Unification laws

I mentioned that our implementation should satisfy the Monoid laws and unification laws, so I’ll include some quick proof sketches (albeit not full formal proofs), starting with the unification laws.

Let’s start with the first unification law:

  • mempty is a subtype of x, for all x

This is true because we define mempty = Never and Never is the subtype of all other types.

Next, let’s show that the implementation of (<>) satisfies the other unification law:

  • x <> y is a supertype of both x and y

The first case is:

    StringType <> StringType = StringType

This satisfies the unification law because if we replace both x and y with StringType we get:

  • StringType <> StringType is a supertype of both StringType and StringType

… and since StringType <> StringType = StringType that simplifies down to:

  • StringType is a supertype of both StringType and StringType

… and every type is a supertype of itself, so this satisfies the unification law.

We’d prove the unification law for the next two cases in the exact same way (just replacing StringType with NumberType or BoolType):

    NumberType <> NumberType = NumberType
    BoolType <> BoolType = BoolType

What about the next case:

    Never <> other = other

Well, if we take our unification law and replace x with Never and replace y with other we get:

  • Never <> other is a supertype of Never and other

… and since Never <> other = other that simplifies to:

  • other is a supertype of Never and other

… which is true because:

  • other is a supertype of Never (because Never is the universal subtype)
  • other is a supertype of other (because every type is a supertype of itself)

We’d prove the next case in the exact same way (just swapping Never and other):

    other <> Never = other

For the next case:

    ArrayType left <> ArrayType right = ArrayType (left <> right)

The unification law becomes:

  • ArrayType (left <> right) is a supertype of both ArrayType left and ArrayType right

… which is true because ArrayType is covariant and by induction left <> right is a supertype of both left and right.

We’d prove the first case for Optional in the exact same way (just replace Array with Optional):

    Optional left <> Optional right = Optional (left <> right)

The next case for Optional is more interesting:

    Optional left <> right = Optional (left <> right)

Here the unification law would be:

  • Optional (left <> right) is a supertype of Optional left and right

… which is true because:

  • Optional (left <> right) is a supertype of Optional left

    This is true because Optional is covariant and left <> right is a supertype of left

  • Optional (left <> right) is a supertype of right

    This is true because:

    • Optional (left <> right) is a supertype of Optional right
    • Optional right is a supertype of right
    • Therefore, by transitivity, Optional (left <> right) is a supertype of right

We’d prove the next case in the same way, just switching left and right:

    left <> Optional right = Optional (left <> right)

The case for objects is the most interesting case:

    ObjectType left <> ObjectType right =
        ObjectType (KeyMap.alignWith adapt left right)
      where
        adapt (This (Optional a)) = Optional a
        adapt (That (Optional b)) = Optional b
        adapt (This a) = Optional a
        adapt (That b) = Optional b
        adapt (These a b) = a <> b

I won’t prove this case as formally, but the basic idea is that this is true because a record type (A) is a supertype of another record type (B) if and only if:

  • for each field k they share in common, A.k is a supertype of B.k
  • for each field k present only in A, A.k is a supertype of Optional Never
  • there are no fields present only in B

… and given that definition of record subtyping then the above implementation satisfies the unification law.

Monoid laws

The first two Monoid laws are trivial to prove:

mempty <> x = x

x <> mempty = x

… because we defined:

    mempty = Never

… and if we replace mempty with Never in those laws:

Never <> x = x
x <> Never = x

… that is literally what our code defines (except replacing x with other):

    Never <> other = other
    other <> Never = other

The last law, associativity, is pretty tedious to prove in full:

(x <> y) <> z = x <> (y <> z)

… but I’ll do a few cases to show the basic gist of how the proof works.

First, the associativity law is easy to prove for the case where any of x, y, or z is Never. For example, if x = Never, then we get:

(Never <> y) <> z = Never <> (y <> z)

-- Never <> other = other
y <> z = y <> z

… which is true. The other two cases for y = Never and z = Never are equally simple to prove.

Associativity is also easy to prove when any of x, y, or z is Any. For example, if x = Any, then we get:

(Any <> y) <> z = Any <> (y <> z)

-- Any <> other = Any
Any <> z = Any

-- Any <> other = Any
Any = Any

… which is true. The other two cases for y = Any and z = Any are equally simple to prove.

Now we can prove associativity if any of x, y or z is StringType. The reason why is that these are the only relevant cases in the implementation of unification for StringType:

StringType <> StringType = StringType

StringType <> Never = StringType
Never <> StringType = StringType

StringType <> _ = Any
_ <> StringType = Any

… but we already proved associativity for all cases involving a Never, so we don’t need to consider the second case, which simplifies things down to:

StringType <> StringType = StringType

StringType <> _ = Any
_ <> StringType = Any

That means that there are only seven cases we need to consider to prove the associativity law if at least one of x, y, and z is StringType (using _ below to denote “any type other than StringType”):

-- true: both sides evaluate to StringType
(StringType <> StringType) <> StringType = StringType <> (StringType <> StringType)

-- all other cases below are also true: they all evaluate to `Any`
(StringType <> StringType) <> _          = StringType <> (StringType <> _         )
(StringType <> _         ) <> StringType = StringType <> (_          <> StringType)
(StringType <> _         ) <> _          = StringType <> (_          <> _         )
(_          <> StringType) <> StringType = _          <> (StringType <> StringType)
(_          <> StringType) <> _          = _          <> (StringType <> _         )
(_          <> _         ) <> StringType = _          <> (_          <> StringType)

We can similarly prove associativity for all cases involving at least one NumberType or BoolType.

The proof for ArrayType is almost the same as the proof for StringType/NumberType/BoolType. The only relevant cases are:

ArrayType left <> ArrayType right = ArrayType (left <> right)

ArrayType left <> Never = ArrayType left
Never <> ArrayType right = ArrayType right

ArrayType left <> _ = Any
_ <> ArrayType right = Any

Just like before, we can ignore the case where either argument is Never because we already proved associativity for that. That just leaves:

ArrayType left <> ArrayType right = ArrayType (left <> right)

ArrayType left <> _ = Any
_ <> ArrayType right = Any

Just like before, there are only seven cases we have to prove (using _ below to denote “any type other than ArrayType”):

ArrayType x <> (ArrayType y <> ArrayType z) = (ArrayType x <> ArrayType y) <> ArrayType z
-- … simplifies to:
ArrayType (x <> (y <> z)) = ArrayType ((x <> y) <> z)
-- … which is true because unification of the element types is associative

-- all other cases below are also true: they all evaluate to `Any`
(ArrayType x <> ArrayType y) <> _           = ArrayType x <> (ArrayType y <> _          )
(ArrayType x <> _          ) <> ArrayType z = ArrayType x <> (_           <> ArrayType z)
(ArrayType x <> _          ) <> _           = ArrayType x <> (_           <> _          )
(_           <> ArrayType y) <> ArrayType z = _           <> (ArrayType y <> ArrayType z)
(_           <> ArrayType y) <> _           = _           <> (ArrayType y <> _          )
(_           <> _          ) <> ArrayType z = _           <> (_           <> ArrayType z)

The proofs for the Optional and Object cases are longer and more laborious so I’ll omit them. They’re an exercise for the reader because I am LAZY.
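
If you don’t feel like proving them either, you can at least spot-check the laws with QuickCheck. Here’s a minimal sketch, under a few assumptions that are not part of the original program: Type additionally derives Eq, ObjectType generation is omitted for brevity, and we check structural equality, which is stricter than semantic type equivalence:

import Test.QuickCheck

-- NOTE: assumes Type also derives Eq (not in the original program).
-- Generate small Types; ObjectType is omitted to keep the sketch short.
instance Arbitrary Type where
    arbitrary = sized genType
      where
        leaf = elements [StringType, NumberType, BoolType, Never, Any]
        genType 0 = leaf
        genType n = oneof
            [ leaf
            , ArrayType <$> genType (n `div` 2)
            , Optional <$> genType (n `div` 2)
            ]

prop_leftIdentity :: Type -> Bool
prop_leftIdentity x = mempty <> x == x

prop_rightIdentity :: Type -> Bool
prop_rightIdentity x = x <> mempty == x

prop_associativity :: Type -> Type -> Type -> Bool
prop_associativity x y z = (x <> y) <> z == x <> (y <> z)

Running quickCheck prop_associativity over this fragment is a cheap way to build confidence before (or instead of) writing out the remaining cases by hand.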


  1. I’ve inlined all the type synonyms and removed strictness annotations, for clarity

by Gabriella Gonzalez (noreply@blogger.com) at August 14, 2025 03:58 AM

Edward Z. Yang

State of torch.compile for training (August 2025)

The purpose of this post is to sum up, in one place, the state of torch.compile for training as of August 2025. Nothing in here will be new if you have been following along elsewhere on the Internet, but we rarely put everything together in one place. The target audience for this document is teams who are evaluating the use of torch.compile for large scale training runs.

First, the basics. torch.compile (also known as PT2) is a compiler for PyTorch eager programs for both inference and training workloads. Speedups from 1.5-2x compared to eager code are typical, and torch.compile also makes it possible to do global optimizations for memory (e.g., automatic activation checkpointing) and distributed communications (e.g., async tensor parallelism).

What is torch.compile's functionality?

The headline functionality of torch.compile is a decorator you can attach to a function to compile it:

@torch.compile()
def f(x, y):
    ...

Here are some non-functional properties of compile which are important to know:

  • Just-in-time compilation. We don't actually compile the function until it is called for the first time, and execution blocks until compilation completes. There is both local and remote caching to skip compilation cost when you rerun the model. (Ahead-of-time compilation is possible for inference with AOTInductor, and is being worked on for training.)
  • Compositional with Eager. PyTorch's original success comes from the extreme hackability of eager mode, and torch.compile seeks to preserve this. The function can be as big or as small part of your training loop as you like; compiled functions compose with autograd, DDP, FSDP and other PyTorch subsystems. (This composition is sometimes imperfect, e.g., in the case of double backwards (not supported), tensor subclasses (requires specific support from the subclass), autograd (differentiating with respect to intermediates returned from a compiled region does not work).) If compilation doesn't work on a region, you can disable it entirely with torch.compiler.disable() and fall back to eager.
  • Gradient updates are delayed to the end of compiled regions. This arises because PyTorch eager autograd does not support streaming gradients incrementally from a large backward node. (This can be solved by using compiled autograd, but this requires that the entirety of your backwards be compileable.)
  • Graphs may be recompiled. We aggressively specialize on all non-Tensor arguments/globals used in the function to ensure we always generate straight-line computation graphs with no control flow. If those arguments/globals change we will recompile the graph. (Recompilations can be banned with torch._dynamo.config.error_on_recompile = True.)
  • Static by default, recompile to dynamic shapes. We aggressively specialize all sizes to static. However, if we discover that a size varies over time, on the first recompile we will attempt to generate a single compiled region that handles dynamic shapes. We are not guaranteed to be able to compile a model with dynamic shapes. (You can use mark_dynamic to force an input shape to be dynamic, and you can use mark_unbacked to error if we specialize.)
  • Graph breaks transparently bypass non-capturable code. By default, if the compiler encounters a line of code that it is not able to handle, it will trigger a graph break, disabling compilation for that line of code, but still attempting to compile regions before and after it. (This behavior can be banned with fullgraph=True.)
  • Function calls are inlined and loops are unrolled by default. If you have many copies of a Transformer block in your model, your compile time will scale with the number of Transformer blocks. (You can reduce compile time by doing "regional compilation", where you only compile the Transformer block instead of compiling the entire model.)
  • NOT bitwise equivalent with eager PyTorch. The biggest divergence with eager PyTorch is that when float16/bfloat16 operations are fused together, we do not insert redundant down/up-conversions. (This can be disabled with torch._inductor.config.emulate_precision_casts = True; you can also rewrite eager code to perform operations in higher precision with the understanding that torch.compile will optimize it. XLA has a similar config xla_allow_excess_precision which JAX enables by default.) However, we may also make decisions to swap out, e.g., matmul implementations, and there may also be slight divergences that arise from differences in reduction ordering that are unavoidable when compilation occurs. We support ablating the graph capture frontend separately from the compiler backend to help diagnose these kinds of problems.
  • Distributed collectives and DTensor can be compiled, but are unoptimized by default. We are able to capture c10d collectives and also programs that handle DTensors, but we don't apply optimizations to collectives by default. (There are experimental optimizations that can be enabled, but this is active work in progress.) We generally do not expect to be able to trace through highly optimized distributed framework code.

State of advanced parallelism

For large scale training runs, torch.compile faces stiff competition from (1) PyTorch native distributed frameworks which embrace eager mode and implement all optimizations by hand (e.g., megatron), (2) custom "compiler" stacks which reuse our tracing mechanisms (e.g., symbolic_trace and make_fx) but implement their desired passes by hand, (3) JAX, which has always been XLA first and is years ahead in compile-driven parallelism techniques.

Here is where we currently are for advanced parallelism (with an emphasis on comparing with JAX):

  • DTensor, a "global tensor" abstraction for representing sharded tensors. DTensor is a tensor subclass which allows us to represent tensors which are sharded over an SPMD device mesh. The shape of a DTensor reflects the global shape of the original full tensor, but it only stores locally a shard of the data according to the placement. Here are some important details:
    • Shard placements. Unlike JAX placements, DTensor placements are "device mesh" oriented; that is to say, you conventionally specify a device mesh dim size list of placements, and Shard(i) indicates that the ith dimension of a tensor is sharded. This is opposite of JAX, which is "tensor" oriented. For example, given a 2-D mesh ["dp", "tp"], a tensor with [Replicate, Shard(0)] in DTensor placement (or {"dp": Replicate, "tp": Shard(0)} with named device mesh axes), would correspond to a JAX placement of P("tp", None). The reason for this is that DTensor supports a Partial placement, which indicates that an axis on the device mesh has a pending reduction. Partial shows up ubiquitously from matrix multiplies, and it isn't associated with any particular tensor axis, making it more convenient to represent in a device-mesh oriented formulation. The tradeoff is that device-mesh oriented placements don't naively support specifying sharding ordering, e.g., suppose I want to shard a 1-D tensor on tp and then dp, in JAX I'd represent this as P(("tp", "dp"),) but this order cannot be disambiguated from [Shard(0), Shard(0)] and in fact DTensor always forces left-to-right sharding. There is currently a proposal to extend our sharding specification to support ordering to bring us to parity with JAX expressiveness, but it is not yet implemented.
    • Autograd. DTensor is directly differentiable; we run autograd on programs that have DTensors (as opposed to desugaring a DTensor program to one with regular Tensors and differentiating it). This ensures that the sharding strategy of a primal and its corresponding tangent can diverge. This is parity with JAX.
    • Python subclass of Tensor. Unlike JAX, DTensor is a separate subclass from Tensor. However, Tensor and DTensor interoperate fine; a Tensor can simply be thought of as a DTensor that is replicated on all dimensions. DTensor is implemented in Python, which makes it easy to modify and debug but imposes quite a bit of overhead (for example, FSDP2 does not directly accumulate gradients into DTensor, because with thousands of parameters, performing detach and add operations on DTensor is a bottleneck). Still, despite this overhead, DTensor was designed for good eager performance, and extensively caches the results of sharding propagation so that in the fastpath, it only needs to lookup what redistribute it should perform and then directly dispatches to the local eager operation. However, this caching strategy means that overhead can be quite high for workloads with dynamic shapes, as the cache requires exact matches of all input shapes.
    • Compilation. DTensor is compilable by torch.compile, and doing so will desugar it into its underlying collectives and eliminate any eager mode DTensor overhead (even if you do not perform any other optimizations.) However, DTensor with dynamic shapes in compile is not well supported, see http://github.com/pytorch/pytorch/issues/159635 (we don't think this is currently critical path for any critical use cases, so a relatively junior engineer has been chipping away at it.)
    • Greedy propagation. Because DTensor must work in eager mode, it only implements greedy shard propagation, where for every eager operation we greedily pick whatever output shard minimizes the collective costs of an operation. It is work in progress to support backward propagation of sharding with the assistance of a compiler-like framework.
    • Operator coverage. DTensor requires sharding propagation rules to work for operations. If a sharding propagation rule is not implemented, DTensor will fail rather than trigger an inefficient allgather to run the operator under replication. We don't currently have full coverage of all operators, but important operators for transformer models like llama3 are all covered (sharding rules are defined here). You can write custom shardings for user defined operators.
    • Jagged sharding. We do not support a "jagged sharding" concept which would be necessary for expert parallelism with imbalanced routing. However, we believe that our existing sharding rules could largely be reused to support such an idea. As dynamism would only be exposed in the local tensor for the jagged shard, jagged shards don't suffer from the dynamic shapes problems mentioned in the compilation section.
    • Ecosystem. We are committed to DTensor as the standard representation for sharded tensors, and DTensor is integrated with checkpointing, FSDP2, SimpleFSDP, AutoParallel, torchtitan, among others.
  • Functional collectives. If you don't like DTensor, we also support "functional collectives", which are non-mutating versions of collective operations that can be used to manually implement SPMD operations in a compiler-friendly way without needing DTensor. (In fact, if you use traditional collective APIs and compile them, we will silently translate them into functional collectives for compiler passes.) When compiled, functional collectives don't necessarily force allocation of the output buffer as they can be re-inplaced. Importantly, functional collectives currently do NOT support autograd, see https://discuss.pytorch.org/t/supporting-autograd-for-collectives/219430

  • Graph capture. There are two particularly popular graph capture mechanisms which people have used to perform distributed optimizations separate from model code. All graph capture mechanisms produce FX graphs, which are a simple Python basic block IR representation with no control flow, which is entirely unopinionated about what actual operator set can occur in the graph.
    • Symbolic_trace. This was the original graph capture mechanism and is quite popular, despite its limitations. It is implemented entirely with Python operator overloading and will give you exactly whatever operations are overloadable in the graph. We consider this largely a legacy pipeline as you are unable to trace code involving conditionals on shapes and you end up with a graph that has no useful metadata about the shapes/dtypes of intermediate values. For example, PiPPY, a legacy stack for performing pipeline parallelism, was built on top of symbolic_trace graph capture.
    • make_fx/torch.export. This graph capture mechanism works by actually sending (fake) tensors through your program and recording ATen operators. There are a number of different variants: e.g., whether or not it is a Python tracing approach ala JAX jit, or whether it uses sophisticated bytecode analysis ala Dynamo; similarly, there are various levels of IR you can extract (pre-dispatch, post-dispatch; also, operators can be decomposed or kept as single units). Our compiler parallelism efforts are built on top of this capture mechanism, but there is nothing stopping you per se from writing your own graph pass on top of this IR. In practice, this can be difficult without PyTorch expertise, because (1) integrating a traced graph into PyTorch's autograd system so it can interoperate with other code is quite complicated to do in full generality, (2) the exact operator sets you get at various phases of compilation are undocumented and in practice very tied to the Inductor lowering stack, and it is poorly documented how to prevent operators from getting decomposed before your pass gets to them.
  • Not SPMD compiler by default. torch.compile does not assume the program being compiled is SPMD by default, which means it will not do things like drop unused collectives (you can change this behavior with a config flag). Additionally, the default mode of use for torch.compile is to compile in parallel on all nodes, which means care has to be taken to ensure that every instance of the compiler compiles identically (only one rank recompiling, or compilers making different decisions, can lead to NCCL timeout). We ultimately think that we should compile a program once and send it to all nodes, but as this is not currently implemented, the general approach people have taken to solve this problem is to either (1) eliminate all sources of divergent behavior from ranks, e.g., don't allow the compiler to look at the actual size for dynamic inputs when making compiler decisions, or (2) introduce extra collectives to the compiler to communicate decisions that must be made consistently across all ranks.

Our vision for the future of advanced parallelism, spearheaded by the in-progress SimpleFSDP and AutoParallel, is that users should write single-node programs that express mathematically what they want to do. These are then transformed into efficient distributed programs in two steps: (1) first, collectives are inserted into the graph in a naive way (i.e., simply to express what the sharding of all intermediates should be), and (2) the collectives are optimized to handle scheduling concerns such as pre-fetching and bucketing. AutoParallel sets a GSPMD style goal of automatically determining a good enough sharding for a program--it should be able to rediscover data parallel, tensor parallel, even expert parallel(!)--but SimpleFSDP sets a smaller goal of just inserting collectives in the pattern that FSDP would mandate, and then writing FSDP-specific optimization passes for recovering FSDP2's performance. It is very common to write domain specific optimizations; for example, async tensor parallelism is also implemented as a pass that detects TP patterns and rewrites them to async TP operations. Unlike JAX, which started with a very generic solver and has needed to add more manual escape hatches over time, PyTorch has started with writing all of the distributed patterns exactly by hand, and we are only recently adding more automatic mechanisms as an alternative to doing everything by hand.

State of optimization

torch.compile performs many optimizations, but here are some particularly important ones to know about:

  • Inductor. Inductor is our backend for torch.compile that generates Triton kernels for PyTorch programs. It has very good coverage of PyTorch's operator set and can do fusions of pointwise and reductions, including in the patterns that typically occur for backwards. It also is able to fuse pointwise operations into matmuls and autotune different matmul backends (including cuBlas, cutlass and Triton) to select the best one for any given size. When people talk about torch.compile speeding up their programs, they are conventionally talking about Inductor; however, you don't have to use torch.compile with Inductor; for example, you could run with AOTAutograd only and skip Inductor compilation.
  • CUDA graphs. Inductor builds in support for CUDA graphing models. Compared to manual CUDA graphs application, we can give better soundness guarantees (e.g., against forgetting to copy in all input buffers, or CPU compute inside the CUDA graph region). torch.compile CUDA graphs is typically used with Inductor but we also offer an eager-only cudagraphs integration (that is less well exercised).
  • Automatic activation checkpointing. With torch.compile, we can globally optimize the memory-compute tradeoff, much better than the activation checkpointing APIs that eager PyTorch supports (and require the user to manually feed in what they want checkpointed or not). However, some folks have reported that it can be quite miserable tuning the hyperparameter for AC; we have also found bugs in it.
  • FP8 optimizations. One big success story for traditional compilation was adding support for a custom FP8 flavor. With torch.compile, they didn't have to write manual kernels for their variant. This has since been upstreamed to torchao.
  • Flex attention. Flex attention usage continues to grow, with 632 downstream repo users in OSS (vs 125 in Jan '25). It has been used to enable chunked attention, document masking and context parallelism in llama family models. It is a really good research tool, although sometimes people complain about slight numerical differences.
  • Helion. Helion is an actively developed project aiming to go beta in October this year which offers a higher level interface for programming Triton kernels that looks just like writing PyTorch eager code. It relies heavily on autotuning to explore the space of possible structural choices of kernels to find the best one. It is not production ready but it is worth knowing that it is coming soon.

State of compile time

torch.compile is a just-in-time compiler and as such, in its default configuration, compilation will occur on your GPU cluster (preventing you from using the GPUs to do other useful work!) In general, most pathological compile times arise from repeated recompilation (often due to dynamic shapes, but sometimes not). In Transformer models, compile time can also be improved by only compiling the Transformer block (which can then be compiled only once, instead of having to be compiled N times for each Transformer block in the model).

We don't think caching is an ideal long-term solution for large scale training runs, and we have been working on precompile to solve the gap here. Precompile simply means having compilation be an ahead-of-time process which produces a binary which you can directly run from your training script to get the compiled model. The compilation products are built on top of our ABI stable interface (developed for AOTInductor) which allows the same binaries to target multiple PyTorch versions, even though PyTorch the library does not offer ABI compatibility from version to version.

How do I get started?

The most typical pattern we see for people who want to make use of torch.compile for large-scale training is to fork torchtitan and use this codebase as the basis for your training stack. torchtitan showcases PyTorch native functionality, including torch.compile--in effect, it shows you how to use features in PyTorch together in a way that lets you do large-scale training. From there, swap out the components you are opinionated about and keep the things you don't care about.

by Edward Z. Yang at August 14, 2025 02:33 AM

Tweag I/O

Performance Testing, Part 1: The Road to Continuous Performance Testing

The performance of a system is critical for the user experience. Whether it’s a website, mobile app, or service, users demand fast response times and seamless functionality. Performance testing is a non-functional testing technique that evaluates the speed, responsiveness, and stability of a system under different workloads for different purposes. The primary goal of performance testing is to identify and eliminate performance bottlenecks to ensure that the system meets the expected performance criteria. It is crucial for understanding the performance of the system under various conditions and ensuring that it can handle real-world usage scenarios effectively. From my experience, performance testing is usually underestimated and overlooked, as it is generally only run after big feature releases, architectural changes, or when preparing for promotional events. In this post, I want to explain the foundations of performance testing for the wider engineering community. In a future post, I’ll talk about continuous performance testing.

Performance testing helps in:

  • Validating System Performance: Ensuring that the system performs well under expected load conditions.
  • Identifying Bottlenecks: Detecting performance issues that could degrade the user experience.
  • Ensuring Scalability: Verifying that the system can scale to accommodate increased load, and also decreased load.
  • Improving User Experience: Providing a consistently smooth and responsive experience for end-users to increase loyalty.

Performance Testing process

Like other software development activities, for performance testing to be effective it should be done through a process. The process requires collaboration with other teams such as business, DevOps, system, and development teams.

Performance Testing Process

Let’s explain the process with a real-world scenario. Imagine Wackadoo Corp wants to implement performance testing because they’ve noticed their e-commerce platform slows down dramatically during peak sales events, leading to frustrated customers and lost revenue. When this issue is raised to the performance engineers, they suspect it could be due to inadequate server capacity or inefficient database queries under heavy load and recommend running performance tests to pinpoint the problem. The engineers begin by gathering requirements, such as simulating 10,000 concurrent users while maintaining response times under 2 seconds, and then create test scripts to mimic real user behavior, like browsing products and completing checkouts.

A testing environment mirroring production is set up, and the scripts are executed while the system is closely monitored to ensure it handles the expected load. After the first test run, the engineers analyze the results and identify slow database queries as the primary bottleneck. They optimize the queries, add caching, and re-run the tests, repeating this process until the system meets all performance criteria. Once satisfied, they publish the final results, confirming the platform can now handle peak traffic smoothly, improving both customer experience and sales performance.

How to Apply Performance Testing

Like functional testing, performance testing should be integrated at every level of the system, starting from the unit level up. The test pyramid traditionally illustrates functional testing, with unit tests at the base, integration tests in the middle, and end-to-end or acceptance tests at the top. However, the non-functional aspect of testing—such as performance testing—often remains less visible within this structure. It is essential to apply appropriate non-functional tests at each stage to ensure a comprehensive evaluation. By conducting tailored performance tests across different levels, we can obtain early and timely feedback, enabling continuous assessment and improvement of the system’s performance.

Performance Testing for Test Levels

Types of Performance Testing

There are several types of performance tests, each designed to evaluate different aspects of system performance. We can basically categorize performance testing with three main criteria:

  • Load; for example, the number of virtual users
  • The strategy for varying the load over time
  • How long we apply performance testing

The following illustrates the different types of performance testing with regards to the three main criteria.

Performance Testing Types

The three main criteria are a good starting point, but they don’t completely characterize the types of performance tests. For example, we can also vary the type of load (for example, to test CPU-bound or I/O-heavy tasks) or the testing environment (for example, whether the system is allowed to scale up the number of instances).

Load Testing

Load testing is a basic form of performance testing that evaluates how a system behaves when subjected to a specific level of load. This specific load represents the optimal or expected amount of usage the system is designed to handle under normal conditions. The primary goal of load testing is to verify whether the system can deliver the expected responses while maintaining stability over an extended period. By applying this consistent load, performance engineers can observe the system’s performance metrics, such as response times, resource utilization, and throughput, to ensure it functions as intended.

  • Basic and widely known form of performance testing
  • Load tests are run under the optimum load of the system
  • Load tests give a result that real users might face in production
  • Easiest type to run in a CI/CD pipeline

Let’s make it clearer by again looking at Wackadoo Corp. Wackadoo Corp wants to test that a new feature is performing similarly to the system in production. The business team and performance engineers have agreed that the new feature should meet the following requirements while handling 5,000 concurrent users:

  • It can handle 1,000 requests per second (rps)
  • 95% of the response times are less than 1,000 ms
  • The longest responses are less than 2,000 ms
  • 0% error rate
  • The test server does not exceed 70% CPU usage with 4 GB of RAM

With these constraints in place, Wackadoo Corp can deploy the new feature in a testing environment and observe how it performs.
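
To make this concrete, here is a minimal sketch of what such a check could look like in Haskell, using the http-client, http-client-tls, http-types, and async packages. The URL is hypothetical, and the sketch is deliberately naive: a real load-testing tool (k6, Gatling, JMeter, etc.) would also handle ramp-up, think time, throughput measurement, connection failures, and server-side metrics such as CPU usage.

import Control.Concurrent.Async (forConcurrently)
import Data.List (sort)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import Network.HTTP.Client (httpLbs, newManager, parseRequest, responseStatus)
import Network.HTTP.Client.TLS (tlsManagerSettings)
import Network.HTTP.Types.Status (statusIsSuccessful)

main :: IO ()
main = do
    manager <- newManager tlsManagerSettings
    -- Hypothetical endpoint for the new feature under test.
    request <- parseRequest "https://test.wackadoo.example/products"
    -- Fire 5,000 concurrent requests, timing each one in milliseconds.
    -- NB: connection-level failures would throw here; a real harness
    -- would catch them and count them as errors.
    results <- forConcurrently [1 .. 5000 :: Int] $ \_ -> do
        start <- getCurrentTime
        response <- httpLbs request manager
        end <- getCurrentTime
        let millis = realToFrac (diffUTCTime end start) * 1000 :: Double
        pure (millis, statusIsSuccessful (responseStatus response))
    let times  = sort (map fst results)
        p95    = times !! floor (0.95 * fromIntegral (length times) :: Double)
        errors = length (filter (not . snd) results)
    putStrLn ("95th percentile (ms): " ++ show p95)
    putStrLn ("slowest response (ms): " ++ show (last times))
    putStrLn ("error count: " ++ show errors)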

Stress Testing

Stress testing evaluates a system’s upper limits by pushing it beyond normal operation to simulate extreme conditions like high traffic or data processing. It identifies breaking points and assesses the system’s ability to recover from failures. This testing uncovers weaknesses, ensuring stability and performance during peak demand, and improves reliability and fault tolerance.

  • Tests the upper limits of the system
  • Requires more resources than load testing, to create more virtual users, etc.
  • The boundary of the system should be investigated during the stress test
  • Stress tests can break the system
  • Stress tests can give us an idea about the performance of the system under heavy loads, such as promotional events like Black Friday
  • Hard to run in a CI/CD pipeline since the system is intentionally prone to fail

Wackadoo Corp wants to investigate the system behavior when exceeding the optimal users/responses so it decides to run a stress test. Performance engineers have the metrics for the upper limit of the system, so during the tests the load will be increased gradually until the peak level. The system can handle up to 10,000 concurrent users. The expectation is that the system will continue to respond, but the response metrics will degrade within the following expected limits:

  • It can handle 800 requests per second (rps)
  • 95% of the response times are less than 2,500 ms
  • The longest responses are less than 5,000 ms
  • 10% error rate
  • The test server stays around 95% CPU usage with 4 GB of RAM

If any of these limits are exceeded when monitoring in the test environment, then Wackadoo Corp knows it has a decision to make about resource scaling and its associated costs, if no further efficiencies can be made.

Spike Testing

A spike test is a type of performance test designed to evaluate how a system behaves when there is a sudden and significant increase or decrease in the amount of load it experiences. The primary objective of this test is to identify potential system failures or performance issues that may arise when the load changes unexpectedly or reaches levels that are outside the normal operating range.

By simulating these abrupt fluctuations in load, the spike test helps to uncover weaknesses in the system’s ability to handle rapid changes in demand. This type of testing is particularly useful for understanding how the system responds under stress and whether it can maintain stability and functionality when subjected to extreme variations in workload. Ultimately, the spike test provides valuable insights into the system’s resilience and helps ensure it can manage unexpected load changes without critical failures.

  • Spike tests give us an idea about the behavior of the system under unexpected increases and decreases in load
  • We can get an idea about how fast the system can scale-up and scale-down
  • They can require additional performance testing tools, as not all tools support this load profile
  • Good for some occasions like simulating push notifications, or critical announcements
  • Very hard to run in a CI/CD pipeline since the system is intentionally prone to fail

Let’s look at an example again: Wackadoo Corp wants to send push notifications to 20% of the mobile users at 3pm for Black Friday. They want to investigate the system behavior when the number of users increases and decreases suddenly, so they want to run a spike test. The system can handle up to 10,000 concurrent users, so the load will be increased to this amount in 10 seconds and then decreased to 5,000 users in 10 seconds. The expectation is that the system keeps responding, but the response metrics increase within the following expected limits:

  • Maximum latency is 500 ms
  • 95% of the response times are less than 5,000 ms
  • The longest responses are less than 10,000 ms
  • 15% error rate
  • The test server is around 95% CPU usage, but it should decrease when the load decreases

Again, if any of these expectations are broken, it may suggest to Wackadoo Corp that its resources are not sufficient.
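
To illustrate what this load profile looks like, here is a small sketch of the target-user curve as a plain Haskell function; real load-testing tools would express the same thing as ramp-up and ramp-down stages in their configuration:

-- Target number of virtual users, t seconds into the spike test.
spikeProfile :: Int -> Int
spikeProfile t
  | t <= 10   = t * 1000                -- ramp up: 0 -> 10,000 users over 10 s
  | t <= 20   = 10000 - (t - 10) * 500  -- ramp down: 10,000 -> 5,000 over 10 s
  | otherwise = 5000                    -- then hold at 5,000 users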

Endurance Testing (Soak Testing)

An endurance test focuses on evaluating the upper boundary of a system over an extended period of time. This test is designed to assess how the system behaves under sustained high load and whether it can maintain stability and performance over a prolonged duration.

The goal is to identify potential issues such as memory leaks, resource exhaustion, or degradation in performance that may occur when the system is pushed to its limits for an extended time. By simulating long-term usage scenarios, endurance testing helps uncover hidden problems that might not be evident during shorter tests. This approach ensures that the system remains reliable and efficient even when subjected to continuous high demand over an extended period.

  • Soak tests run for a prolonged time
  • They check the system stability when the load does not decrease for a long time
  • Soak testing can give a better idea about the performance of the system for campaigns like Black Friday than the other tests, hence the need for a diverse testing strategy
  • Hard to run in a CI/CD pipeline since it aims to test for a long period, which goes against the expected short feedback loop

This time, Wackadoo Corp wants to send push notifications to 10% of users every hour from 10am until 10pm on Black Friday, to increase sales for a one-day 50%-off promotion. They want to investigate the system behavior when the number of users increases but the load then stays stable between nominal and the upper boundary for a long time, so they run an endurance test. The system can handle up to 10,000 concurrent users, so the load will be increased to 8,000 users in 30 seconds and held there for the duration of the test. The expectation is that the system keeps responding, but the response metrics increase within the following expected limits:

  • Maximum latency is 300ms
  • 95% of the response times are less than 2,000 ms
  • Longest responses are less than 3,000 ms
  • 5% error rate
  • The test server is around 90% of CPU usage

Scalability Testing

Scalability testing is a critical type of performance testing that evaluates how effectively a system can manage increased load by incorporating additional resources, such as servers, databases, or other infrastructure components. This testing determines whether the system can efficiently scale up to accommodate higher levels of demand as user activity or data volume grows.

By simulating scenarios where the load is progressively increased, scalability testing helps identify potential bottlenecks, resource limitations, or performance issues that may arise during expansion. This process ensures that the system can grow seamlessly to meet future requirements without compromising performance, stability, or user experience. Ultimately, scalability testing provides valuable insights into the system’s ability to adapt to growth, helping organizations plan for and support increasing demands over time.

  • Scalability tests require collaboration for system monitoring and scaling
  • They can require more load generators, depending on the performance testing tool (e.g. load the system, then spike it)
  • They aim to check the behavior of the system during the scaling
  • Very hard to run in a CI/CD pipeline since it requires the scaling to be orchestrated

Performance engineers at Wackadoo Corp want to see how the system scales when the load exceeds the upper boundary, so they perform a scalability test. The system can handle up to 10,000 concurrent users on one server, so this time the load will be increased gradually starting from 5,000 users, with 1,000 users joining the system every 2 minutes. The expectation is that the system keeps responding, with the response metrics increasing with the load (as before) until it passes 10,000 users, when a new server should join the system, at which point we should observe the response metrics starting to decrease. Once scaling up has been tested, we can continue by testing scaling down, decreasing the number of users below the upper limit.

Volume Testing

Volume testing assesses the system’s behavior when it is populated with a substantial amount of data. The purpose of this testing is to evaluate how well the system performs and maintains stability under conditions of high data volume. By simulating scenarios where the system is loaded with large datasets, volume testing helps identify potential issues related to data handling, storage capacity, and processing efficiency.

This type of testing is particularly useful for uncovering problems such as slow response times, data corruption, or system crashes that may occur when managing extensive amounts of information. Additionally, volume testing ensures that the system can effectively store, retrieve, and process large volumes of data without compromising its overall performance or reliability.

  • Volume tests simulate the system behavior when huge amounts of data are received
  • They check if databases have any issue with indexing data
  • For example, in a Black Friday sale scenario, with a massive surge of new users accessing the website simultaneously, they ensure that no users experience issues such as failed transactions, slow response times, or an inability to access the system
  • Very hard to run in a CI/CD pipeline since the system is intentionally prone to fail

Wackadoo Corp wants to attract more customers, so it implemented an “invite your friend” feature. The company plans to give a voucher to both members and invited members, which will result in a huge amount of database traffic. Performance engineers want to run a volume test, which mostly includes scenarios like inviting, registering, checking the voucher code state, and loading the checkout page. During the test, the load will increase to 5,000 users by adding 1,000 users every 2 minutes, simulating normal user behavior. After that, heavy write operations can start. As a result, we should expect the following:

  • Maximum latency is 500ms
  • 95% of the response times are less than 3,000 ms
  • Longest responses are less than 5,000 ms
  • 0% error rate
  • The test server is around 90% of CPU usage

A failure here might suggest to Wackadoo Corp that its database service is a bottleneck.

Conclusion

Performance testing plays a crucial role in shaping the overall user experience because an application that performs poorly can easily lose users and damage its reputation. When performance problems are not detected and resolved early, the cost of fixing them later can increase dramatically, impacting both time and resources.

Moreover, collaboration between multiple departments, including development, operations, and business teams, is essential to ensure that the testing process aligns with real-world requirements and produces meaningful, actionable insights. Without this coordinated effort and knowledge base, performance testing may fail to deliver valuable outcomes or identify critical issues.

There are many distinct types of performance testing, each designed to assess the system’s behavior from a specific angle and under different conditions. Load testing can be easily adapted to the CI/CD pipeline; the other performance testing types can be more challenging, but they can still provide a lot of benefits.

In my next blog post, I will talk about my experiences on how we can apply performance testing continuously.

August 14, 2025 12:00 AM

August 13, 2025

Chris Penner

You should add debug views to your DB

You should add debug views to your DB

This one will be quick.

Imagine this, you get a report from your bug tracker:

Sophie got an error when viewing the diff after her most recent push to her contribution to the @unison/cloud project on Unison Share

(BTW, contributions are like pull requests, but for Unison code)

Okay, this is great, we have something to start with, let's go look up that contribution and see if any of the data there is suspicious.

Uhhh, okay, I know the error is related to one of Sophie's contributions, but how do I actually find it?

I know Sophie's username from the bug report, that helps, but I don't know which project she was working on, or what the contribution ID is, which branches are involved, etc. Okay no problem, our data is relational, so I can dive in and figure it out with a query:

> SELECT 
  contribution.* 
  FROM contributions AS contribution
  JOIN projects AS project 
    ON contribution.project_id = project.id
  JOIN users AS unison_user 
    ON project.owner = unison_user.id
  JOIN users AS contribution_author 
    ON contribution.author_id = contribution_author.id
  JOIN branches AS source_branch 
    ON contribution.source_branch = source_branch.id
  WHERE contribution_author.username = 'sophie'
    AND project.name = 'cloud'
    AND unison_user.username = 'unison'
  ORDER BY source_branch.updated_at DESC

-[ RECORD 1 ]--------+----------------------------------------------------
id                   | C-4567
project_id           | P-9999
contribution_number  | 21
title                | Fix bug
description          | Prevent the app from deleting the User's hard drive
status               | open
source_branch        | B-1111
target_branch        | B-2222
created_at           | 2025-05-28 13:06:09.532103+00
updated_at           | 2025-05-28 13:54:23.954913+00
author_id            | U-1234

It's not the worst query I've ever had to write out, but if you're doing this a couple times a day on a couple different tables, writing out the joins gets pretty old real fast. Especially so if you're writing it in a CLI interface where it's a royal pain to edit the middle of a query.

Even after we get the data, we have a very ID-heavy view of what's going on: what's the actual project name? What are the branch names? Etc.

We can solve both of these problems by writing out the joins ONCE: we create a debugging view over the table we're interested in. Something like this:

CREATE VIEW debug_contributions AS
SELECT 
  contribution.id AS contribution_id,
  contribution.project_id,
  contribution.contribution_number,
  contribution.title,
  contribution.description,
  contribution.status,
  contribution.source_branch as source_branch_id,
  source_branch.name AS source_branch_name,
  source_branch.updated_at AS source_branch_updated_at,
  contribution.target_branch as target_branch_id,
  target_branch.name AS target_branch_name,
  target_branch.updated_at AS target_branch_updated_at,
  contribution.created_at,
  contribution.updated_at,
  contribution.author_id,
  author.username AS author_username,
  author.display_name AS author_name,
  project.name AS project_name,
  '@'|| project_owner.username || '/' || project.name AS project_shorthand,
  project.owner AS project_owner_id,
  project_owner.username AS project_owner_username
FROM contributions AS contribution
JOIN projects AS project ON contribution.project_id = project.id
JOIN users AS author ON contribution.author_id = author.id
JOIN users AS project_owner ON project.owner = project_owner.id
JOIN branches AS source_branch ON contribution.source_branch = source_branch.id
JOIN branches AS target_branch ON contribution.target_branch = target_branch.id;

Okay, that's a lot to write out at once, but we never need to write that again. Now if we need to answer the same question we did above, we do:

SELECT * from debug_contributions 
  WHERE author_username = 'sophie'
    AND project_shorthand = '@unison/cloud'
    ORDER BY source_branch_updated_at DESC;

Which is considerably easier on both my brain and my fingers. I also get all the information I could possibly want in the result!

You can craft one of these debug views for whatever your needs are for each and every table you work with, and since it's just a view, it's trivial to update or delete, and doesn't take any space in the DB itself.

Obviously querying over project_shorthand = '@unison/cloud' isn't going to be able to use an index, so it isn't going to be the most performant query; but these are one-off queries, so it's not a concern (to me at least). If you care about that sort of thing you can leave out the computed columns so you won't have to worry about that.

Anyways, that's it, that's the whole trick. Go make some debugging views and save your future self some time.

Hopefully you learned something 🤞! Did you know I'm currently writing a book? It's all about Lenses and Optics! It takes you all the way from beginner to optics-wizard and it's currently in early access! Consider supporting it, and more posts like this one, by pledging on my Patreon page! It takes quite a bit of work to put these things together, so if I managed to teach you something or even just entertain you for a minute or two, maybe send a few bucks my way for a coffee? Cheers!

Become a Patron!

August 13, 2025 12:00 AM

August 12, 2025

Haskell Interlude

68: Michael Snoyman

In this episode, we’re joined by Michael Snoyman, author of Yesod, Conduit, Stackage and many other popular Haskell libraries.
We discuss newcomer friendliness, being a Rustacean vs a Haskellasaur, how STM is Haskell’s best feature and how laziness can be a vice.

by Haskell Podcast at August 12, 2025 02:00 PM

Chris Penner

Save memory and CPU with an interning cache

Save memory and CPU with an interning cache

This post will introduce a simple caching strategy, with a small twist, which depending on your app may help you not only improve performance, but might also drastically reduce the memory residency of your program.

I had originally written this post in 2022, but it looks like I got busy and failed to release it, so just pretend you're reading this in 2022, okay? It was a simpler time.

In case you're wondering, we continued to optimize storage since and modern UCM uses even less memory than back in 2022 😎.

Spoiler warning: with about 80 lines of code, I was able to reduce both the memory residency and start-up times by a whopping ~95%! From 90s -> 4s startup time, and from 2.73GB -> 148MB. All of these gains were realized by tweaking our app to enforce sharing between identical objects in memory.

Case Study

I help build the Unison Language. One unique thing about the language is that programmers interact with the language through the Unison Codebase Manager (a.k.a. ucm), which is an interactive shell. Some users have started to amass larger codebases, and lately we've been noticing that the memory usage of ucm was growing to unacceptable levels.

Loading one specific codebase, which I'll use for testing throughout this article, required 2.73GB and took about 90 seconds to load from SQLite. This is far larger and slower than we'd like.

There are 2 facets of how Unison stores code that are important to know as we go forward, and that will help you understand whether this technique might work for you.

  • Unison codebases are append-only, and codebase definitions are referenced by a content-based hash.

A Unison codebase is a tree with many branches; each branch contains many definitions and also has references to its history. In Unison, once a definition is added to the codebase it is immutable. This is similar to how commits work in git: commits can be built upon, and branches can change which commit they point to, but once a commit is created it cannot be changed and is uniquely identified by its hash.

  • A given Unison codebase is likely to refer to subtrees of code like libraries many times across different Unison branches. E.g. most projects contain a reference to the base library.

A Unison project can pull in the libraries it depends on by simply mounting that dependency into its lib namespace. Doing so is inexpensive because in effect we simply copy the hash which refers to a given snapshot of the library; we don't need to make copies of any of the underlying code. However, when loading the codebase into memory, ucm was hydrating each and every library reference into a full in-memory representation of that code. No good!

What is sharing and why do I want it?

Sharing is a very simple concept at its core: rather than having multiple copies of the same identical object in memory, we should just have one. It's dead simple if you say it like that, but there are many ways we can end up with duplicates of values in memory. For example, if I load the same codebase from SQLite several times then SQLite won't know that the object I'm loading already exists in memory and will make a whole new copy.

In a language where data is mutable by default you'll want to think long and hard about whether sharing is sensible or even possible for your use-case, but luckily for me, everything in Haskell is immutable by default so there's absolutely no reason to make copies of identical values.

There's an additional benefit to sharing beyond just saving memory: equality checks may be optimized! Some Haskell types like ByteString include an optimization in their Eq instance which short-circuits the whole check if the two values are pointer-equal. Testing equality on string-like values is typically most expensive when the two strings are actually equal, since the check must examine every single byte to see if any of them differ. By interning our values using a cache, we can reduce these checks to a single pointer-equality check rather than an expensive byte-by-byte comparison.

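As a rough sketch of that pattern (Blob here is a made-up stand-in type, not the real ByteString internals):

{-# LANGUAGE MagicHash #-}

import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- Hypothetical string-like type.
newtype Blob = Blob String

instance Eq Blob where
  a@(Blob xs) == b@(Blob ys) =
    -- If both arguments are the same heap object (as interned values are),
    -- we never touch the payload; otherwise fall back to the full comparison.
    isTrue# (reallyUnsafePtrEquality# a b) || xs == ys
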
Implementation

One issue with caches like this is that they can grow to eventually consume unbounded amounts of memory; we certainly don't want every value we've ever cached to stay there forever. Haskell is a garbage collected language, so naturally the ideal situation would be for a value to live in the cache up until it is garbage collected, but how can we know that?

GHC implements weak pointers! This nifty feature allows us to do two helpful things:

  1. We can attach a finalizer to the values we return from the cache, such that values will automatically evict themselves from the cache when they're no longer reachable.
  2. Weak references don't prevent the value they're pointing to from being garbage collected. This means that if a value is only referenced by a weak pointer in a cache then it will still be garbage collected.

As a result, there's really no downside to this form of caching except a very small amount of compute and memory used to maintain the cache itself. Your mileage may vary, but as the numbers show, in our case this cost was very much worth it when compared to the gains.

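Here's a tiny standalone sketch of those two properties in action; whether the second lookup actually sees Nothing depends on GC timing and optimizations, so treat it as illustrative:

import System.Mem (performGC)
import System.Mem.Weak (deRefWeak, mkWeakPtr)

main :: IO ()
main = do
  let xs = map (* 2) [1 .. 5 :: Int]
  -- The finalizer runs when 'xs' gets garbage collected.
  wk <- mkWeakPtr xs (Just (putStrLn "evicting from cache"))
  deRefWeak wk >>= print -- Just [2,4,6,8,10]: the value is still alive
  performGC
  deRefWeak wk >>= print -- likely Nothing: the weak ref didn't keep it alive
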
Here's an implementation of a simple Interning Cache:

module InternCache
  ( InternCache,
    newInternCache,
    lookupCached,
    insertCached,
    intern,
    hoist,
  )
where

import Control.Monad.IO.Class (MonadIO (..))
import Data.HashMap.Strict (HashMap)
import Data.HashMap.Strict qualified as HashMap
import Data.Hashable (Hashable)
import System.Mem.Weak
import UnliftIO.STM

-- | Parameterized by the monad in which it operates, the key type, 
-- and the value type.
data InternCache m k v = InternCache
  { lookupCached :: k -> m (Maybe v),
    insertCached :: k -> v -> m ()
  }

-- | Creates an 'InternCache' which uses weak references to only 
-- keep values in the cache for as long as they're reachable by 
-- something else in the app.
--
-- This means you don't need to worry about a value not being 
-- GC'd because it's in the cache.
newInternCache :: 
  forall m k v. (MonadIO m, Hashable k) 
  => m (InternCache m k v)
newInternCache = do
  var <- newTVarIO mempty
  pure $
    InternCache
      { lookupCached = lookupCachedImpl var,
        insertCached = insertCachedImpl var
      }
  where
    lookupCachedImpl :: TVar (HashMap k (Weak v)) -> k -> m (Maybe v)
    lookupCachedImpl var ch = liftIO $ do
      cache <- readTVarIO var
      case HashMap.lookup ch cache of
        Nothing -> pure Nothing
        Just weakRef -> do
          deRefWeak weakRef

    insertCachedImpl :: TVar (HashMap k (Weak v)) -> k -> v -> m ()
    insertCachedImpl var k v = liftIO $ do
      wk <- mkWeakPtr v (Just $ removeDeadVal var k)
      atomically $ modifyTVar' var (HashMap.insert k wk)

    -- Use this as a finalizer to remove the key from the map 
    -- when its value gets GC'd
    removeDeadVal :: TVar (HashMap k (Weak v)) -> k -> IO ()
    removeDeadVal var k = liftIO do
      atomically $ modifyTVar' var (HashMap.delete k)

-- | Changing the monad in which the cache operates with a natural transformation.
hoist :: (forall x. m x -> n x) -> InternCache m k v -> InternCache n k v
hoist f (InternCache lookup' insert') =
  InternCache
    { lookupCached = f . lookup',
      insertCached = \k v -> f $ insert' k v
    }

Now you can create a cache for any values you like! You can maintain a cache within the scope of a given chunk of code, or you can make a global cache for your entire app using unsafePerformIO like this:

import InternCache (InternCache, hoist)
import InternCache qualified as IC
import System.IO.Unsafe (unsafePerformIO)

-- An in-memory cache for interning hashes.
-- This allows us to avoid creating multiple in-memory instances of the same hash bytes;
-- but also has the benefit that equality checks for equal hashes are O(1) instead of O(n), since
-- they'll be pointer-equal.
hashCache :: (MonadIO m) => InternCache m Hash Hash
hashCache = unsafePerformIO $ hoist liftIO <$> IC.newInternCache @IO @Hash @Hash
{-# NOINLINE hashCache #-}

And here's an example of what it looks like to use the cache in practice:

expectHash :: HashId -> Transaction Hash
expectHash h =
  -- See if we've got the value in the cache
  lookupCached hashCache h >>= \case
    Just hash -> pure hash
    Nothing -> do
      hash <-
        queryOneCol
          [sql|
              SELECT base32
              FROM hash
              WHERE id = :h
            |]
      -- Since we didn't have it in the cache, add it now
      insertCached hashCache h hash
      pure hash

For things like Hashes, the memory savings are more modest, but in the cases of entire subtrees of code the difference for us was substantial. Not only did we save memory, but we saved a ton of time re-hydrating subtrees of code from SQLite that we already had.

We can even get the benefits of a cache like this when we don't have a separate key for the value, as long as the value itself has a Hashable or Ord instance (if you swap the InternCache to use a regular Map). We can use the value as its own key. This doesn't help us avoid the computational cost of creating the value, but it still gives us the memory savings:

-- | When a value is its own key, this ensures that the given value 
-- is in the cache and always returns the single canonical in-memory 
-- instance of that value, garbage collecting any others.
intern :: (Hashable k, Monad m) => InternCache m k k -> k -> m k
intern cache k = do
  mVal <- lookupCached cache k
  case mVal of
    Just v -> pure v
    Nothing -> do
      insertCached cache k k
      pure k

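For example, here's what a hypothetical self-keyed cache might look like. Branch, BranchId, and fetchBranchFromDb are made-up names for illustration, and this assumes Branch has a Hashable instance:

import System.IO.Unsafe (unsafePerformIO)

-- Global self-keyed cache (hypothetical 'Branch' type).
branchCache :: InternCache IO Branch Branch
branchCache = unsafePerformIO newInternCache
{-# NOINLINE branchCache #-}

loadBranch :: BranchId -> IO Branch
loadBranch branchId = do
  branch <- fetchBranchFromDb branchId -- made-up database loader
  -- Returns the one canonical in-memory copy; duplicates can be GC'd.
  intern branchCache branch
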
Conclusion

An approach like this doesn't work for every app; it's much easier to use when working with immutable values like this. But if there's a situation in your app where it makes sense, I recommend giving it a try! I'll reiterate that for us, we dropped our codebase load times from 90s down to 4s, and our resting memory usage from 2.73GB down to 148MB.

Hopefully you learned something 🤞! Did you know I'm currently writing a book? It's all about Lenses and Optics! It takes you all the way from beginner to optics-wizard and it's currently in early access! Consider supporting it, and more posts like this one, by pledging on my Patreon page! It takes quite a bit of work to put these things together, so if I managed to teach you something or even just entertain you for a minute or two, maybe send a few bucks my way for a coffee? Cheers!

Become a Patron!

August 12, 2025 12:00 AM

August 11, 2025

Philip Wadler

The Provocateurs: Brave New Bullshit

 

Update: My colleague Elizabeth Polgreen has kindly written a post for the ETAPS Blog describing my show.
Philip Wadler is a man who wears many different hats. Both literally: fedoras, trilbys, even the occasional straw hat, and metaphysically: recently retired Professor of theoretical computer science at the University of Edinburgh; Fellow of the Royal Society; senior researcher at the blockchain infrastructure company IOHK; Lambda Man; often-times favourite lecturer of the first year computer science students; and, occasionally, stand-up comedian. It is the latter role that leads me to ask Phil if he will participate in a Q&A.
[Previous post repeated below.]

Following two sell-out shows at the Fringe last year, I'm on at the Fringe again:

11.25 Monday 4 August, Stand 2 w/Lucy Remnant and Susan Morrison
17.40 Sunday 17 August, Stand 4 w/Smita Kheria and Sarah-Jane Judge
17.40 Tuesday 19 August, Stand 4 w/Cameron Wyatt and Susan Morrison

Shows are under the banner of The Provocateurs (formerly Cabaret of Dangerous Ideas). Tickets go on sale Wednesday 7 May, around noon. The official blurb is brief:

Professor Philip Wadler (The University of Edinburgh) separates the hopes and threats of AI from the chatbot bullshit.

Here is a longer blurb, from my upcoming appearance at Curious, run by the RSE, in September.
Brave New Bullshit
In an AI era, who wins and who loses?

Your future workday might look like this: 
  • You write bullet points.
  • You ask a chatbot to expand them into a report.
  • You send it to your boss ...
  • Who asks a chatbot to summarise it to bullet points.
Will AI help you to do your job or take it from you? Is it fair for AI to be trained on copyrighted material? Will any productivity gains benefit everyone or only a select few?
 
Join Professor Philip Wadler’s talk as he looks at the hopes and threats of AI, exploring who wins and who loses.

by Philip Wadler (noreply@blogger.com) at August 11, 2025 07:04 PM

Monday Morning Haskell

In-Order Traversal in Haskell and Rust

Last time around, we started exploring binary trees. We began with a simple problem (inverting a tree), but encountered some of the difficulties implementing a recursive data structure in Rust.

Today we’ll do a slightly harder problem (LeetCode rates it as “Medium” instead of “Easy”). This problem also specifically works with a binary search tree instead of a simple binary tree. With a search tree, we have the property that the “values” on each node are orderable: all the values to the “left” of any given node are no greater than that node’s value, and the values to the “right” are not smaller.

Binary search trees are the heart of any ordered Set type. In our problem solving course Solve.hs, you’ll get the chance to build a self-balancing binary search tree from scratch, which involves some really cool algorithmic tricks!

The Problem

Though it’s harder than our previous problem, today’s problem is still straightforward. We are taking an ordered binary search tree and finding the k-th smallest element in that tree, where k is the second input to our function.

So suppose our tree looks like this:

     45
    /  \
   32  50
  /  \   \
 5   40   100
    /  \
  37   43

Our input k is 1-indexed. So if we get 1 as our input, we should return 5, the smallest element in the tree. If we receive 4, we should return 40, the 4th smallest element after 5, 32 and 37. If we get 8, we’ll return 100, the largest element in the tree.

The Algorithm

Binary search trees are designed to give logarithmic time access, insertion, and deletion. If our tree were annotated so that each node stored the size of its subtree, we’d be able to solve this problem in logarithmic time as well.

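Here’s a hedged sketch of what that annotated variant could look like (SizedTree is a made-up type for illustration; it’s not the representation we use below):

data SizedTree = SNil | SNode Int Int SizedTree SizedTree -- size, value, left, right

size :: SizedTree -> Int
size SNil = 0
size (SNode s _ _ _) = s

-- With subtree sizes cached, we can descend directly toward the k-th
-- element: O(log n) on a balanced tree.
kthSmallestSized :: SizedTree -> Int -> Maybe Int
kthSmallestSized SNil _ = Nothing
kthSmallestSized (SNode _ x l r) k
  | k <= leftSize = kthSmallestSized l k
  | k == leftSize + 1 = Just x
  | otherwise = kthSmallestSized r (k - leftSize - 1)
  where
    leftSize = size l
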
However, we want to assume a minimal tree design, where each node only holds its own value and pointers to its two children. With these constraints, our algorithm has to be linear in terms of the input k.

We’re going to solve this with an in-order traversal. We’re just going to traverse the elements of the BST in order from smallest value to largest, and count until we’ve encountered k elements. Then we’ll return the value at our “current” node.

An in-order traversal is conceptually simple. For a given node, we “visit” the left child, then visit the node itself, and then visit the right child. But the actual mechanics of doing this traversal can be a little tricky to think of on the spot if you haven’t practiced it before.

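For contrast, the conceptually simple version is a few lines in Haskell, using the TreeNode type we define below; laziness means we only evaluate as much of the traversal as we need. The stack-based approach that follows mirrors the formulation we’ll also need for Rust.

inorder :: TreeNode -> [Int]
inorder Nil = []
inorder (Node x left right) = inorder left ++ [x] ++ inorder right

-- k is 1-indexed, matching the problem statement.
kthSmallestSimple :: TreeNode -> Int -> Int
kthSmallestSimple root k = inorder root !! (k - 1)
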
The main idea is that we’ll use a stack of nodes to track where we are in the tree. The stack traces a path from our “current” node up through its parents to the root of the tree. Our algorithm looks like this:

First, we create a stack from the root, following all left child nodes until we reach a node without a left child. This node has the smallest element of the tree.

Now, we begin processing, always considering the top node in the stack, while tracking the number of elements remaining until we hit k. If that number is down to 1, then the value at the top node in the stack is our result.

If not, we’ll decrement the number of elements remaining, and then check the right child of this node. If the right child is Nil, we just pop this node from the stack and process its parent. If the right child does exist, we’ll add all of its left children to our stack as well.

Here’s how the stack looks with our example tree, if we’re looking for k=7 (which should be 50).

[5, 32, 45] -- k = 1, Initial left children of root
[32, 45] -- k = 2, popped 5, no right child
[37, 40, 45] -- k = 3, popped 32, right child was 40, which added left child 37
[40, 45] -- k = 4, popped 37, no right child
[43, 45] -- k = 5, popped 40, and 43 is right child
[45] -- k = 6, popped 43
[50] -- k = 7, popped 45 and added 50, the right child (no left children)

Since 50 is on top of the stack with k = 7, we can return 50.

Haskell Solution

Let’s code this up! We’ll start with Haskell, since Rust is, once again, somewhat tricky due to TreeNode handling. To start, let’s remind ourselves of the recursive TreeNode type:

data TreeNode = Nil | Node Int TreeNode TreeNode
  deriving (Show, Eq)

Now when writing up the algorithm, we first want to define a helper function addLeftNodesToStack. We’ll use this at the beginning of the algorithm, and then again each time we encounter a right child.

This helper will take a TreeNode and the existing stack, and return the modified stack.

kthSmallest :: TreeNode -> Int -> Int
kthSmallest root' k' = ...
  where
    addLeftNodesToStack :: TreeNode -> [TreeNode] -> [TreeNode]
    addLeftNodesToStack = ...

As far as recursive helpers go, this is a simple one! If our input node is Nil, we return the original stack. We want to maintain the invariant that we never include Nil values in our stack! But if we have a value node, we just add it to the stack and recurse on its left child.

kthSmallest :: TreeNode -> Int -> Int
kthSmallest root' k' = ...
  where
    addLeftNodesToStack :: TreeNode -> [TreeNode] -> [TreeNode]
    addLeftNodesToStack Nil acc = acc
    addLeftNodesToStack root@(Node _ left _) acc = addLeftNodesToStack left (root : acc)

Now it’s time to implement our algorithm for finding the k-th element. This will be a recursive function that takes the number of elements remaining, as well as the current stack. We’ll call this initially with k and the stack we get from adding the left nodes of the root:

kthSmallest :: TreeNode -> Int -> Int
kthSmallest root' k' = findK k' (addLeftNodesToStack root' [])
  where
    addLeftNodesToStack = ...

    findK :: Int -> [TreeNode] -> Int
    findK = ...

This function has a couple error cases. We expect a non-empty stack (our input k is constrained within the size of the tree), and we expect the top to be non-Nil. After that, we have our base case where k = 1, and we return the value at this node.

Finally, we get our recursive case. We decrement the remaining count, and add the left nodes of the right child of this node to the stack.

kthSmallest :: TreeNode -> Int -> Int
kthSmallest root' k' = findK k' (addLeftNodesToStack root' [])
  where
    addLeftNodesToStack = ...

    findK :: Int -> [TreeNode] -> Int
    findK k [] = error $ "Found empty list expecting k: " ++ show k
    findK _ (Nil : _) = error "Added Nil to stack!"
    findK 1 (Node x _ _ : _) = x
    findK k (Node _ _ right : rest) = findK (k - 1) (addLeftNodesToStack right rest)

This completes our solution!

kthSmallest :: TreeNode -> Int -> Int
kthSmallest root' k' = findK k' (addLeftNodesToStack root' [])
  where
    addLeftNodesToStack :: TreeNode -> [TreeNode] -> [TreeNode]
    addLeftNodesToStack Nil acc = acc
    addLeftNodesToStack root@(Node _ left _) acc = addLeftNodesToStack left (root : acc)

    findK :: Int -> [TreeNode] -> Int
    findK k [] = error $ "Found empty list expecting k: " ++ show k
    findK _ (Nil : _) = error "Added Nil to stack!"
    findK 1 (Node x _ _ : _) = x
    findK k (Node _ _ right : rest) = findK (k - 1) (addLeftNodesToStack right rest)

Rust Solution

In our Rust solution, we’re once again working with this TreeNode type, including the 3 wrapper layers:

#[derive(Debug, PartialEq, Eq)]
pub struct TreeNode {
  pub val: i32,
  pub left: Option<Rc<RefCell<TreeNode>>>,
  pub right: Option<Rc<RefCell<TreeNode>>>,
}

Our first step will be to implement the helper function to add the “left” nodes. This function will take a “root” node as well as a mutable reference to the stack so we can add nodes to it.

fn add_left_nodes_to_stack(
        node: Option<Rc<RefCell<TreeNode>>>,
        stack: &mut Vec<Rc<RefCell<TreeNode>>>,
) {
    ...
}

You’ll notice that stack does not actually use the Option wrapper, only Rc and RefCell. Remember that in our Haskell solution we want to enforce that we don’t add null nodes to the stack. This Rust solution enforces this constraint at compile time.

To implement this function, we’ll use the same trick we did when inverting trees to pattern match on node and detect if it is Some or None. If it is None, we don’t have to do anything.

fn add_left_nodes_to_stack(
        node: Option<Rc<RefCell<TreeNode>>>,
        stack: &mut Vec<Rc<RefCell<TreeNode>>>,
) {
    if let Some(current) = node {
        ...
    }
}

Since current is now unwrapped from Option, we can push it to the stack. As in our previous problem though, we have to clone it first! We need a clone of the reference (as wrapped by Rc) because the stack will now have to own this reference.

fn add_left_nodes_to_stack(
        node: Option<Rc<RefCell<TreeNode>>>,
        stack: &mut Vec<Rc<RefCell<TreeNode>>>,
) {
    if let Some(current) = node {
        stack.push(current.clone());
        ...
    }
}

Now we’ll recurse on the left subchild of current. In order to unwrap the TreeNode from Rc/RefCell, we have to use borrow. Then we can grab the left value. But again, we have to clone it before we make the recursive call. Here’s the final implementation of this helper:

fn add_left_nodes_to_stack(
        node: Option<Rc<RefCell<TreeNode>>>,
        stack: &mut Vec<Rc<RefCell<TreeNode>>>,
) {
    if let Some(current) = node {
        stack.push(current.clone());
        add_left_nodes_to_stack(current.borrow().left.clone(), stack);
    }
}

We could have implemented the helper with a while loop instead of recursion. This would actually have used less memory in Rust! We would have to make some changes though, like making a new mut reference from the root.

Now we can move on to the core function. We’ll start this by defining key terms like our stack and the number of “remaining” values (initially k). We’ll also call our helper to get the initial stack.

pub fn kth_smallest(root: Option<Rc<RefCell<TreeNode>>>, k: i32) -> i32 {
    let mut stack = Vec::new();
    let mut remaining = k;

    add_left_nodes_to_stack(root, &mut stack);
    ...
}

Now we want to pop the top element from the stack, and pattern match it as requiring Some. If there are no more values, we’ll actually panic, because the problem constraints should mean that our stack is never empty. Unlike our helper, we actually will use a while loop here instead of more recursion:

pub fn kth_smallest(root: Option<Rc<RefCell<TreeNode>>>, k: i32) -> i32 {
    let mut stack = Vec::new();
    let mut remaining = k;

    add_left_nodes_to_stack(root, &mut stack);

    while let Some(current) = stack.pop() {
        ...
    }
    panic!("k is larger than number of nodes");
}

Now the inside of the loop is simple, following what we’ve done in Haskell. If our remainder is 1, then we have found the correct node. We borrow the node from the RefCell and return its value. Otherwise we decrement the count and use our helper on the “right” child of the node we just popped. As usual, the RefCell wrapper means we need to borrow to get the right value from the TreeNode, and then we clone this child as we pass it to the helper.

pub fn kth_smallest(root: Option<Rc<RefCell<TreeNode>>>, k: i32) -> i32 {
    let mut stack = Vec::new();
    let mut remaining = k;

    add_left_nodes_to_stack(root, &mut stack);

    while let Some(current) = stack.pop() {
        if remaining == 1 {
            return current.borrow().val;
        }
        remaining -= 1;
        add_left_nodes_to_stack(current.borrow().right.clone(), &mut stack);
    }
    panic!("k is larger than number of nodes");
}

And that’s it! Here’s the complete Rust solution:

fn add_left_nodes_to_stack(
        node: Option<Rc<RefCell<TreeNode>>>,
        stack: &mut Vec<Rc<RefCell<TreeNode>>>,
) {
    if let Some(current) = node {
        stack.push(current.clone());
        add_left_nodes_to_stack(current.borrow().left.clone(), stack);
    }
}

pub fn kth_smallest(root: Option<Rc<RefCell<TreeNode>>>, k: i32) -> i32 {
    let mut stack = Vec::new();
    let mut remaining = k;

    add_left_nodes_to_stack(root, &mut stack);

    while let Some(current) = stack.pop() {
        if remaining == 1 {
            return current.borrow().val;
        }
        remaining -= 1;
        add_left_nodes_to_stack(current.borrow().right.clone(), &mut stack);
    }
    panic!("k is larger than number of nodes");
}

Conclusion

In-order traversal is a great pattern to commit to memory, as many different tree problems will require you to apply it. Hopefully the details with Rust’s RefCells are getting more familiar. Next week we’ll do one more problem with binary trees.

If you want to do some deep work with Haskell and binary trees, take a look at our Solve.hs course, where you’ll learn about many different data structures in Haskell, and get the chance to write a balanced binary search tree from scratch!

by James Bowen at August 11, 2025 08:30 AM

Chris Penner

Using traversals to batch database queries

Using traversals to batch database queries

This article is about a code-transformation technique I used to get 100x-300x performance improvements on a particularly slow bit of code which was loading Unison code from Postgres in Unison Share. I haven't seen it documented anywhere else, so wanted to share the trick!

It's a perennial annoyance when I'm programming that often the most readable way to write some code is also directly at odds with being performant. A lot of data has a tree structure, and so working with this data is usually most simply expressed as a series of nested function calls. Nested function calls are a reasonable approach when executing CPU-bound tasks, but in webapps we're often querying or fetching data along the way. In a nested function structure we'll naturally end up interleaving a lot of one-off data requests. In most cases these data requests will block further execution until a round-trip to the database fetches the data we need to proceed.

In Unison Share, I often need to hydrate an ID into an AST structure which represents a chunk of code, and each reference in that code will often contain some metadata or information of its own. We split off large text blobs and external code references from the AST itself, so sometimes these fetches will proceed in layers, e.g. fetch the AST, then fetch the text literals referenced in the tree, then fetch the metadata for code referenced by the tree, etc.

When hydrating a large batch of code definitions, if each definition takes N database calls, loading M definitions is NxM database round-trips, NxM query plans, and potentially NxM index or table scans! If you make a call for each text ID or external reference individually, then this scales even worse.

This post details a technique for using traversals to iteratively evolve linear, nested codepaths into similar functions which work on batches of data instead. Critically, it lets you keep the same nested code structure, avoiding the need to restructure the whole codebase and allowing you to introduce batching progressively without shipping a whole rewrite at once. It also provides a trivial mechanism for deduplicating data requests, and even allows using the exact same codepath for loading 0, 1, or many entities in a typesafe way. First, a quick explanation of how I ended up in this situation.

Case study: Unison Share definition loading

I'm in charge of the Unison Share code-hosting and collaboration platform. The codebase for this webapp started its life by collecting bits and pieces of code from the UCM CLI application. UCM uses SQLite, so the first iteration was a minimal rewrite which simply replaced SQLite queries with the equivalent Postgres queries, but the codepaths themselves were left largely the same.

SQLite operates in-process and loads everything from memory or disk, so for our intents and purposes in UCM it has essentially no latency. As a result, most code for loading definitions from the user's codebase in UCM was written simply and linearly, loading the data only as it is needed. E.g. we may have a method loadText :: TextId -> Sqlite.Transaction Text, and when we needed to load many text references it was perfectly reasonable to just traverse loadText over a list of IDs.

However, not all databases have the same trade-offs! In the Unison Share webapp we use Postgres, which means the database has a network call and round-trip latency for each and every query. We now pay a fixed round-trip latency cost on every query that simply wasn't a factor before. Something simple like traverse loadText textIds is now performing hundreds of sequential database calls and individual text index lookups! Postgres doesn't know anything about which query we'll run next, so it can't optimize this at all (aside from warming up caches). That's clearly not good.

To optimize for Postgres we'd much prefer to make one large database call which takes an array of a batch of TextIds and returns all the Text results in a single query. This allows Postgres to save a lot of work by finding all text values in a single scan, and means we only incur a single round-trip delay rather than one per text.

Here's a massively simplified sketch of what the original naive linear code looked like:

loadTerm :: TermReference -> Transaction (AST TermInfo Text)
loadTerm ref = do
  ast <- loadAST ref
  bitraverse loadTermInfo loadText ast

loadTermInfo :: TermReference -> Transaction TermInfo
loadTermInfo ref =
  queryOneRow [sql| SELECT name, type FROM terms WHERE ref = #{ref} |]

loadText :: TextId -> Transaction Text
loadText textId =
  queryOneColumn [sql| SELECT text FROM texts WHERE id = #{textId} |]

We really want to load all the Texts in a single query, but the TextIds aren't just sitting in a nice list; they're nested within the AST structure.

Here's some pseudocode for fetching these as a batch:

batchLoadASTTexts :: AST TermReference TextId -> Transaction (AST TermReference Text)
batchLoadASTTexts ast = do
  let textIds = Foldable.toList ast
  texts <- fetchTexts textIds
  for ast \textId ->
    case Map.lookup textId texts of
      Nothing -> throwError $ MissingText textId
      Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <- queryListColumns [sql|
        SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds})
      |]
      pure $ Map.fromList resolvedTexts

This solves the biggest problem: it reduces N queries down to a single batch query, which is already a huge improvement! However, it is a bit of boilerplate, and we'd need to write a custom version of this for each container we want to batch load texts from.

Clever folks will realize that we actually don't care about the AST structure at all; we only need a container which is Traversable, so we can generalize over that:

batchLoadTexts :: Traversable t => t TextId -> Transaction (t Text)
batchLoadTexts textIds = do
  resolvedTexts <- fetchTexts (Foldable.toList textIds)
  for textIds \textId -> case Map.lookup textId resolvedTexts of
    Nothing -> throwError $ MissingText textId
    Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <- queryListColumns [sql|
        SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds})
      |]
      pure $ Map.fromList resolvedTexts

This is much better: now we can use this on any form of Traversable, meaning we can batch load from ASTs, lists, vectors, Maps, and can even just use Identity to re-use our query logic for a single ID like this:

loadText :: TextId -> Transaction Text
loadText textId = do
  Identity text <- batchLoadTexts (Identity textId)
  pure text

This approach does still require that the IDs you want to batch load are the focus of some Traversable instance. What if instead your structure contains a half-dozen different ID types, or is arranged such that it's not in the Traversable slot of your type parameters? Bitraversable can handle up to two parameters, but after that you're back to writing bespoke functions for your container types.

For instance, how would we use this technique to batch load our TermInfo from the AST's TermReferences?

-- Assume we've written these batched term and termInfo loaders:
batchLoadTexts :: Traversable t => t TextId -> Transaction (t Text)
batchLoadTermInfos :: Traversable t => t TermReference -> Transaction (t TermInfo)

loadTerm :: TermReference -> Transaction (AST TermInfo Text)
loadTerm termRef = do
  ast <- loadAST termRef
  astWithText <- batchLoadTexts ast
  ??? astWithText -- How do we load the TermInfos in here?

We're getting closer, but Traversable instances just aren't very adaptable: the relevant ID must always be in the final parameter of the type. In this case you could get by with a Flip wrapper, as sketched below, but it's not going to be very readable and this technique doesn't scale past two parameters.

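For the record, here's a sketch of that Flip trick, assuming AST has a Bitraversable instance (which the bitraverse call earlier suggests):

import Data.Bifunctor.Flip (Flip (..))

-- Flip swaps the type parameters, so its Traversable instance focuses the
-- TermReferences instead of the Texts.
loadInfosViaFlip :: AST TermReference Text -> Transaction (AST TermInfo Text)
loadInfosViaFlip ast = runFlip <$> batchLoadTermInfos (Flip ast)
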
We need some way to define and compose bespoke Traversable instances for any given situation.

Custom Traversals

In its essence, the Traversable type class is just a way to easily provide a canonical implementation of traverse for a given type:

traverse :: Applicative f => (a -> f b) -> t a -> f (t b)

As it turns out, we don't need a type class in order to construct and pass functions of this type around; we can define them ourselves.

With this signature it's still requiring that the elements being traversed are the final type parameter of the container t; we need a more general version. We can use this instead:

type Traversal s t a b = Applicative f => (a -> f b) -> s -> f t

It looks very similar, but note that s and t are now concrete types of kind *; they don't take a parameter. This means we can pick any fully parameterized types we like for s and t which focus some other type a and convert or hydrate it into b.

E.g. If we want a traversal to focus the TermReferences in an AST and convert them to TermInfos, we can write:

Traversal (AST TermReference text) (AST TermInfo text) TermReference TermInfo

-- Which expands to the function type:

Applicative f => (TermReference -> f TermInfo) -> AST TermReference text -> f (AST TermInfo text)

If you've ever worked with optics or the lens library before, this should be looking mighty familiar: we've just derived lens's Traversal type!

Most optics are essentially just traversals: we can write one-off traversals for any situation we might need, and can trivially compose small independent traversals together to create more complex traversals.

Let's rewrite our batch loaders to take an explicit Traversal argument.

import Control.Lens qualified as Lens
import Data.Functor.Contravariant

-- Take a traversal, then a structure 's', and replace all TextIds with Texts to
-- transform it into a 't'
batchLoadTextsOf :: Lens.Traversal s t TextId Text -> s -> Transaction t
batchLoadTextsOf traversal s = do
  let textIds = Lens.toListOf (traversalToFold traversal) s
  resolvedTexts <- fetchTexts textIds
  Lens.forOf traversal s $ \textId -> case Map.lookup textId resolvedTexts of
    Nothing -> throwError $ MissingText textId
    Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <- queryListColumns [sql|
        SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds})
      |]
      pure $ Map.fromList resolvedTexts

traversalToFold ::
  (Applicative f, Contravariant f) =>
  Lens.Traversal s t a b ->
  Lens.LensLike' f s a
traversalToFold traversal f s = phantom $ traversal (phantom . f) s

The *Of naming convention comes from the lens library. A combinator ending in Of takes a traversal as an argument.

It's a bit unfortunate that we need traversalToFold; it's just a quirk of how Traversals and Folds are implemented in the lens library. But don't worry, we'll replace it with something better soon.

Now we can pass any custom traversal we like into batchLoadTextsOf and it will batch up the IDs and hydrate them in-place.

Let's write the AST traversals we need:

astTexts :: Traversal (AST TermReference TextId) (AST TermReference Text) TextId Text
astTexts = traverse

astTermReferences :: Traversal (AST TermReference text) (AST TermInfo text) TermReference TermInfo
astTermReferences f = bitraverse f pure

Here we can just piggy-back on existing traverse and bitraverse implementations, but if you need to write your own, I included a small guide on writing your own custom Traversals with the traversal method in the lens library; go check that out.

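As a taste, here's what a hand-rolled traversal can look like for a hypothetical TermDecl type: apply the effectful function to each focus, then reassemble the structure applicatively.

import Data.Bitraversable (bitraverse)

-- Made-up type: a top-level reference plus its AST.
data TermDecl = TermDecl TermReference (AST TermReference TextId)

-- Focus every TermReference in the declaration, without changing its type.
declReferences :: Lens.Traversal' TermDecl TermReference
declReferences f (TermDecl ref ast) =
  TermDecl <$> f ref <*> bitraverse f pure ast
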
With this, we can now batch load both the texts and term infos from an AST in one pass each.

loadTerm :: TermReference -> Transaction (AST TermInfo Text)
loadTerm termRef = do
  ast <- loadAST termRef
  astWithText <- batchLoadTextsOf astTexts ast
  hydratedAST  <- batchLoadTermInfosOf astTermReferences astWithText
  pure hydratedAST

Scaling up

Okay, now we're cooking: we've reduced the number of queries per term from 1 + numTexts + numTermRefs down to a flat 3 queries per term, which is a huge improvement. But there's more to do.

What if we need to load a whole batch of ASTs at once? Here's a first attempt:

-- Assume these batch loaders are in scope:
batchLoadTermASTs :: Traversal s t TermReference (AST TermReference TextId) -> s -> Transaction t
batchLoadTermInfos :: Traversal s t TermReference TermInfo -> s -> Transaction t
batchLoadTexts :: Traversal s t TextId Text -> s -> Transaction t

batchLoadTerms :: Map TermReference TextId -> Transaction (Map TermReference (AST TermInfo Text))
batchLoadTerms termsMap = do
  termASTsMap <- batchLoadTermASTs traverse termsMap
  for termASTsMap \ast -> do
    astWithTexts <- batchLoadTexts astTexts ast
    hydratedAST <- batchLoadTermInfos astTermReferences astWithTexts
    pure hydratedAST

This naive approach loads the ASTs in a batch, but then traverses over the resulting ASTs, batch loading the terms and texts. This is better than no batching at all, but we're still running queries in a loop: 2 queries for each term in the map is still O(N) queries. We can do better.

Luckily, Traversals are easily composable! We can effectively distribute the for loop into our batch calls by composing in an additional traverse so each traversal is applied to every element of the outer map. In case you're not familiar with optics, just note that traversals compose from outer to inner, left to right, using (.); it looks like this:

batchLoadTerms :: Map TermReference TextId -> Transaction (Map TermReference (AST TermInfo Text))
batchLoadTerms termsMap = do
  termASTsMap <- batchLoadTermASTs traverse termsMap
  astsMapWithTexts <- batchLoadTexts (traverse . astTexts) termASTsMap
  hydratedASTsMap <- batchLoadTermInfos (traverse . astTermReferences) astsMapWithTexts
  pure hydratedASTsMap

If you want, you can even pipeline it like so:

  batchLoadTermASTs traverse termsMap
    >>= batchLoadTexts (traverse . astTexts)
    >>= batchLoadTermInfos (traversed . astTermReferences)

It was a small change, but this performs much better at scale: we went from O(N) queries to O(1) queries. That is, we now run EXACTLY 3 queries no matter how many terms we're loading. Pretty cool! In fact, the latter two queries have no data-dependencies on each other, so you can also pipeline them if your DB supports that, but I'll leave that as an exercise (or come ask me on bluesky).

That's basically the technique; the next section will show a few tweaks which help me to use it at application scale.

Additional tips

Let's revisit the database layer where we actually make the batch query:

import Control.Lens qualified as Lens
import Data.Functor.Contravariant

-- Take a traversal, then a structure 's', and replace all TextIds with Texts to
-- transform it into a 't'
batchLoadTextsOf :: Lens.Traversal s t TextId Text -> s -> Transaction t
batchLoadTextsOf traversal s = do
  let textIds = Lens.toListOf (traversalToFold traversal) s
  resolvedTexts <- fetchTexts textIds
  Lens.forOf traversal s $ \textId -> case Map.lookup textId resolvedTexts of
    Nothing -> throwError $ MissingText textId
    Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <- queryListColumns [sql|
        SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds})
      |]
      pure $ Map.fromList resolvedTexts

traversalToFold ::
  (Applicative f, Contravariant f) =>
  Lens.Traversal s t a b ->
  Lens.LensLike' f s a
traversalToFold traversal f s = phantom $ traversal (phantom . f) s

This pattern is totally fine, but it does involve materializing and sorting a Map of all the results, which also requires an Ord instance on the database key we use. Here's an alternative approach:

import Control.Lens (unsafePartsOf, (%%~), (&))
import Control.Lens qualified as Lens
-- Take a traversal, then a structure 's', and replace all TextIds with Texts to
-- transform it into a 't'
batchLoadTextsOf :: Lens.Traversal s t TextId Text -> s -> Transaction t
batchLoadTextsOf traversal s = do
  s & unsafePartsOf traversal %%~ \textIds -> do
      let orderedIds = zip [0 :: Int32 ..] textIds
      queryListColumns [sql|
        WITH text_ids(ord, id) AS (
          SELECT * FROM unnest(#{toArray orderedIds}) AS ids(ord, id)
        )
        SELECT texts.text 
          FROM texts JOIN text_ids ON texts.id = text_ids.id
        ORDER BY text_ids.ord ASC
      |]

Using unsafePartsOf allows us to act on the foci of a traversal as though they were in a simple list. The unsafe bit is that it will crash if we don't return a list with the exact same number of elements, so be aware of that, but it's the same crash we'd have gotten in our old version if an ID was missing a value.

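To see the semantics in a pure setting: unsafePartsOf turns a traversal into a lens onto the list of foci, so we can manipulate all of them as one batch. A tiny example:

import Control.Lens (over, traversed, unsafePartsOf)

-- Reverse the focused elements "in place" through the traversal:
-- reverseFoci [1, 2, 3] == [3, 2, 1]
reverseFoci :: [Int] -> [Int]
reverseFoci = over (unsafePartsOf traversed) reverse
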
This also allows us to avoid the song-and-dance for converting the incoming traversal into a fold.

We need the ord column simply because SQL doesn't guarantee any specific result order unless we specify one. This pairs up result rows positionally with the input IDs, and so it doesn't require any Ord instance.

We can wrap unsafePartsOf with our own combinator to add a few additional features.

Here's a version which will deduplicate IDs in the input list, will skip the action if the input list is empty, and will provide a nice error with a callstack if anything goes sideways.

asListOf :: (HasCallStack, Ord a) => Traversal s t a b -> Traversal s t [a] [b]
asListOf trav f s =
  s
    & unsafePartsOf trav %%~ \case
      -- No point making a database call which will return no results
      [] -> pure []
      inputs -> do
        -- First, deduplicate the inputs as a self indexed map.
        let asMap = Map.fromList (zip inputs inputs)
        asMap
          -- Call the action with the list of deduped inputs
          & unsafePartsOf traversed f
          <&> \resultMap ->
            -- Now map the result for each input in the original list to its result value
            let resultList = mapMaybe (\k -> Map.lookup k resultMap) inputs
                aLength = length inputs
                bLength = length resultList
             in if aLength /= bLength
                  -- Better error message if our query is bad and returns the wrong number of elements.
                  then error $ "asListOf: length mismatch, expected " ++ show aLength ++ " elements, got " ++ show bLength <> " elements"
                  else resultList

Using a tool like this has caveats: it's very easy to cause runtime crashes if your query isn't written to always return the same number of results as it was given inputs, and skipping the action on empty lists could result in some confusion.

Conclusion

I've gotten a ton of use out of this technique in Unison Share, and managed to speed things up by 2 orders of magnitude. I was also able to perform a fully batched rewrite of heavily nested code without needing to re-arrange the code-graph. This was particularly useful because it allowed me to convert large portions of the codebase in smaller pieces, using batched methods with a simple id Traversal, and plain traverse for methods I hadn't rewritten yet.

You may not get such huge gains if your code isn't pessimistically linear to begin with, but even then this is a nice, composable way to write batch code in the first place.

Anyways, give it a go and let me know what you think of it!

Hopefully you learned something 🤞! Did you know I'm currently writing a book? It's all about Lenses and Optics! It takes you all the way from beginner to optics-wizard and it's currently in early access! Consider supporting it, and more posts like this one, by pledging on my Patreon page! It takes quite a bit of work to put these things together; if I managed to teach you something, or even just entertain you for a minute or two, maybe send a few bucks my way for a coffee? Cheers!


August 11, 2025 12:00 AM

August 08, 2025

Well-Typed.Com

Well-Typed at ZuriHac 2025

Well-Typed was strongly represented at this year’s ZuriHac, with our team of Haskell experts giving eight talks across ZuriHac itself and the Haskell Ecosystem and Implementors’ Workshops. We’re pleased to report that the recordings are now available.

ZuriHac Beginners Track

Andres hosted the Beginners Track at ZuriHac, delivering a four-hour tutorial that covers all the fundamentals of the Haskell language. It’s an excellent starting point for anyone interested in learning Haskell, taught by one of the community’s most experienced Haskell educators.

Haskell Ecosystem Workshop

Matt was lucky to be invited to give a talk about our work on memory profiling over the last five years. Profiling and observability have been a key focus for Well-Typed. We have developed tooling which allows easy and powerful introspection into the runtime performance of Haskell programs. You can read more about our work in this area in posts tagged with profiling.

Haskell Implementors Workshop

The Haskell Implementors Workshop was a great opportunity to share our progress on improvements to GHC over the last year. It’s always nice to take a moment to reflect on the progress we’ve made and the work we’ve done.

Ben and Andreas kicked things off with the annual GHC status report. This report provides a summary of the essential maintenance and community stewardship work which Well-Typed performs for the GHC project.

Hannes introduced recent improvements to GHCi to support multi-unit sessions natively. This is the latest in our long-running work to improve the ecosystem support for project-based workflows with many different packages being developed in parallel.

Rodrigo showcased his work on a standalone step-through debugger for GHC. We have implemented a GHC API application which uses the Debug Adapter Protocol to communicate with any debugger frontend. We look forward to releasing this work to the public in the near future, which will give Haskell programmers access to a maintained and powerful debugger.

Matt presented the work on Explicit Level Imports, which aims to make it clear what exactly is needed by Template Haskell during compilation and what is needed at runtime. This is an important stepping stone to improving the developer experience for projects relying on both cross compilation and Template Haskell.

Finally, there were two more research-oriented presentations.

Matt presented some joint work with his collaborator Ellis Kesteron on a possible improvement to the desugaring of Typed Template Haskell quotations, which would make it easier to perform well-typed intensional syntax analysis.

Andreas presented his idea for expressing the strictness properties of a function at the type level. His talk explored different ideas about how these annotations may affect unboxing and optimisation passes such as the worker-wrapper transformation.

Conclusion

Well-Typed offer Haskell Ecosystem Support Packages in partnership with the Haskell Foundation, to provide commercial users with support from Well-Typed’s experts, while investing in the Haskell community and its technical ecosystem. These projects were made possible by funding from our clients, notably Mercury, who are improving the experience for Haskell developers by supporting foundational work on Haskell tools.

It was great to meet everyone who attended the workshops and asked interesting questions during and after the talks. We hope to see you all again next year!

by matthew at August 08, 2025 12:00 AM

GHC Developer Blog

GHC 9.10.3-rc3 is now available

GHC 9.10.3-rc3 is now available

wz1000 - 2025-08-08

The GHC developers are very pleased to announce the availability of the third release candidate for GHC 9.10.3. Binary distributions, source distributions, and documentation are available at downloads.haskell.org and via GHCup.

GHC 9.10.3 is a bug-fix release fixing over 50 issues of a variety of severities and scopes. A full accounting of these fixes can be found in the release notes. As always, GHC’s release status, including planned future releases, can be found on the GHC Wiki status page.

The changes from the second release candidate are:

  • Reverting a change to the exports of the Backtrace constructor in the base library, which was backported due to confusion over CLC approval (!14587)
  • Reverting a change to the configure script (!14324) that dropped probing for ld.gold

This release candidate will have a two-week testing period. If all goes well the final release will be available the week of 22 August 2025.

We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable, Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors whose on-going financial and in-kind support has facilitated GHC maintenance and release management over the years. Finally, this release would not have been possible without the hundreds of open-source contributors whose work comprises this release.

As always, do give this release a try and open a ticket if you see anything amiss.

by ghc-devs at August 08, 2025 12:00 AM

August 07, 2025

Tweag I/O

Getting started with CodeQL, GitHub's declarative static analyzer for security

CodeQL is a declarative static analyzer owned by GitHub, whose purpose is to discover security vulnerabilities. Declarative means that, to use CodeQL, you write rules describing the vulnerabilities you want to catch, and you let an engine check your rules against your code. If there is a match, an alert is raised. Static means that it checks your source code, as opposed to checking specific runs. Owned by GitHub means that CodeQL’s engine is not open-source: it’s free to use only on research and open-source code. If you want to use CodeQL on proprietary code, you need a GitHub Advanced Security license. The CodeQL rules that model specific programming languages and libraries, however, are open-source.

CodeQL is designed to do two things:

  1. Perform all kinds of quality and compliance checks. CodeQL’s query language is expressive enough to describe a variety of patterns (e.g., “find any loop, enclosed in a function named foo, where the loop’s body contains a call to function bar”). As such, it enables complex, semantic queries over codebases, which can uncover a wide range of issues and patterns.
  2. Track the flow of tainted data. Tainted data is data provided by a potentially malicious user. If tainted data is sent to critical operations (database requests, custom processes) without being sanitized, it can have catastrophic consequences, such as data loss, a data breach, arbitrary code execution, etc. Statements of your source code from where tainted data originates are called sources, while statements of your source code where tainted data is consumed are called sinks.

This tutorial is targeted at software and security engineers who want to try out CodeQL, focusing on the second use case above. I explain how to set up CodeQL and how to write your first taint tracking query, and I give a methodology for doing so.

Writing the vulnerable code

First, I need to write some code to execute my query against. As the attack surface, I’m choosing calls to the sarge Python library, for three reasons:

  • It is available on PyPI, so it is easy to install.
  • It is niche enough that it is not already modeled in CodeQL’s Python standard library, so out of the box queries from CodeQL won’t catch vulnerabilities that use sarge. We need to write our own rules.
  • It performs calls to subprocess.Popen, which is a data sink. As a consequence, code calling sarge is prone to having command injection vulnerabilities.

For my data source, I use flask. That’s because HTTP requests contain user-provided data, and as such, they are modeled as data sources in CodeQL’s standard library. With both sarge and flask in place, we can write the following vulnerable code:

from flask import Flask, request

import sarge

app = Flask(__name__)


@app.route("/", methods=["POST"])
def user_to_sarge_run():
    """This function shows a vulnerability: it forwards user input (through a POST request) to sarge.run."""
    print("/ handler")
    if request.method != "POST":
        return "Method not allowed"
    default_value = "default"
    received: str = request.form.get("key", "default")
    print(f"Received: {received}")
    sarge.run(received)  # Unsafe, don't do that!
    return "Called sarge"

To run the application locally, execute in one terminal:

> flask --debug run

In another terminal, trigger the vulnerability as follows:

> curl -X POST http://localhost:5000/ -d "key=ls"

Now observe that in the terminal running the app, the ls command (provided by the user! 💣) was executed:

/ handler
Received: ls
app.py	__pycache__  README.md	requirements.txt

Wow, pretty scary, right? What if I had passed the string rm -Rf ~/*? Now let’s see how to catch this vulnerability with CodeQL.

Running CodeQL on the CLI

To run CodeQL on the CLI, I need to download the CodeQL binaries from the github/codeql-cli-binaries repository. At the time of writing, there are CodeQL binaries for the three major platforms. Where I clone this repository doesn’t matter, as long as the codeql binary ends up in PATH. Then, because I am going to write my own queries (as opposed to solely using the queries shipped with CodeQL), I need to clone CodeQL’s standard library: github/codeql. I recommend putting this repository in a folder that is a sibling of the repository being analyzed. In this manner, the codeql binary will find it automatically.

Before I write my own query, let’s run standard CodeQL queries for Python. First, I need to create a database. Instead of analyzing code at each run, CodeQL’s way of operating is to:

  1. Store the code in a database,
  2. Then run one or many queries on the database.

While I develop a query, and so iterate on step 2 above, keeping the two steps distinct saves computing time: as long as the code being analyzed doesn’t change, there is no need to rebuild the database. Let’s create the database as follows:

> codeql database create --language=python codeql-db --source-root=.

Now that the database is created, let’s run the python-security-and-quality queries (a set of default queries for Python, provided by CodeQL’s standard library):

> codeql database analyze codeql-db python-security-and-quality --format=sarif-latest --output=codeql.sarif
# Now, transform the SARIF output into CSV, for better human readability; using https://pypi.org/project/sarif-tools/
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/unused-local-variable,Variable default_value is not used.,app.py,12

Indeed, in the snippet above, it looks like the developer intended to use a variable to store the value "default" but forgot to use it in the end. This is not a security vulnerability, but it exemplifies the kind of programming mistakes that CodeQL’s default rules find. Note that the vulnerability of passing data from the POST request to the sarge.run call is not yet caught. That is because sarge is not in CodeQL’s list of supported Python libraries.

Writing a query to model sarge.run: modeling the source

The sarge.run function executes a command, like subprocess does. As such it is a sink for tainted data: one should make sure that data passed to sarge.run is controlled.

CodeQL performs a modular analysis: it doesn’t inspect the source code of your dependencies. As a consequence, you need to model your dependencies’ behavior for them to be treated correctly by CodeQL’s analysis. Modeling tainted sources and sinks is done by implementing the DataFlow::ConfigSig interface:

/** An input configuration for data flow. */
signature module ConfigSig {
  /** Holds if `source` is a relevant data flow source. */
  predicate isSource(Node source);

  /** Holds if `sink` is a relevant data flow sink. */
  predicate isSink(Node sink);
}

In this snippet, a predicate is a function returning a Boolean, while Node is a class modeling statements in the source code. So to implement isSource I need to capture the Nodes that we deem relevant sources of tainted data w.r.t. sarge.run. Since any source of tainted data is dangerous if its content is sent to sarge.run, I implement isSource as follows:

predicate isSource(DataFlow::Node source) { source instanceof ActiveThreatModelSource }

Threat models control which sources of data are considered dangerous. Usually, only remote sources (data in an HTTP request, packets from the network) are considered dangerous. That’s because, if local sources (content of local files, content passed by the user in the terminal) are tainted, it means an attacker already has such a level of control over your software that you are doomed. That is why CodeQL’s default threat model is to only consider remote sources.1 In isSource, by using ActiveThreatModelSource, we declare that the sources of interest are those of the currently active threat model.

To make sure that ActiveThreatModelSource works correctly on my codebase, I write the following test query in file Scratch.ql:

import python
import semmle.python.Concepts

from ActiveThreatModelSource src
select src, "Tainted data source"

Because this file depends on the python APIs of CodeQL, I need to put a qlpack.yml file close to Scratch.ql, as follows:

name: smelc/sarge-queries
version: 0.0.1
extractor: python
library: false
dependencies:
  codeql/python-queries: "*"

I can now execute Scratch.ql as follows:

> codeql database analyze codeql-db queries/Scratch.ql --format=sarif-latest --output=codeql.sarif
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/get-remote-flow-source,Tainted data source,app.py,1

This seems correct: something is flagged. Let’s make it more visual by running the query in VSCode. For that I need to install the CodeQL extension. To run queries within VSCode, I first need to specify the database to use: it is the codeql-db folder which we created with codeql database create above:

Selecting the CodeQL database in vscode

Now I run the query by right-clicking in its opened file:

Running the debug query in vscode

Doing so opens the CodeQL results view:

Result of running the debug query

I see that the import of request is flagged as a potential data source. This is correct: in my program, tainted data can come through usages of this package.

Writing a query to model sarge.run: modeling the sink

This is where things get more interesting. As per the ConfigSig interface above, I need to implement isSink(Node sink) so that it captures calls to sarge.run. Because CodeQL is a declarative2 object-oriented language, this means isSink must hold for the subclasses of Node that represent calls to sarge.run. Let me describe a methodology to discover how to do that. First, I modify the Scratch.ql query to find all instances of Node in my application:

import python
import semmle.python.dataflow.new.DataFlow

from DataFlow::Node src
select src, "DataFlow::Node"

Executing this query in VSCode yields the following results:

Result of querying for all nodes

Wow, that’s a lot of results! In a real codebase with multiple files, this would be unmanageable. Fortunately code completion works in CodeQL, so I can filter the results using the where clause, discovering the methods to call by looking at completions on the . symbol. Since the call to sarge.run I am looking for is at line 17, I can refine the query as follows:

from DataFlow::Node src, Location loc
where src.getLocation() = loc
  and loc.getFile().getBaseName() = "app.py"
  and loc.getStartLine() = 17
select src, "DataFlow::Node"

With these constraints, the query returns only a handful of results:

Results of querying some nodes

Still, there are 4 hits on line 17. Let’s see how I can disambiguate those. For this, CodeQL provides the getAQlClass predicate that returns the most specific type a variable has (as explained in CodeQL zero to hero part 3):

from DataFlow::Node src, Location loc
where src.getLocation() = loc
  and loc.getFile().getBaseName() = "app.py"
  and loc.getStartLine() = 17
select src, src.getAQlClass(), "DataFlow::Node"

See how the select clause now includes src.getAQlClass() as its second element. This makes the CodeQL Query Results view show it in the central column:

Results of getAQlClass

There are many more results, and that is because entries that were indistinguishable before are now disambiguated by their class. If in doubt, one can consult the list of classes of CodeQL’s standard Python library to understand what each class is about. In our case, I had read the official documentation on using CodeQL for Python, and I recognize the CallNode class from this list.

As the documentation explains, there is actually an API to retrieve the CallNode instances corresponding to functions imported from an external module, using the moduleImport function. Let’s use it to restrict our Nodes to instances of CallNode (using a cast) that are calls to sarge.run:

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.ApiGraphs

from DataFlow::Node src
where src.(API::CallNode) = API::moduleImport("sarge").getMember("run").getACall()
select src, "CallNode calling sarge.run"

Executing this query yields the only result we want:

Result of final debug query

Putting this all together, I can finalize the implementation of ConfigSig as shown below. The getArg(0) suffix models that the tainted data flows into sarge.run’s first argument:

private module SargeConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source instanceof ActiveThreatModelSource
  }

  predicate isSink(DataFlow::Node sink) {
    sink = API::moduleImport("sarge").getMember("run").getACall().getArg(0)
  }
}

Following the official template for queries tracking tainted data, I write the query as follows:

module SargeFlow = TaintTracking::Global<SargeConfig>;

from SargeFlow::PathNode source, SargeFlow::PathNode sink
where SargeFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Tainted data passed to sarge"

Executing this query in VSCode returns the paths (list of steps) along which the vulnerability takes place:

Result of the final query

Conclusion

I have demonstrated how to use CodeQL to model a Python library, covering the setup and the steps a developer must take to write their first CodeQL query. I gave a methodology for writing instances of CodeQL interfaces even when one lacks intimate knowledge of the CodeQL APIs. I believe this is important, as the CodeQL ecosystem is small and the number of resources is limited: users of CodeQL often have to find out what to write on their own, with limited support from both the tooling and from generative AI tools (probably because the number of resources on CodeQL is small, so the results of generative AI systems are poor too).

To dive deeper, I recommend reading the official CodeQL for Python resource and joining the GitHub Security Lab Slack to get support from CodeQL users and developers. And remember that this tutorial’s material is available at tweag/sarge-codeql-minimal if you want to experiment with it yourself!


  1. The default threat model can be overridden by command line flags and by configuration files.↩
  2. CodeQL belongs to the Datalog family of languages.↩

August 07, 2025 12:00 AM

August 06, 2025

GHC Developer Blog

GHC 9.10.3-rc2 is now available

GHC 9.10.3-rc2 is now available

wz1000 - 2025-08-06

The GHC developers are very pleased to announce the availability of the second release candidate for GHC 9.10.3. Binary distributions, source distributions, and documentation are available at downloads.haskell.org and via GHCup.

GHC 9.10.3 is a bug-fix release fixing over 50 issues of a variety of severities and scopes. A full accounting of these fixes can be found in the release notes. As always, GHC’s release status, including planned future releases, can be found on the GHC Wiki status page.

The changes from the first release candidate are:

  • Bumping the text submodule to 2.1.3
  • Reverting a bug fix (!14291) that restricted previously allowed namespace specifiers (#26250)
  • Reverting the bump of the deepseq submodule to 1.5.2.0 (#26251)

This release candidate will have a two-week testing period. If all goes well the final release will be available the week of 19 August 2025.

We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable, Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors whose on-going financial and in-kind support has facilitated GHC maintenance and release management over the years. Finally, this release would not have been possible without the hundreds of open-source contributors whose work comprises this release.

As always, do give this release a try and open a ticket if you see anything amiss.

by ghc-devs at August 06, 2025 12:00 AM

August 04, 2025

Monday Morning Haskell

An Easy Problem Made Hard: Rust & Binary Trees

In last week’s article, we completed our look at Matrix-based problems. Today, we’re going to start considering another data structure: binary trees.

Binary trees are an extremely important structure in programming. Most notably, they are the underlying structure for ordered sets, which allow logarithmic time lookups and insertions. A “tree” is made up of “nodes”, where a node can be “null” or else hold a value. If it holds a value, it also has a “left” child and a “right” child.

If you take our Solve.hs course, you’ll actually learn to implement an auto-balancing ordered tree set from scratch!

But for these next few articles, we’re going to explore some simple problems that involve binary trees. Today we’ll start with a problem that is very simple (rated as “Easy” by LeetCode), but still helps us grasp the core problem solving techniques behind binary trees. We’ll also encounter some interesting curveballs that Rust can throw at us when it comes to building more complex data structures.

The Problem

Our problem today is Invert Binary Tree. Given a binary tree, we want to return a new tree that is the mirror image of the input tree. For example, if we get this tree as an input:

-    45
    /  \
   32  50
  /  \   \
 5   40   100
    /  \
  37   43

We should output a tree that looks like this:

-    45
    /  \
   50   32
  /    /  \
 100  40   5
     /  \
    43  37

We see that 45 remains the root element, but instead of having 32 on the left and 50 on the right, these two elements are reversed on the next level. Then on the 3rd level, 40 and 5 remain children of 32, but they are also reversed from their prior orientations! This pattern continues all the way down on both sides.

The Algorithm

Binary trees (and in fact, all tree structures) lend themselves very well to recursive algorithms. We’ll use a very simple recursive algorithm here.

If the input node to our function is a “null” node, we will simply return “null” as the output. If the node has a value and children, then we’ll keep that value as the value for our output. However, we will recursively invert both of the child nodes.

Then we’ll take the inverted “left” child node and install it as the “right” child of our result. The inverted “right” child of the original input becomes the “left” child of the resulting node.

Haskell Solution

Haskell is a natural fit for this problem, since it relies so heavily on recursion. We start by defining a recursive TreeNode type. The canonical way to do this is with a Nil constructor as well as a recursive “value” constructor that actually holds the node’s value and refers to the left and right child. For this problem, we’ll just assume our tree holds Int values, so we won’t parameterize it.

data TreeNode = Nil | Node Int TreeNode TreeNode
  deriving (Show, Eq)

Now solving our problem is easy! We pattern match on the input TreeNode. For our first case, we just return Nil for Nil.

invertTree :: TreeNode -> TreeNode
invertTree Nil = Nil
invertTree (Node x left right) = ...

For our second case, we use the same Int value for the value of our result. Then we recursively call invertTree on the right child, but put this in the place of the left child for our new result node. Likewise, we recursively invert the left child of our original and use this result for the right of our result.

invertTree :: TreeNode -> TreeNode
invertTree Nil = Nil
invertTree (Node x left right) = Node x (invertTree right) (invertTree left)

Very easy!
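As a quick sanity check, here's a sketch (the example value is mine, not part of the original problem) that runs invertTree on the tree from the start of the post:

example :: TreeNode
example =
  Node 45
    (Node 32 (Node 5 Nil Nil) (Node 40 (Node 37 Nil Nil) (Node 43 Nil Nil)))
    (Node 50 Nil (Node 100 Nil Nil))

-- invertTree example evaluates to:
--   Node 45
--     (Node 50 (Node 100 Nil Nil) Nil)
--     (Node 32 (Node 40 (Node 43 Nil Nil) (Node 37 Nil Nil)) (Node 5 Nil Nil))

which is exactly the mirrored tree drawn above.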

C++ Solution

In a non-functional language, it is still quite possible to solve this problem without recursion, but this is an occasion where we get very nice, clean code with recursion. As a rare treat, we’ll actually start with a C++ solution instead of jumping to Rust right away.

We would start by defining our TreeNode with a struct. Instead of relying on a separate Nil constructor, we use raw pointers for all our tree nodes. This means they can all potentially be nullptr.

struct TreeNode {
    int val;
    TreeNode* left;
    TreeNode* right;

    TreeNode(int v, TreeNode* l, TreeNode* r) : val(v), left(l), right(r) {};
};

TreeNode* invertTree(TreeNode* root) {
    ...
}

And our solution looks almost as easy as the Haskell solution:

TreeNode* invertTree(TreeNode* root) {
    if (root == nullptr) {
        return nullptr;
    }

    return new TreeNode(root->val, invertTree(root->right), invertTree(root->left));
}

Rust Solution

In Rust, it’s not quite as easy to work with recursive structures because of Rust’s memory system. In C++, we used raw pointers, which are fast but can cause significant problems if you aren’t careful (e.g. dereferencing null pointers, or leaking memory). Haskell uses garbage-collected memory, which is slower but allows us to write simple code that won’t blow up in the weird ways C++ can.

Rust’s Memory System

Rust seeks to be fast like C++, while making it hard to do high-risk things like dereferencing a potentially null pointer or leaking memory. It does this using the concept of “ownership”, which is a tricky concept to understand at first.

The ownership model makes it a bit harder for us to write a basic recursive data structure. To write a basic binary tree, you’d have to answer questions like:

  1. Who “owns” the child nodes?
  2. Can I write a function that accesses the child nodes without taking ownership of them? What if I have to modify them?
  3. Can I copy a reference to a child node without copying the entire sub-structure?
  4. Can I create a “new” tree that references part of another tree without copying?

Writing a TreeNode

Here’s the TreeNode struct provided by LeetCode for solving this problem. We can see that references to the nodes themselves are held within 3(!) wrapper types:

#[derive(Debug, PartialEq, Eq)]
pub struct TreeNode {
  pub val: i32,
  pub left: Option<Rc<RefCell<TreeNode>>>,
  pub right: Option<Rc<RefCell<TreeNode>>>,
}

impl TreeNode {
  #[inline]
  pub fn new(val: i32) -> Self {
    TreeNode {
      val,
      left: None,
      right: None
    }
  }
}

pub fn invert_tree(root: Option<Rc<RefCell<TreeNode>>>) -> Option<Rc<RefCell<TreeNode>>> {
}

From inside to outside, here’s what the three wrappers mean:

  1. RefCell is a mutable, shareable container for data.
  2. Rc is a reference counting container. It automatically tracks how many references there are to the RefCell. The cell is de-allocated once this count is 0.
  3. Option is Rust’s equivalent of Maybe. This lets us use None for an empty tree.

Rust normally only permits either a single mutable reference or multiple immutable references to a value. RefCell relaxes this by moving the borrow checks to runtime: it lets us mutate data behind shared references, as long as only one mutable borrow is live at a time. Let’s see how we can use these to write our invert_tree function.

Solving the Problem

We start by “cloning” the root input reference. Normally, “clone” means a deep copy, but in our case, this doesn’t actually copy the entire tree! Because it is wrapped in Rc, we’re just getting a new reference to the data in RefCell. We conditionally check if this is a Some wrapper. If it is None, we just return the root.

pub fn invert_tree(root: Option<Rc<RefCell<TreeNode>>>) -> Option<Rc<RefCell<TreeNode>>> {
    if let Some(node) = root.clone() {
        ...
    }
    return root;
}

If we didn’t “clone” root, the compiler would complain that we are “moving” the value in the condition, which would invalidate the prior reference to root.

Next, we use borrow_mut to get a mutable reference to the TreeNode inside the RefCell. This node_ref finally gives us something of type TreeNode so that we can work with the individual fields.

pub fn invert_tree(root: Option<Rc<RefCell<TreeNode>>>) -> Option<Rc<RefCell<TreeNode>>> {
    if let Some(node) = root.clone() {
        let mut node_ref = node.borrow_mut();
        ...
    }
    return root;
}

Now for node_ref, both left and right have the full wrapper type Option<Rc<RefCell<TreeNode>>>. We want to recursively call invert_tree on these. Once again though, we have to call clone before passing these to the recursive function.

pub fn invert_tree(root: Option<Rc<RefCell<TreeNode>>>) -> Option<Rc<RefCell<TreeNode>>> {
    if let Some(node) = root.clone() {
        let mut node_ref = node.borrow_mut();

        // Recursively invert left and right subtrees
        let left = invert_tree(node_ref.left.clone());
        let right = invert_tree(node_ref.right.clone());

        ...
    }
    return root;
}

Now because we have a mutable reference in node_ref, we can install these new results as its left and right subtrees!

pub fn invert_tree(root: Option<Rc<RefCell<TreeNode>>>) -> Option<Rc<RefCell<TreeNode>>> {
    if let Some(node) = root.clone() {
        let mut node_ref = node.borrow_mut();

        // Recursively invert left and right subtrees
        let left = invert_tree(node_ref.left.clone());
        let right = invert_tree(node_ref.right.clone());

        // Swap them
        node_ref.left = right;
        node_ref.right = left;
    }
    return root;
}

And now we’re done! We don’t need a separate return statement inside the if. We have modified node_ref, which is still a reference to the same data as root holds. So returning root returns our modified tree.

Conclusion

Even though this was a simple problem with a basic recursive algorithm, we saw how Rust presented some interesting difficulties in applying this algorithm. Languages all make different tradeoffs, so every language has some example where it is difficult to write code that is simple in other languages. For Rust, this is recursive data structures. For Haskell though, it’s things like mutable arrays.

If you want to get some serious practice with binary trees, you should sign up for our problem solving course, Solve.hs. In Module 2, you’ll actually get to implement a balanced tree set from scratch, which is a very interesting and challenging problem that will stretch your knowledge!

by James Bowen at August 04, 2025 08:30 AM

August 02, 2025

Abhinav Sarkar

A Fast Bytecode VM for Arithmetic: The Parser

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics:

This is the first post in a series of posts:

  1. A Fast Bytecode VM for Arithmetic: The Parser
  2. A Fast Bytecode VM for Arithmetic: The Compiler
  3. A Fast Bytecode VM for Arithmetic: The Virtual Machine

In this post, we write the parser for our expression language to an AST, and an AST interpreter.

Introduction

The language that we are going to work with is that of basic arithmetic expressions, with integer values, and addition, subtraction, multiplication and integer division operations. However, our expression language has a small twist: it is possible to introduce a variable using a let binding and use the variable in the expressions in the body of let1. Furthermore, we use the same syntax for let as Haskell does. Here are some examples of valid expressions in our language:

1 + 2 - 3 * 4 + 5 / 6 / 0 + 1
let x = 4 in x + 1
let x = 4 in let y = 5 in x + y
let x = 4 in let y = 5 in x + let z = y in z * z
let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z
let x = (let y = 3 in y + y) in x * 3
let x = let y = 3 in y + y in x * 3
let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3

The only gotcha here is that the body of a let expression extends as far as possible while accounting for nested lets. It becomes clear when we look at parsed expressions later.

The eventual product is a command-line tool that can run different commands. Let’s start with a demo of the tool:

$ arith-vm -h
Bytecode VM for Arithmetic written in Haskell

Usage: arith-vm COMMAND

Available options:
  -h,--help                Show this help text

Available commands:
  parse                    Parse expression to AST
  compile                  Parse and compile expression to bytecode
  disassemble              Disassemble bytecode to opcodes
  decompile                Disassemble and decompile bytecode to expression
  interpret-ast            Parse expression and interpret AST
  interpret-bytecode       Parse, compile and assemble expression, and
                           interpret bytecode
  run                      Run bytecode
  generate                 Generate a random arithmetic expression

$ arith-vm parse -h
Usage: arith-vm parse [FILE]

  Parse expression to AST

Available options:
  FILE                     Input file, pass - to read from STDIN (default)
  -h,--help                Show this help text

$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm parse
( let x = 1 in ( let y = 2 in ( y + ( x * 3 ) ) ) )

$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm compile > a.tbc

$ hexdump -C a.tbc
00000000  00 01 00 00 02 00 03 01  03 00 00 03 00 06 04 02  |................|
00000010  01 02 01                                          |...|
00000013

$ arith-vm disassemble a.tbc
OPush 1
OPush 2
OGet 1
OGet 0
OPush 3
OMul
OAdd
OSwap
OPop
OSwap
OPop

$ arith-vm decompile a.tbc
( let v0 = 1 in ( let v1 = 2 in ( v1 + ( v0 * 3 ) ) ) )

$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm interpret-ast
5

$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm interpret-bytecode
5

$ arith-vm run a.tbc
5

$ arith-vm generate
(
  (
    (
      ( let nD =
        ( 11046 - -20414 ) in
        ( let xqf = ( -15165 * nD ) in nD )
      ) * 26723
    ) /
    (
      ( let phMuOI =
        ( let xQ = ( let mmeBy = -28095 in 22847 ) in 606 ) in 25299
      ) *
      ( let fnoNQm = ( let mzZaZk = 29463 in 18540 ) in ( -2965 / fnoNQm ) )
    )
  ) * 21400
)

We can parse an expression, or compile it to bytecode. We can also disassemble bytecode to opcodes, or decompile it back to an expression. We can interpret an expression either as an AST or as bytecode. We can also run a bytecode file directly. Finally, we have a handy command to generate random expressions for testing/benchmarking purposes2.

Let’s start.

Expressions

Since this is Haskell, we start by listing many language extensions and imports:

{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE UndecidableInstances #-}

module ArithVMLib
  ( Expr(..), Ident(..), Op(..), Pass(..), Error(..), Opcode(..), Bytecode,
    sizedExpr, parse, parseSized, compile', compile, decompile, disassemble,
    exprGen, interpretAST, interpretBytecode', interpretBytecode ) where

import Control.Applicative ((<|>))
import Control.DeepSeq (NFData)
import Control.Exception (Exception, catch, throwIO)
import Control.Monad (unless, void)
import Control.Monad.Except (MonadError (..), runExceptT)
import Control.Monad.ST.Strict (runST)
import Data.Attoparsec.ByteString.Char8 qualified as P
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.ByteString qualified as BS
import Data.ByteString.Char8 qualified as BSC
import Data.ByteString.Internal qualified as BSI
import Data.ByteString.Unsafe qualified as BS
import Data.Char (toUpper)
import Data.HashMap.Strict qualified as Map
import Data.Hashable (Hashable)
import Data.Int (Int16)
import Data.List qualified as List
import Data.Maybe (fromMaybe)
import Data.Primitive.PrimArray qualified as PA
import Data.Sequence (Seq (..), (|>))
import Data.Sequence qualified as Seq
import Data.Set qualified as Set
import Data.Strict.Tuple (Pair ((:!:)))
import Data.Strict.Tuple qualified as TS
import Data.Word (Word16, Word8)
import Foreign.Ptr (Ptr, minusPtr, plusPtr)
import Foreign.Storable (poke)
import GHC.Generics (Generic)
import Test.QuickCheck qualified as Q
ArithVMLib.hs

We use the GHC2021 extension here, which enables a lot of useful GHC extensions by default. We are using the bytestring and attoparsec libraries for parsing; strict, containers and unordered-containers for compilation; deepseq, mtl and primitive for interpreting; and QuickCheck for testing.

The first step is to parse an expression into an Abstract Syntax Tree (AST). We represent the AST as Haskell Algebraic Data Types (ADTs):

data Expr
  = Num !Int16
  | Var !Ident
  | BinOp !Op Expr Expr
  | Let !Ident Expr Expr
  deriving (Eq, Generic)

newtype Ident = Ident {unIdent :: BS.ByteString}
  deriving (Eq, Ord, Generic, Hashable)

data Op = Add | Sub | Mul | Div deriving (Eq, Enum, Generic)

instance NFData Expr

instance Show Expr where
  show = \case
    Num n -> show n
    Var (Ident x) -> BSC.unpack x
    BinOp op a b -> "(" <> show a <> " " <> show op <> " " <> show b <> ")"
    Let (Ident x) a b ->
      "(let " <> BSC.unpack x <> " = " <> show a <> " in " <> show b <> ")"

instance NFData Ident

instance Show Ident where
  show (Ident x) = BSC.unpack x

mkIdent :: String -> Ident
mkIdent = Ident . BSC.pack

instance NFData Op

instance Show Op where
  show = \case
    Add -> "+"
    Sub -> "-"
    Mul -> "*"
    Div -> "/"
ArithVMLib.hs

We add Show instances for ADTs so that we can pretty-print the parsed AST3. Now, we can start parsing.

Parsing Expressions

The EBNF grammar for expressions is as follows:

expr     ::= term | term space* ("+" | "-") term
term     ::= factor | factor space* ("*" | "/") factor
factor   ::= space* (grouping | num | var | let)
grouping ::= "(" expr space* ")"
num      ::= "-"? [1-9] [0-9]*
var      ::= ident
ident    ::= ([a-z] | [A-Z])+
let      ::= "let" space+ ident space* "=" expr space* "in" space+ expr space*
space    ::= " " | "\t" | "\n" | "\f" | "\r"

The expr, term, factor, and grouping productions take care of giving the arithmetic operators the right precedence: for example, 1 + 2 * 3 parses as (1 + (2 * 3)) because the * is consumed by the term production before expr sees the +. The num and var productions are trivial. Our language is fairly oblivious to whitespace; we allow zero or more spaces in most places.

The grammar for let expressions is pretty standard, except that we require one or more spaces after the let and in keywords to keep them unambiguous.

We use the parser combinator library attoparsec to create the parser. attoparsec works directly on bytestrings, so we don’t incur the cost of decoding Unicode characters45.

We write the parser in a top-down fashion, same as the grammar, starting with the expr parser:

type SizedExpr = (Expr, Int)

-- expr ::= term | term space* ("+" | "-") term
exprParser :: P.Parser SizedExpr
exprParser = chainBinOps termParser $ \case
  '+' -> pure Add
  '-' -> pure Sub
  op -> fail $ "Expected '+' or '-', got: " <> show op

-- term ::= factor | factor space* ("*" | "/") factor
termParser :: P.Parser SizedExpr
termParser = chainBinOps factorParser $ \case
  '*' -> pure Mul
  '/' -> pure Div
  op -> fail $ "Expected '*' or '/', got: " <> show op

chainBinOps :: P.Parser SizedExpr -> (Char -> P.Parser Op) -> P.Parser SizedExpr
chainBinOps operandParser operatorParser = operandParser >>= rest
  where
    rest (!expr, !size1) =
      ( do
          P.skipSpace
          c <- P.anyChar
          operator <- operatorParser c
          (operand, !size2) <- operandParser
          rest (BinOp operator expr operand, size1 + size2 + 1)
      ) <|> pure (expr, size1)
{-# INLINE chainBinOps #-}
ArithVMLib.hs

One small complication: our parsers return not only the parsed expressions, but also the number of bytes they will occupy when compiled to bytecode. We gather this information while building the AST in parts, and propagate it upward through the tree. We use the bytecode size later in the compilation pass6.
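To make the sizes concrete, here is a small sketch using the parseSized function defined later in this post. Each number costs 3 bytes (an opcode byte plus its 2-byte Int16 operand, as we'll see in the next post), and each operator costs 1 byte:

parseSized "1 + 2"
-- Right ((1 + 2),7): 3 bytes per number, plus 1 byte for the + opcode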

Both exprParser and termParser chain the right higher precedence parsers with the right operators between them7 using the chainBinOps combinator.

-- factor ::= space* (grouping | num | var | let)
factorParser :: P.Parser SizedExpr
factorParser = do
  P.skipSpace
  P.peekChar' >>= \case
    '(' -> groupingParser
    '-' -> numParser
    c | P.isDigit c -> numParser
    c | c /= 'l' -> varParser
    _ -> varParser <|> letParser

-- grouping ::= "(" expr space* ")"
groupingParser :: P.Parser SizedExpr
groupingParser = P.char '(' *> exprParser <* P.skipSpace <* P.char ')'
ArithVMLib.hs

factorParser uses lookahead to dispatch to one of the primary parsers, which is faster than backtracking. groupingParser simply skips the parentheses, and recursively calls exprParser.

-- num ::= "-"? [1-9] [0-9]*
numParser :: P.Parser SizedExpr
numParser = do
  n <- P.signed P.decimal P.<?> "number"
  if validInt16 n
    then pure (Num $ fromIntegral n, 3)
    else fail $ "Expected a valid Int16, got: " <> show n
  where
    validInt16 :: Integer -> Bool
    validInt16 i =
      fromIntegral (minBound @Int16) <= i
        && i <= fromIntegral (maxBound @Int16)
ArithVMLib.hs

numParser uses the signed and decimal parsers from the attoparsec library to parse an optionally signed integer. We restrict numbers to 2-byte integers (-32768 to 32767, inclusive)8. The <?> helper from attoparsec names a parser so that the error message shown in case of failure points to the right parser.

-- var ::= ident
varParser :: P.Parser SizedExpr
varParser = (,2) . Var <$> identParser

-- ident ::= ([a-z] | [A-Z])+
identParser :: P.Parser Ident
identParser = do
  ident <- P.takeWhile1 P.isAlpha_ascii P.<?> "identifier"
  if isReservedKeyword ident
    then fail $
      "Expected identifier, got: \"" <> BSC.unpack ident
        <> "\", which is a reversed keyword"
    else pure $ Ident ident
{-# INLINE identParser #-}

isReservedKeyword :: BSC.ByteString -> Bool
isReservedKeyword = \case
  "let" -> True
  "in" -> True
  _ -> False
{-# INLINE isReservedKeyword #-}
ArithVMLib.hs

varParser and identParser are straightforward. We restrict identifiers to uppercase and lowercase ASCII letters. We also check that our reserved keywords (let and in) are not used as identifiers.

Finally, we write the parser for let expressions:

-- let ::= "let" space+ ident space* "=" expr space* "in" space+ expr space*
letParser :: P.Parser SizedExpr
letParser = do
  expect "let" <* skipSpace1
  !x <- identParser
  P.skipSpace *> expect "="
  (assign, !aSize) <- exprParser
  P.skipSpace *> expect "in" <* skipSpace1
  (body, !bSize) <- exprParser <* P.skipSpace
  pure (Let x assign body, aSize + bSize + 1)
  where
    expect s =
      void (P.string s) <|> do
        found <- P.manyTill P.anyChar (void P.space <|> P.endOfInput)
        let found' = if found == "" then "end-of-input" else "\"" <> found <> "\""
        fail $ "Expected: \"" <> BSC.unpack s <> "\", got: " <> found'

    skipSpace1 = P.space *> P.skipSpace
ArithVMLib.hs

In letParser, we use identParser to parse the variable name, and recursively call exprParser to parse the assignment and body expressions, while making sure to correctly parse the spaces. The helper parser expect is used to parse known string tokens (let, = and in), and to provide good error messages in case of failure. Speaking of error messages…

Error Handling

Let’s figure out an error handling strategy. We use an Error type wrapped in Either to propagate the errors in our program:

data Error = Error !Pass !String
  deriving (Generic)

instance Eq Error where
  (Error _ m1) == (Error _ m2) = m1 == m2

instance Show Error where
  show (Error pass msg) = show pass <> " error: " <> msg

instance NFData Error
instance Exception Error

data Pass
  = Parse
  | Compile
  | Decompile
  | Disassemble
  | InterpretAST
  | InterpretBytecode
  deriving (Show, Eq, Generic)

instance NFData Pass

type Result = Either Error
ArithVMLib.hs

The Error type also captures the Pass in which the error is thrown. Result is a type alias that represents either an error or a result. Finally, we put all the parsers together to write the parse function.

The Parser

Our parseSized function uses the parse function from attoparsec to run the exprParser over an input.

parseSized :: BS.ByteString -> Result SizedExpr
parseSized = processResult . P.parse (exprParser <* P.skipSpace)
  where
    processResult = \case
      P.Done "" res -> pure res
      P.Done leftover _ ->
        throwParseError $
          "Leftover input: \"" <> BSC.unpack leftover <> "\""
      P.Partial f -> processResult $ f ""
      P.Fail _ [] err ->
        throwParseError . capitalize . fromMaybe err $
          List.stripPrefix "Failed reading: " err
      P.Fail "" ctxs _ ->
        throwParseError $
          "Expected: " <> formatExpected ctxs <> ", got: end-of-input"
      P.Fail leftover ctxs _ ->
        throwParseError $
          "Expected: " <> formatExpected ctxs
            <> ", got: \"" <> head (words $ BSC.unpack leftover) <> "\""

    capitalize ~(c : cs) = toUpper c : cs

    formatExpected ctxs = case last ctxs of
      [c] -> "\'" <> [c] <> "\'"
      s -> s

    throwParseError = throwError . Error Parse

parse :: BS.ByteString -> Result Expr
parse = fmap fst . parseSized
{-# INLINE parse #-}
ArithVMLib.hs

The processResult function deals with the intricacies of how attoparsec returns parsing results. Basically, we inspect the returned result and throw appropriate errors with useful error messages. We use throwError from the MonadError typeclass, which works with all its instances, of which Either is one.

Finally, we throw away the bytecode size from the result of parseSized in the parse function.
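Here is a quick GHCi sketch of the two entry points (with OverloadedStrings enabled):

ghci> parseSized "let x = 1 in x + 2"
Right ((let x = 1 in (x + 2)),10)
ghci> parse "let x = 1 in x + 2"
Right (let x = 1 in (x + 2))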

The parser is done. But as good programmers, we must make sure that it works correctly. Let’s write some unit tests.

Testing the Parser

We use the hspec library to write unit tests for our program. Each test is written as a spec9.

{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import ArithVMLib
import Control.Arrow ((>>>))
import Control.Monad (forM_, (>=>))
import Data.ByteString.Char8 qualified as BSC
import Data.Int (Int16)
import Data.Sequence qualified as Seq
import Test.Hspec
import Test.Hspec.QuickCheck
import Test.QuickCheck qualified as Q

parserSpec :: Spec
parserSpec = describe "Parser" $ do
  forM_ parserSuccessTests $ \(input, result) ->
    it ("parses: \"" <> BSC.unpack input <> "\"") $ do
      (show <$> parse input) `shouldBe` Right result

  forM_ parserErrorTests $ \(input, err) ->
    it ("fails for: \"" <> BSC.unpack input <> "\"") $ do
      parse input `shouldSatisfy` \case
        Left (Error Parse msg) | err == msg -> True
        _ -> False

parserSuccessTests :: [(BSC.ByteString, String)]
parserSuccessTests =
  [ ( "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1",
      "((((1 + 2) - (3 * 4)) + ((5 / 6) / 0)) + 1)"
    ),
    ( "1+2-3*4+5/6/0+1",
      "((((1 + 2) - (3 * 4)) + ((5 / 6) / 0)) + 1)"
    ),
    ( "1 + -1",
      "(1 + -1)"
    ),
    ( "let x = 4 in x + 1",
      "(let x = 4 in (x + 1))"
    ),
    ( "let x=4in x+1",
      "(let x = 4 in (x + 1))"
    ),
    ( "let x = 4 in let y = 5 in x + y",
      "(let x = 4 in (let y = 5 in (x + y)))"
    ),
    ( "let x = 4 in let y = 5 in x + let z = y in z * z",
      "(let x = 4 in (let y = 5 in (x + (let z = y in (z * z)))))"
    ),
    ( "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z",
      "(let x = 4 in ((let y = 5 in (x + 1)) + (let z = 2 in (z * z))))"
    ),
    ( "let x=4in 2+let y=x-5in x+let z=y+1in z/2",
      "(let x = 4 in (2 + (let y = (x - 5) in (x + (let z = (y + 1) in (z / 2))))))"
    ),
    ( "let x = (let y = 3 in y + y) in x * 3",
      "(let x = (let y = 3 in (y + y)) in (x * 3))"
    ),
    ( "let x = let y = 3 in y + y in x * 3",
      "(let x = (let y = 3 in (y + y)) in (x * 3))"
    ),
    ( "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3",
      "(let x = (let y = (1 + (let z = 2 in (z * z))) in (y + 1)) in (x * 3))"
    )
  ]

parserErrorTests :: [(BSC.ByteString, String)]
parserErrorTests =
  [ ("", "Not enough input"),
    ("1 +", "Leftover input: \"+\""),
    ("1 & 1", "Leftover input: \"& 1\""),
    ("1 + 1 & 1", "Leftover input: \"& 1\""),
    ("1 & 1 + 1", "Leftover input: \"& 1 + 1\""),
    ("(", "Not enough input"),
    ("(1", "Expected: ')', got: end-of-input"),
    ("(1 + ", "Expected: ')', got: \"+\""),
    ("(1 + 2", "Expected: ')', got: end-of-input"),
    ("(1 + 2}", "Expected: ')', got: \"}\""),
    ("66666", "Expected a valid Int16, got: 66666"),
    ("-x", "Expected: number, got: \"-x\""),
    ("let 1", "Expected: identifier, got: \"1\""),
    ("let x = 1 in ", "Not enough input"),
    ( "let let = 1 in 1",
      "Expected identifier, got: \"let\", which is a reversed keyword"
    ),
    ( "let x = 1 in in",
      "Expected identifier, got: \"in\", which is a reversed keyword"
    ),
    ("let x=1 inx", "Expected: space, got: \"x\""),
    ("letx = 1 in x", "Leftover input: \"= 1 in x\""),
    ("let x ~ 1 in x", "Expected: \"=\", got: \"~\""),
    ("let x = 1 & 2 in x", "Expected: \"in\", got: \"&\""),
    ("let x = 1 inx", "Expected: space, got: \"x\""),
    ("let x = 1 in x +", "Leftover input: \"+\""),
    ("let x = 1 in x in", "Leftover input: \"in\""),
    ("let x = let x = 1 in x", "Expected: \"in\", got: end-of-input")
  ]
ArithVMSpec.hs

We have a bunch of tests for the parser, testing both success and failure cases. Notice how spaces are treated in the expressions. Also notice how the let expressions are parsed. We’ll add property-based tests for the parser in the next post.

There is not much we can do with the parsed ASTs at this point. Let’s write an interpreter to evaluate our ASTs.

The AST Interpreter

The AST interpreter is a standard and short recursive interpreter with an environment mapping variables to their values:

interpretAST :: Expr -> Result Int16
interpretAST = go Map.empty
  where
    go env = \case
      Num n -> pure n
      Var x -> case Map.lookup x env of
        Just v -> pure v
        Nothing -> throwError . Error InterpretAST $
          "Unknown variable: " <> BSC.unpack (unIdent x)
      BinOp op a b -> do
        !a' <- go env a
        !b' <- go env b
        interpretOp InterpretAST a' b' op
      Let x assign body -> do
        !val <- go env assign
        go (Map.insert x val env) body

interpretOp :: (MonadError Error m) => Pass -> Int16 -> Int16 -> Op -> m Int16
interpretOp pass a b = \case
  Add -> pure $! a + b
  Sub -> pure $! a - b
  Mul -> pure $! a * b
  Div | b == 0 -> throwError $ Error pass "Division by zero"
  Div | b == (-1) && a == minBound -> throwError $ Error pass "Arithmetic overflow"
  Div -> pure $! a `div` b
{-# INLINE interpretOp #-}
ArithVMLib.hs

This interpreter serves both as a performance baseline for the bytecode VM we write later, and as a definitional interpreter for testing the VM. We extract the interpretOp helper function for later reuse10. interpretOp is careful to detect division-by-zero and arithmetic overflow errors (-32768 `div` -1 would be 32768, which does not fit in an Int16), but we ignore the possible integer overflows/underflows that the other arithmetic operations may cause.
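Here is a quick sketch of the interpreter in action (with OverloadedStrings, chaining the two stages in the Result monad):

parse "let x = 4 in x + 1" >>= interpretAST  -- Right 5
parse "-32768 / -1" >>= interpretAST         -- fails with "Arithmetic overflow"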

Testing the Interpreter

We write some unit tests for the interpreter following the same pattern as the parser:

astInterpreterSpec :: Spec
astInterpreterSpec = describe "AST interpreter" $ do
  forM_ astInterpreterSuccessTests $ \(input, result) ->
    it ("interprets: \"" <> BSC.unpack input <> "\"") $ do
      parseInterpret input `shouldBe` Right result

  forM_ astInterpreterErrorTests $ \(input, err) ->
    it ("fails for: \"" <> BSC.unpack input <> "\"") $ do
      parseInterpret input `shouldSatisfy` \case
        Left (Error InterpretAST msg) | err == msg -> True
        _ -> False
  where
    parseInterpret = parse >=> interpretAST

astInterpreterSuccessTests :: [(BSC.ByteString, Int16)]
astInterpreterSuccessTests =
  [ ("1", 1),
    ("1 + 2 - 3 * 4 + 5 / 6 / 1 + 1", -8),
    ("1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)", -3),
    ("1 + -1", 0),
    ("1 * -1", -1),
    ("let x = 4 in x + 1", 5),
    ("let x = 4 in let x = x + 1 in x + 2", 7),
    ("let x = 4 in let y = 5 in x + y", 9),
    ("let x = 4 in let y = 5 in x + let z = y in z * z", 29),
    ("let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z", 13),
    ("let x = let y = 3 in y + y in x * 3", 18),
    ("let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3", 18)
  ]

astInterpreterErrorTests :: [(BSC.ByteString, String)]
astInterpreterErrorTests =
  [ ("x", "Unknown variable: x"),
    ("let x = 4 in y + 1", "Unknown variable: y"),
    ("let x = y + 1 in x", "Unknown variable: y"),
    ("let x = x + 1 in x", "Unknown variable: x"),
    ("1/0", "Division by zero"),
    ("-32768 / -1", "Arithmetic overflow")
  ]
ArithVMSpec.hs

Now, we can run the parser and interpreter tests to make sure that everything works correctly.

main :: IO ()
main = hspec $ do
  parserSpec
  astInterpreterSpec
ArithVMSpec.hs
Output of the test run
$ cabal test -O2
Running 1 test suites...
Test suite specs: RUNNING...

Parser
  parses: "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1" [✔]
  parses: "1+2-3*4+5/6/0+1" [✔]
  parses: "1 + -1" [✔]
  parses: "let x = 4 in x + 1" [✔]
  parses: "let x=4in x+1" [✔]
  parses: "let x = 4 in let y = 5 in x + y" [✔]
  parses: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
  parses: "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z" [✔]
  parses: "let x=4in 2+let y=x-5in x+let z=y+1in z/2" [✔]
  parses: "let x = (let y = 3 in y + y) in x * 3" [✔]
  parses: "let x = let y = 3 in y + y in x * 3" [✔]
  parses: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
  fails for: "" [✔]
  fails for: "1 +" [✔]
  fails for: "1 & 1" [✔]
  fails for: "1 + 1 & 1" [✔]
  fails for: "1 & 1 + 1" [✔]
  fails for: "(" [✔]
  fails for: "(1" [✔]
  fails for: "(1 + " [✔]
  fails for: "(1 + 2" [✔]
  fails for: "(1 + 2}" [✔]
  fails for: "66666" [✔]
  fails for: "-x" [✔]
  fails for: "let 1" [✔]
  fails for: "let x = 1 in " [✔]
  fails for: "let let = 1 in 1" [✔]
  fails for: "let x = 1 in in" [✔]
  fails for: "let x=1 inx" [✔]
  fails for: "letx = 1 in x" [✔]
  fails for: "let x ~ 1 in x" [✔]
  fails for: "let x = 1 & 2 in x" [✔]
  fails for: "let x = 1 inx" [✔]
  fails for: "let x = 1 in x +" [✔]
  fails for: "let x = 1 in x in" [✔]
  fails for: "let x = let x = 1 in x" [✔]
AST interpreter
  interprets: "1" [✔]
  interprets: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
  interprets: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
  interprets: "1 + -1" [✔]
  interprets: "1 * -1" [✔]
  interprets: "let x = 4 in x + 1" [✔]
  interprets: "let x = 4 in let x = x + 1 in x + 2" [✔]
  interprets: "let x = 4 in let y = 5 in x + y" [✔]
  interprets: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
  interprets: "let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z" [✔]
  interprets: "let x = let y = 3 in y + y in x * 3" [✔]
  interprets: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
  fails for: "x" [✔]
  fails for: "let x = 4 in y + 1" [✔]
  fails for: "let x = y + 1 in x" [✔]
  fails for: "let x = x + 1 in x" [✔]
  fails for: "1/0" [✔]
  fails for: "-32768 / -1" [✔]

Finished in 0.0058 seconds
54 examples, 0 failures
Test suite specs: PASS

Awesome, it works! That’s it for this post. Let’s update our checklist:

In the next part, we write a bytecode compiler for our expression AST.


  1. Variables are scoped to the body of the let expressions they are introduced in, that is, our language has lexical scoping. Also, variables with same name in inner lets shadow the variables in outer lets.↩︎

  2. If you are wondering why we do this at all, when we could directly evaluate the expressions while parsing: I think this is a great little project to learn how to write performant bytecode compilers and VMs in Haskell.↩︎

  3. Bangs (!) that enforce strictness are placed in the Expr ADT (and also in the later code) at the right positions that provide performance benefits. The right positions were found by profiling the program. A bang placed at a wrong position (for example in front of Expr inside BinOp) may ruin the compiler provided optimizations and make the overall program slower.↩︎

  4. attoparsec is very fast, but there are faster parsing libraries in Haskell. On the other hand, attoparsec does not provide great error messages. If user experience were a higher priority, I’d use the megaparsec library. I find attoparsec to have the right balance of performance, developer experience, and user experience. Parsers handwritten from scratch could be faster, but they’d be harder to maintain and use.↩︎

  5. I wrote the first version of the parser using the ReadP library that comes with Haskell standard library. I rewrote it to use attoparsec and found that the rewritten parser was more than 10x faster.↩︎

  6. You don’t need to think about the bytecode size of expressions right now. It’ll become clear when we go over compilation in the next post.↩︎

  7. Certain functions such as chainBinOps are inlined using the INLINE pragma to improve the program performance. The functions to inline were chosen by profiling.↩︎

  8. Since the numbers need to be encoded into bytes when we compile to bytecode, we need to choose some encoding for them. For simpler code, we choose 2-byte integers.↩︎

  9. Testing your parsers is crucial because they are your programming language’s interface to its users, and because writing (fast) parsers is difficult and error-prone. Most of the bugs I found in this program were in the parser.↩︎

  10. Again, notice the carefully placed bangs to enforce strictness. Try to figure out why they are placed at some places and not at others.↩︎

If you liked this post, please leave a comment.

by Abhinav Sarkar (abhinav@abhinavsarkar.net) at August 02, 2025 12:00 AM

August 01, 2025

Lysxia's blog

Twentyseven 1.0.0

Twelve years of Haskell

Twentyseven is a Rubik’s cube solver and one of my earliest projects in Haskell. The first commit dates from January 2014, and version 0.0.0 was uploaded on Hackage in March 2016.

I first heard of Haskell in a course on lambda calculus in 2013. A programming language with lazy evaluation sounded like a crazy idea, so I gave it a try. Since then, I have kept writing in Haskell as my favorite language. For me it is the ideal blend of programming and math. And a Rubik’s cube solver is a great excuse for doing group theory.

Twentyseven 1.0.0 is more of a commemorative release for myself, with the goal of making it compile with the current version of GHC (9.12). There was surprisingly little breakage:

  1. Semigroup has become a superclass of Monoid
  2. A breaking change in the Template Haskell AST

Aside from that, the code is basically just as it was 9 years ago, including design decisions that I would find questionable today. For example, I use unsafePerformIO to read precomputed tables into top-level constants, but the location of the files to read from can be configured by command-line arguments, so I better make sure that the tables are not forced before the location is set…
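For the curious, the pattern looks roughly like this (a hypothetical sketch, not Twentyseven’s actual code; tableDir and cornerTable are made-up names):

import Data.IORef (IORef, newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- The configurable location, set once from the command-line arguments.
{-# NOINLINE tableDir #-}
tableDir :: IORef FilePath
tableDir = unsafePerformIO (newIORef "tables")

-- A top-level "constant" that actually performs IO the first time it is
-- forced. If it is forced before tableDir has been updated, it silently
-- reads from the wrong location.
{-# NOINLINE cornerTable #-}
cornerTable :: [Int]
cornerTable = unsafePerformIO $ do
  dir <- readIORef tableDir
  map read . lines <$> readFile (dir ++ "/corners.txt")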

How Twentyseven works

The input of the program is a string enumerating the 54 facelets of a Rubik’s cube; each character represents one color.

DDDFUDLRB FUFDLLLRR UBLBFDFUD ULBFRULLB RRRLBBRUB UBFFDFDRU

The facelets follow the order pictured below. They are grouped by faces (up, left, front, right, back, down), and in each face they are listed in top-down, left-right order.

                  00 01 02
                  03 04 05
                  06 07 08

        10 11 12  20 21 22  30 31 32  40 41 42
        13 14 15  23 24 25  33 34 35  43 44 45
        16 17 18  26 27 28  36 37 38  46 47 48

                  50 51 52
                  53 54 55
                  56 57 58

The output is a sequence of moves to solve that cube.

U L B' L R2 D R U2 F U2 L2 B2 U B2 D' B2 U' R2 U L2 R2 U

The implementation of Twentyseven is based on Herbert Kociemba’s notes about Cube Explorer, a program written in Pascal!

The search algorithm is iterative deepening A*, or IDA*. Like A*, IDA* finds the shortest path between two vertices in a graph. A conventional A* is not feasible because the state space of a Rubik’s cube is massive (43 252 003 274 489 856 000 states, literally billions of billions). Instead, we run a series of depth-first searches with a maximum allowed number of moves that increases for each search. As it is based on depth-first search, IDA* only needs memory for the current path, which is super cheap.

IDA* relies on an estimate of the number of moves remaining to reach the solved state. We obtain such an estimate by projecting the Rubik’s cube state into a simpler puzzle. For example, we can consider only the permutation of corners, ignoring their orientation. We can pre-compute a table mapping each corner permutation (there are 8! = 40320) to the minimum number of moves to put the corners back to their location. This is a lower bound on the number of moves to actually solve a Rubik’s cube. Different projections yield different lower bounds (for example, by looking at the permutation of edges instead, or their orientation), and we can combine lower bounds into their maximum, yielding a more precise lower bound, and thus a faster IDA*.
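To make the search concrete, here is a minimal, hypothetical sketch of IDA* in Haskell (not Twentyseven’s actual code, and omitting refinements such as not immediately undoing the previous move). Here solved is the goal test, moves enumerates labelled successor states, and estimate is an admissible lower bound, such as the maximum over several pattern-table projections:

idaStar :: (s -> Bool) -> (s -> [(m, s)]) -> (s -> Int) -> s -> Maybe [m]
idaStar solved moves estimate start = loop (estimate start)
  where
    loop bound = case search start 0 bound of
      Right path     -> Just path
      Left Nothing   -> Nothing     -- no f-value exceeded the bound: unsolvable
      Left (Just b') -> loop b'     -- deepen to the smallest exceeded f-value

    -- Depth-first search capped at bound: returns a path on success, or the
    -- smallest f-value that exceeded the bound, to use as the next bound.
    search s g bound
      | f > bound = Left (Just f)
      | solved s  = Right []
      | otherwise = foldr step (Left Nothing) (moves s)
      where
        f = g + estimate s
        step (m, s') rest = case search s' (g + 1) bound of
          Right path -> Right (m : path)
          Left b     -> case rest of
            Right path -> Right path
            Left b'    -> Left (minMaybe b b')

    minMaybe Nothing mb        = mb
    minMaybe ma Nothing        = ma
    minMaybe (Just a) (Just b) = Just (min a b)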

Putting all that together, we obtain an optimal solver for Rubik’s cubes. But even with these heuristics, Twentyseven can take hours to solve a random cube optimally. Kociemba’s Cube Explorer is apparently much faster (I’ve never tried it myself). My guess is that the difference is due to a better selection of projections, yielding better heuristics. But I haven’t gotten around to figuring out whether I’ve misinterpreted his notes or whether those improvements can only be found in the code.

A faster alternative is Kociemba’s two phase algorithm. It is suboptimal, but it solves Rubik’s cubes in a fraction of a second (1000 cubes per minute). The first phase puts cubies into a “common orientation” and “separates” the edges into two groups. In other words, we reach a state where the permutation of 12 edges can be decomposed into two disjoint permutations of 4 and 8 edges respectively. In the second phase, we restrict the possible moves: quarter- and half-turns on the top and bottom faces, half-turns only on the other faces. These restricted moves preserve the “common orientation” of edges and corners from phase 1, and the edges in the middle slice stay in their slice. Each phase thus performs an IDA* search in a much smaller space than the full Rubik’s cube state space (2 217 093 120 and 19 508 428 800 states respectively).

by Lysxia at August 01, 2025 12:00 AM

July 31, 2025

Tweag I/O

Integrating Nix and Buck2

Buck2 is a new open source build system developed by Meta (Facebook), which we already looked at in some depth; see A Tour Around Buck2, Meta’s New Build System. Since then, Buck2 has gained significant improvements in user experience and language support, making it an increasingly attractive option in the build systems space.

At Tweag, we adhere to high standards for reproducible builds, which Buck2 doesn’t fully uphold in its vanilla configuration. In this post, we will introduce our ruleset that provides integration with Nix. I’ll demonstrate how it can be used, and you will gain insights into how to leverage Nix to achieve more reliable and reproducible builds with Buck2.

Reproducibility, anyone?

In short, Buck2 is a fast, polyglot build tool very similar to Bazel. Notably, it also provides fine-grained distributed caching and even speaks (in its open source variant) the same remote caching and execution protocols used by Bazel. This means you’re able to utilize the same Bazel services available for caching and remote execution.

However, in contrast to Bazel, Buck2 uses a remote-first approach and does not restrict build actions using a sandbox on the local machine. As a result, build actions can be non-hermetic, meaning their outcome might depend on what files or programs happen to be present on the local machine. This lack of hermeticity can lead to non-reproducible builds, which is a critical concern for the effective caching of build artifacts.

Non-hermeticity issues can be elusive, often surfacing unexpectedly for new developers, which affects the onboarding of new team members or open source contributors. If left undetected, they can even cause problems down the line in production, which is why we think reproducible builds are important!

Achieving Reproducibility with Nix

If we want reproducible builds, we must not rely on anything installed on the local machine. We need to precisely control every compiler and build tool used in our project. Although defining each and every one of these inside the Buck2 build itself is possible, it would also be a lot of work. Nix offers a solution to this problem.

Nix is a package manager and build system for Linux and Unix-like operating systems. With nixpkgs, there is a very large and comprehensive collection of software packaged using Nix, which is extensible and can be adapted to one’s needs. Most importantly, Nix already strictly enforces hermeticity for its package builds and the nixpkgs collection goes to great lengths to achieve reproducible builds.

So, using Nix to provide compilers and build tools for Buck2 is a way to benefit from that preexisting work and introduce hermetic toolchains into a Buck2 build.

Let’s first quickly look into the Nix setup and proceed with how we can integrate it into Buck2 later.

Nix with flakes

After installing Nix, the nix command is available, and we can start declaring dependencies on packages from nixpkgs in a nix file. The Nix tool uses the Nix language, a domain-specific, purely functional and lazily evaluated programming language to define packages and declare dependencies. The language has some wrinkles, but don’t worry; we’ll only use basic expressions without delving into the more advanced concepts.

For example, here is a simple flake.nix which provides the Rust compiler as a package output:

{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
  };
  outputs = { self, nixpkgs }:
    {
      packages = {
        aarch64-darwin.rustc = nixpkgs.legacyPackages.aarch64-darwin.rustc;
        x86_64-linux.rustc = nixpkgs.legacyPackages.x86_64-linux.rustc;
      };
    };
}

Note: While flakes have been widely used for a long time, the feature still needs to be enabled explicitly by setting extra-experimental-features = nix-command flakes in the configuration. See the wiki for more information.

In essence, a Nix flake is a Nix expression following a specific schema. It defines its inputs (usually other flakes) and outputs (e.g. packages) which depend on the inputs. In this example the rustc package from nixpkgs is re-used for the output of this flake, but more complex expressions could be used just as well.

Inspecting this flake shows the following output:

$ nix flake show --all-systems
path:/source/project?lastModified=1745857313&narHash=sha256-e1sxfj1DZbRjhHWF7xfiI3wc1BpyqWQ3nLvXBKDya%2Bg%3D
└───packages
    ├───aarch64-darwin
    │   └───rustc: package 'rustc-wrapper-1.86.0'
    └───x86_64-linux
        └───rustc: package 'rustc-wrapper-1.86.0'

In order to build the rustc package output, we can call Nix in the directory of the flake.nix file like this: nix build '.#rustc'. This will either fetch pre-built artifacts of this package from a binary cache if available, or directly build the package if not. The result is the same in both cases: the rustc package output will be available in the local nix store, and from there it can be used just like other software on the system.

$ nix build --print-out-paths '.#rustc'
/nix/store/ssid482a107q5vw18l9millwnpp4rgxb-rustc-wrapper-1.86.0-man
/nix/store/szc39h0qqfs4fvvln0c59pz99q90zzdn-rustc-wrapper-1.86.0

The output displayed above illustrates that a Nix build of a single package can produce multiple outputs. In this case the rustc package was split into a default output and an additional, separate output for the man pages.

The default output contains the main binaries such as the Rust compiler:

$ /nix/store/szc39h0qqfs4fvvln0c59pz99q90zzdn-rustc-wrapper-1.86.0/bin/rustc --version
rustc 1.86.0 (05f9846f8 2025-03-31) (built from a source tarball)

It is also important to note that the output of a Nix package depends on the specific nixpkgs revision stored in the flake.lock file, rather than any changes in the local environment. This ensures that each developer checking out the project at any point in time will receive the exact same (reproducible) output no matter what.

Using Buck2

As part of our work for Mercury, a company providing financial services, we developed rules for Buck2 which can be used to integrate packages provided by a nix flake as part of a project’s build. Recently, we have been able to publish these rules, called buck2.nix, as open source under the Apache 2 license.

To use these rules, you need to make them available in your project first. Add the following configuration to your .buckconfig:

[cells]
  nix = none

[external_cells]
  nix = git

[external_cell_nix]
  git_origin = https://github.com/tweag/buck2.nix.git
  commit_hash = accae8c8924b3b51788d0fbd6ac90049cdf4f45a # change to use a different version

This configures a cell called nix to be fetched from the specified repository on GitHub. Once set up, you can refer to that cell in your BUCK files and load rules from it.

Note: for clarity, I am going to indicate the file name in the topmost comment of a code block when it is not obvious from the context already.

To utilize a Nix package from Buck2, we need to introduce a new target that runs nix build inside of a build action, producing a symbolic link to the nix store path as the build output. Here is how to do that using buck2.nix:

# BUCK

load("@nix//flake.bzl", "flake")

flake.package(
    name = "rustc",
    binary = "rustc",
    path = "nix", # path to a nix flake
    package = "rustc", # which package to build, default is the value of the `name` attribute
    output = "out", # which output to build, this is the default
)

Note: this assumes the flake.nix and accompanying flake.lock files are found alongside the BUCK file in the nix subdirectory

With this build file in place, a new target called rustc is made available, which builds the out output of the rustc package from the given flake. This target can be used as a dependency of other rules in order to generate an output artifact:

# BUCK

genrule(
   name = "rust-info",
   out = "rust-info.txt",
   cmd = "$(exe :rustc) --version > ${OUT}"
)

Note: Buck2 supports expanding references in string parameters using macros, such as the $(exe ) part in the cmd parameter above, which expands to the path of the executable output of the :rustc target

Using Buck2 (from nixpkgs of course!) to build the rust-info target yields:

$ nix run nixpkgs#buck2 -- build --show-simple-output :rust-info
Build ID: f3fec86b-b79f-4d8e-80c7-acea297d4a64
Loading targets.   Remaining     0/10                                                                                    24 dirs read, 97 targets declared
Analyzing targets. Remaining     0/20                                                                                    5 actions, 5 artifacts declared
Executing actions. Remaining     0/5                                                                                     9.6s exec time total
Command: build.    Finished 2 local
Time elapsed: 10.5s
BUILD SUCCEEDED
buck-out/v2/gen/root/904931f735703749/__rust-info__/out/rust-info.txt

$ cat buck-out/v2/gen/root/904931f735703749/__rust-info__/out/rust-info.txt
rustc 1.86.0 (05f9846f8 2025-03-31) (built from a source tarball)

For this one-off command we just ran buck2 from the nixpkgs flake on the current system. This is nice for illustration, but it is also not reproducible, and you’ll probably end up with a different Buck2 version when you try this on your machine.

In order to provide the same Buck2 version consistently, let’s add another Nix flake to our project:

# flake.nix

{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
  };
  outputs = { self, nixpkgs }:
    {
      devShells.aarch64-darwin.default =
        nixpkgs.legacyPackages.aarch64-darwin.mkShellNoCC {
          name = "buck2-shell";
          packages = [ nixpkgs.legacyPackages.aarch64-darwin.buck2 ];
        };

      devShells.x86_64-linux.default =
        nixpkgs.legacyPackages.x86_64-linux.mkShellNoCC {
          name = "buck2-shell";
          packages = [ nixpkgs.legacyPackages.x86_64-linux.buck2 ];
        };
    };

  nixConfig.bash-prompt = "(nix) \\$ "; # visual clue if inside the shell
}

This flake defines a default development environment, or dev shell for short. It uses the mkShellNoCC function from nixpkgs which creates an environment where the programs from the given packages are available in PATH.

After entering the shell by running nix develop in the directory of the flake.nix file, the buck2 command has the exact same version for everyone working on the project as long as the committed flake.lock file is not changed. For convenience, consider using direnv which automates entering the dev shell as soon as changing into the project directory.

Hello Rust

With all of that in place, let’s have a look at how to build something more interesting, like a Rust project.

Similar to the genrule above, it would be possible to define custom rules utilizing the :rustc target to compile real-world Rust projects. However, Buck2 already ships with rules for various languages in its prelude, including rules to build Rust libraries and binaries.

In a default project setup with Rust, these rules would simply use whatever Rust compiler is installed on the system, which may cause build failures due to version mismatches.

To avoid this non-hermeticity, we’re going to instruct the Buck2 rules to use our pinned Rust version from nixpkgs.

Let’s start by preparing such a default setup for the infamous “hello world” example in Rust:

# src/hello.rs

fn main() {
    println!("Hello, world!");
}
# src/BUCK

rust_binary(
    name = "hello",
    srcs = ["hello.rs"],
)

Toolchains

What’s left to do to make these actually work is to provide a Rust toolchain. In this context, a toolchain is a configuration that specifies a set of tools for building a project, such as the compiler, the linker, and various command-line tools. In this way, toolchains are decoupled from the actual rule definitions and can be easily changed to suit one’s needs.

In Buck2, toolchains are expected to be available in the toolchains cell under a specific name. Conventionally, the toolchains cell is located in the toolchains directory of a project. For example, all the Rust rules depend on the target toolchains//:rust which is defined in toolchains/BUCK and must provide Rust specific toolchain information.

Luckily, we do not need to define a toolchain rule ourselves but can re-use the nix_rust_toolchain rule from buck2.nix:

# toolchains/BUCK

load("@nix//toolchains:rust.bzl", "nix_rust_toolchain")

flake.package(
    name = "clippy",
    binary = "clippy-driver",
    path = "nix",
)

flake.package(
    name = "rustc",
    binaries = ["rustdoc"],
    binary = "rustc",
    path = "nix",
)

nix_rust_toolchain(
    name = "rust",
    clippy = ":clippy",
    default_edition = "2021",
    rustc = ":rustc",
    rustdoc = ":rustc[rustdoc]",
    visibility = ["PUBLIC"],
)

The rustc target is defined almost identically to before, but the nix_rust_toolchain rule also expects the rustdoc attribute to be present. In this case, the rustdoc binary is available from the rustc Nix package as well and can be referenced using the sub-target syntax :rustc[rustdoc], which refers to the corresponding item of the binaries attribute given to the flake.package rule.

Additionally, we need to pass in the clippy-driver binary, which is available from the clippy package in the nixpkgs collection. Thus, the flake.nix file needs to be changed by adding the clippy package outputs:

# toolchains/nix/flake.nix

{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
  };
  outputs =
    {
      self,
      nixpkgs,
    }:
    {
      packages = {
        aarch64-darwin.rustc = nixpkgs.legacyPackages.aarch64-darwin.rustc;
        aarch64-darwin.clippy = nixpkgs.legacyPackages.aarch64-darwin.clippy;
        x86_64-linux.rustc = nixpkgs.legacyPackages.x86_64-linux.rustc;
        x86_64-linux.clippy = nixpkgs.legacyPackages.x86_64-linux.clippy;
      };
    };
}

At this point we are able to successfully build and run the target src:hello:

(nix) $ buck2 run src:hello
Build ID: 530a4620-bfb2-454d-bae1-e937ae9e764f
Analyzing targets. Remaining     0/53                                                                                    75 actions, 101 artifacts declared
Executing actions. Remaining     0/11                                                                                    1.1s exec time total
Command: run.      Finished 3 local
Time elapsed: 0.7s
BUILD SUCCEEDED
Hello, world!

Building a real-world Rust project would be a bit more involved. Here is an interesting article on how one can do that using Bazel.

Note that buck2.nix currently also provides toolchain rules for C/C++ and Python. Have a look at the example project provided by buck2.nix, which you can directly use as a template to start your own project:

$ nix flake new --template github:tweag/buck2.nix my-project

A big thank you to Mercury for their support and for encouraging us to share these rules as open source! If you’re looking for a different toolchain or have other suggestions, feel free to open a new issue. Pull requests are very welcome, too!

If you’re interested in exploring a more tightly integrated solution, you might want to take a look at the buck2-nix project, which also provides Nix integration. Since it defines an alternative prelude that completely replaces Buck2’s built-in rules, we could not use it in our project but drew good inspiration from it.

Conclusion

With the setup shown, we saw that all you really need is Nix (pun intended1):

  • we provide the buck2 binary with Nix as part of a development environment
  • we leverage Nix inside Buck2 to provide build tools such as compilers, their required utilities and third-party libraries in a reproducible way

Consequently, onboarding new team members no longer means following seemingly endless and quickly outdated installation instructions. Installing Nix is easy; entering the dev shell is fast, and you’re up and running in no time!

And using Buck2 gives us fast, incremental builds by only building the minimal set of dependencies needed for a specific target.

Next time, I will delve into how we seamlessly integrated the Haskell toolchain and libraries from Nix, and how we made it fast as well.


  1. The name Nix is derived from the Dutch word niks, meaning nothing; build actions don’t see anything that hasn’t been explicitly declared as an input

July 31, 2025 12:00 AM

July 28, 2025

Monday Morning Haskell

Spiral Matrix: Another Matrix Layer Problem

In last week’s article, we learned how to rotate a 2D Matrix in place using Haskell’s mutable array mechanics. This taught us how to think about a Matrix in terms of layers, starting from the outside and moving in towards the center.

Today, we’ll study one more 2D Matrix problem that uses this layer-by-layer paradigm. For more practice dealing with multi-dimensional arrays, check out our Solve.hs course! In Module 2, you’ll study all kinds of different data structures in Haskell, including 2D Matrices (both mutable and immutable).

The Problem

Today’s problem is Spiral Matrix. In this problem, we receive a 2D Matrix, and we would like to return the elements of that matrix in a 1D list in “spiral order”. This ordering starts from the top left and goes right. When we hit the top right corner, we move down to the bottom. Then we come back across the bottom row to the left, and then back up the left column toward the top. Then we continue this process on the inner layers.

So, for example, let’s suppose we have this 4x4 matrix:

1   2  3  4
5   6  7  8
9  10 11 12
13 14 15 16

This should return the following list:

[1,2,3,4,8,12,16,15,14,13,9,5,6,7,11,10]

At first glance, it seems like a lot of our layer-by-layer mechanics from last week will work again. All the numbers in the “first” layer come first, followed by the “second” layer, and so on. The trick though is that for this problem, we have to handle non-square matrices. So we can also have this matrix:

1  2  3  4
5  6  7  8
9 10 11 12

This should yield the list [1,2,3,4,8,12,11,10,9,5,6,7]. This isn’t a huge challenge, but we need a slightly different approach.

The Algorithm

We still want to generally move through the Matrix using a layer-by-layer approach. But instead of tracking the 4 corner points, we’ll just keep track of 4 “barriers”, imaginary lines dictating the “end” of each dimension (up/down/left/right) for us to scan. These barriers will be inclusive, meaning that they refer to the last valid row or column in that direction. We’ll call these “min row”, “min column”, “max row” and “max column”.

Now the general process for going through a layer will consist of 4 steps. Each step starts in a corner location and proceeds in one direction until the next corner is reached. Then, we can start again with the next layer.

The trick is the end condition. Because we can have rectangular matrices, the final layer can have a shape like 1 x n or n x 1, and this is a problem, because we wouldn’t need 4 steps. Even a square n x n matrix with odd n would have a 1x1 as its final layer, and this is also a problem since it is unclear which “corner” this coordinate represents.

Thus we have to handle these edge cases. However, they are easy to both detect and resolve. We know we are in such a case when “min row” and “max row” are equal, or if “min column” and “max column” are equal. Then to resolve the case, we just do one pass instead of 4, including both endpoints.

Rust Solution

For our Rust solution, let’s start by defining important terms, like we always do. For our terms, we’ll mainly be dealing with these 4 “barrier” values, the min and max for the current row and column. These are inclusive, so they are initially 0 and (length - 1). We also make a new vector to hold our result values.

pub fn spiral_order(matrix: Vec<Vec<i32>>) -> Vec<i32> {
    let mut result: Vec<i32> = Vec::new();
    let mut minR: usize = 0;
    let mut maxR: usize = matrix.len() - 1;
    let mut minC: usize = 0;
    let mut maxC: usize = matrix[0].len() - 1;
    ...
}

Now we want to write a while loop where each iteration processes a single layer. We’ll know we are out of layers if either “minimum” exceeds its corresponding “maximum”. Then we can start penciling in the different cases and phases of the loop. The edge cases occur when a minimum is exactly equal to its maximum. And for the normal case, we’ll do our 4-directional scanning.

pub fn spiral_order(matrix: Vec<Vec<i32>>) -> Vec<i32> {
    let mut result: Vec<i32> = Vec::new();
    let mut minR: usize = 0;
    let mut maxR: usize = matrix.len() - 1;
    let mut minC: usize = 0;
    let mut maxC: usize = matrix[0].len() - 1;
    while (minR <= maxR && minC <= maxC) {
        // Edge cases: single row or single column layers
        if (minR == maxR) {
            ...
            break;
        } else if (minC == maxC) {
            ...
            break;
        }

        // Scan TL->TR
        ...
        // Scan TR->BR
        ...
        // Scan BR->BL
        ...
        // Scan BL->TL
        ...
        
        minR += 1;
        minC += 1;
        maxR -= 1;
        maxC -= 1;
    }
    return result;
}

Our “loop update” step comes at the end, when we increase both minimums, and decrease both maximums. This shows we are shrinking to the next layer.

Now we just have to fill in each case. All of these are scans through some portion of the matrix. The only trick is getting the ranges correct for each scan.

We’ll start with the edge cases. For a single row or column scan, we just need one loop. This loop should be inclusive across its dimension. Rust has a similar range syntax to Haskell, but it is less flexible. We can make a range inclusive by using = before the end element.

pub fn spiral_order(matrix: Vec<Vec<i32>>) -> Vec<i32> {
    ...
    while (minR <= maxR && minC <= maxC) {
        // Edge cases: single row or single column layers
        if (minR == maxR) {
            for i in minC..=maxC {
                result.push(matrix[minR][i]);
            }
            break;
        } else if (minC == maxC) {
            for i in minR..=maxR {
                result.push(matrix[i][minC]);
            }
            break;
        }
        ...
    }
    return result;
}

Now let’s fill in the other cases. Again, getting the right ranges is the most important factor. We also have to make sure we don’t mix up our dimensions or directions! We go right along minR, down along maxC, left along maxR, and then up along minC.

To represent a decreasing range, we have to make the corresponding incrementing range and then use .rev() to reverse it. This is a little inconvenient, giving us ranges that don’t look as nice, like for i in ((minC+1)..=maxC).rev(), because we want the decrementing range to include maxC but exclude minC.

pub fn spiral_order(matrix: Vec<Vec<i32>>) -> Vec<i32> {
    ...
    while (minR <= maxR && minC <= maxC) {
        ...
        // Scan TL->TR
        for i in minC..maxC {
            result.push(matrix[minR][i]);
        }
        // Scan TR->BR
        for i in minR..maxR {
            result.push(matrix[i][maxC]);
        }
        // Scan BR->BL
        for i in ((minC+1)..=maxC).rev() {
            result.push(matrix[maxR][i]);
        }
        // Scan BL->TL
        for i in ((minR+1)..=maxR).rev() {
            result.push(matrix[i][minC]);
        }
        minR += 1;
        minC += 1;
        maxR -= 1;
        maxC -= 1;
    }
    return result;
}

But once these cases are filled in, we’re done! Here’s the full solution:

pub fn spiral_order(matrix: Vec<Vec<i32>>) -> Vec<i32> {
    let mut result: Vec<i32> = Vec::new();
    let mut minR: usize = 0;
    let mut maxR: usize = matrix.len() - 1;
    let mut minC: usize = 0;
    let mut maxC: usize = matrix[0].len() - 1;
    while (minR <= maxR && minC <= maxC) {
        // Edge cases: single row or single column layers
        if (minR == maxR) {
            for i in minC..=maxC {
                result.push(matrix[minR][i]);
            }
            break;
        } else if (minC == maxC) {
            for i in minR..=maxR {
                result.push(matrix[i][minC]);
            }
            break;
        }
        // Scan TL->TR
        for i in minC..maxC {
            result.push(matrix[minR][i]);
        }
        // Scan TR->BR
        for i in minR..maxR {
            result.push(matrix[i][maxC]);
        }
        // Scan BR->BL
        for i in ((minC+1)..=maxC).rev() {
            result.push(matrix[maxR][i]);
        }
        // Scan BL->TL
        for i in ((minR+1)..=maxR).rev() {
            result.push(matrix[i][minC]);
        }
        minR += 1;
        minC += 1;
        maxR -= 1;
        maxC -= 1;
    }
    return result;
}

Haskell Solution

Now let’s write our Haskell solution. We don’t need any fancy mutation tricks here. Our function will just take a 2D array, and return a list of numbers.

spiralMatrix :: A.Array (Int, Int) Int -> [Int]
spiralMatrix arr = ...
  where
    ((minR', minC'), (maxR', maxC')) = A.bounds arr

Since we used a while loop in our Rust solution, it makes sense that we’ll want to use a raw recursive function that we’ll just call f. Our loop state was the 4 “barrier” values in each dimension. We’ll also use an accumulator value for our result. Since our barriers are inclusive, we can simply use the bounds of our array for the initial values.

spiralMatrix :: A.Array (Int, Int) Int -> [Int]
spiralMatrix arr = f minR' minC' maxR' maxC' []
  where
    ((minR', minC'), (maxR', maxC')) = A.bounds arr

    f :: Int -> Int -> Int -> Int -> [Int] -> [Int]
    f = undefined

This recursive function has 3 base cases. First, we have the “loop condition” we used in our Rust solution. If a min dimension value exceeds the max, we are done, and should return our accumulated result list.

Then the other two cases are our edge cases of having a single row or a single column for our final layer. In all these cases, we want to reverse the accumulated list. This means that when we put together our ranges, we want to be careful that they are in reverse order! So the edge cases should start at their max value and decrease to the min value (inclusive).

spiralMatrix :: A.Array (Int, Int) Int -> [Int]
spiralMatrix arr = f minR' minC' maxR' maxC' []
  where
    ((minR', minC'), (maxR', maxC')) = A.bounds arr

    f :: Int -> Int -> Int -> Int -> [Int] -> [Int]
    f minR minC maxR maxC acc
      | minR > maxR || minC > maxC = reverse acc
      | minR == maxR = reverse $ [arr A.! (minR, c) | c <- [maxC,maxC - 1..minC]] <> acc
      | minC == maxC = reverse $ [arr A.! (r, minC) | r <- [maxR,maxR - 1..minR]] <> acc
      | otherwise = ...

Now to fill in the otherwise case, we can do our 4 steps: going right from the top left, then going down from the top right, going left from the bottom right, and going up from the bottom left.

Like the edge cases, we make list comprehensions with ranges to pull the new numbers out of our input matrix. And again, we have to make sure we accumulate them in reverse order. Then we append all of them to the existing accumulation.

spiralMatrix :: A.Array (Int, Int) Int -> [Int]
spiralMatrix arr = f minR' minC' maxR' maxC' []
  where
    ((minR', minC'), (maxR', maxC')) = A.bounds arr

    f :: Int -> Int -> Int -> Int -> [Int] -> [Int]
    f minR minC maxR maxC acc
      ...
      | otherwise =
          let goRights = [arr A.! (minR, c) | c <- [maxC - 1, maxC - 2..minC]]
              goDowns = [arr A.! (r, maxC) | r <- [maxR - 1, maxR - 2..minR]]
              goLefts = [arr A.! (maxR, c) | c <- [minC + 1..maxC]]
              goUps = [arr A.! (r, minC) | r <- [minR+1..maxR]]
              acc' = goUps <> goLefts <> goDowns <> goRights <> acc
          in  f (minR + 1) (minC + 1) (maxR - 1) (maxC - 1) acc'

We conclude by making our recursive call with the updated result list, and shifting the barriers to get to the next layer.

Here’s the full implementation:

spiralMatrix :: A.Array (Int, Int) Int -> [Int]
spiralMatrix arr = f minR' minC' maxR' maxC' []
  where
    ((minR', minC'), (maxR', maxC')) = A.bounds arr

    f :: Int -> Int -> Int -> Int -> [Int] -> [Int]
    f minR minC maxR maxC acc
      | minR > maxR || minC > maxC = reverse acc
      | minR == maxR = reverse $ [arr A.! (minR, c) | c <- [maxC,maxC - 1..minC]] <> acc
      | minC == maxC = reverse $ [arr A.! (r, minC) | r <- [maxR,maxR - 1..minR]] <> acc
      | otherwise =
          let goRights = [arr A.! (minR, c) | c <- [maxC - 1, maxC - 2..minC]]
              goDowns = [arr A.! (r, maxC) | r <- [maxR - 1, maxR - 2..minR]]
              goLefts = [arr A.! (maxR, c) | c <- [minC + 1..maxC]]
              goUps = [arr A.! (r, minC) | r <- [minR+1..maxR]]
              acc' = goUps <> goLefts <> goDowns <> goRights <> acc
          in  f (minR + 1) (minC + 1) (maxR - 1) (maxC - 1) acc'
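
As a quick sanity check, here is a hypothetical main (assuming Data.Array is imported qualified as A) that runs spiralMatrix on the 3x4 example from earlier:

import qualified Data.Array as A

main :: IO ()
main = do
  let arr = A.listArray ((0, 0), (2, 3)) [1 .. 12] :: A.Array (Int, Int) Int
  print (spiralMatrix arr)
  -- prints [1,2,3,4,8,12,11,10,9,5,6,7]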

Conclusion

This is the last matrix-based problem we’ll study for now. Next time we’ll start considering some tree-based problems. If you sign up for our Solve.hs course, you’ll learn about both of these kinds of data structures in Module 2. You’ll implement a tree set from scratch, and you’ll get lots of practice working with these and many other structures. So enroll today!

by James Bowen at July 28, 2025 08:30 AM

GHC Developer Blog

GHC 9.10.3-rc1 is now available

GHC 9.10.3-rc1 is now available

wz1000 - 2025-07-28

The GHC developers are very pleased to announce the availability of the release candidate for GHC 9.10.3. Binary distributions, source distributions, and documentation are available at downloads.haskell.org and via GHCup.

GHC 9.10.3 is a bug-fix release fixing over 50 issues of a variety of severities and scopes. A full accounting of these fixes can be found in the release notes. As always, GHC’s release status, including planned future releases, can be found on the GHC Wiki’s status page.

This release candidate will have a two-week testing period. If all goes well the final release will be available the week of 11 August 2025.

We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable, Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors whose on-going financial and in-kind support has facilitated GHC maintenance and release management over the years. Finally, this release would not have been possible without the hundreds of open-source contributors whose work comprises this release.

As always, do give this release a try and open a ticket if you see anything amiss.

by ghc-devs at July 28, 2025 12:00 AM

July 24, 2025

Tweag I/O

Introduction to the new LaunchDarkly Svelte SDK

Feature flags reduce deployment risk, enable continuous delivery, and create controlled user experiences. As a Svelte enthusiast, I noticed the absence of official LaunchDarkly support for this growing framework, so I built the LaunchDarkly Svelte SDK to fill this gap. In this post, I’ll introduce the SDK and demonstrate how to implement it in a SvelteKit application.

Feature Flags in Frontend Development

Feature flags (or feature toggles) are runtime-controlled switches that let you enable or disable features without unnecessary deployments.

For example, imagine you are working on a new feature that requires significant changes to the UI. By using feature flags, you can deploy the changes to all environments but only enable the feature in specific ones (like development or UAT), or for a subset of users in a single environment (like users on a Pro subscription). This allows you to test the feature without exposing it to unintended users, reducing the risk of introducing bugs or breaking changes. And if things go bad, say a feature is not working as expected, you can easily disable it without having to roll back the entire deployment.

What is LaunchDarkly?

LaunchDarkly is a feature management platform that provides an easy and scalable way to wrap parts of your code (new features, UI elements, backend changes) in flags so they can be turned on/off without redeploying. It provides a user-friendly dashboard to manage and observe flags, and supports over a dozen SDKs for client/server platforms. In my experience, LaunchDarkly is easier to use — including for non-technical users — and more scalable than most home-grown feature flag solutions.

LaunchDarkly supports targeting and segmentation, so you can control which users see specific features based on things like a user’s location or subscription plan. It also offers solid tooling for running experiments, including A/B testing and progressive rollouts (where a new feature is released to users in stages, rather than all at once). All feature flags can be updated in real-time, meaning that there’s no need for users to refresh the page to see changes.

Those are just my favorites, but if you are interested in learning more about it, LaunchDarkly has a blog post with more information.

Flag Evaluations

LaunchDarkly flags have unique identifiers called flag keys that are defined in the LaunchDarkly dashboard. When you request a flag value, supported client-side SDKs (such as React, iOS, Android, or, now, Svelte) send the flag key along with user information (called the “context”) to LaunchDarkly. LaunchDarkly’s server computes the value of the flag using all the applicable rules (the rules are applied in order) and sends the result back to the app. This process is called flag evaluation. By default, LaunchDarkly uses streaming connections to update flags in real time. This lets you flip flags in the dashboard and see the effect almost instantly in your app.

Svelte in Brief

Svelte is a modern JavaScript framework that I’ve come to appreciate for its performance, simplicity, and excellent developer experience. What I particularly like about Svelte is that it lets you write reactive code directly using standard JavaScript variables, with an intuitive syntax that requires less boilerplate than traditional React or Vue applications. Reactive declarations and stores are built into the framework, so you don’t need Redux or similar external state management libraries for most use cases.

Svelte’s Approach

  • Superior Runtime Performance: Svelte doesn’t rely on virtual DOM. By eliminating the virtual DOM and directly manipulating the real DOM, Svelte can update the UI more quickly and efficiently, resulting in a more responsive application.
  • Faster Load Times: Svelte’s compilation process generates smaller JavaScript bundles and more efficient code, resulting in faster initial page load times compared to frameworks that ship runtime libraries to the browser.

A Simple Example of a Svelte Component

In this example, we define a SimpleCounter component that increments a count when a button is clicked. The count variable is reactive, meaning that any changes to it will automatically update the UI.

// SimpleCounter.svelte
<script lang="ts">
  let count = $state(0);
</script>

<button onclick={() => count++}>
  clicks: {count}
</button>

Now, we can use this component in our application which is in fact another Svelte component. For example: App.svelte:

// App.svelte
<script lang="ts">
  import SimpleCounter from './SimpleCounter.svelte';
</script>

<SimpleCounter />

After doing this, we can end up with something like this:

Simple Counter Demo

Overview of the LaunchDarkly Svelte SDK

Why Use a Dedicated Svelte SDK?

Although LaunchDarkly’s vanilla JavaScript SDK could be used in a Svelte application, this new SDK aligns better with Svelte’s reactivity model and integrates with Svelte-tailored components, allowing us to use LaunchDarkly’s features more idiomatically in our Svelte projects. I originally developed it as a standalone project and then contributed it upstream to be an official part of the LaunchDarkly SDK.

Introduction to LaunchDarkly Svelte SDK

Here are some basic steps to get started with the LaunchDarkly Svelte SDK:

1. Install the Package: First, install the SDK package in your project.

Note: Since the official LaunchDarkly Svelte SDK has not been released yet, for the purposes of this blog post, I’ve created a temporary package available on npm that contains the same code as the official repo. You can still check the official source code in LaunchDarkly’s official repository.

npm install @nosnibor89/svelte-client-sdk

2. Initialize the SDK: Next, you need to initialize the SDK with your LaunchDarkly client-side ID (you need a LaunchDarkly account). This is done using the LDProvider component, which provides the necessary context for feature flag evaluation. Here is an example of how to set it up:

<script lang="ts">
  import { LDProvider } from '@nosnibor89/svelte-client-sdk';
  import MyLayout from './MyLayout.svelte';
</script>

// Use context relevant to your application. More info in https://docs.launchdarkly.com/home/observability/contexts
const context = {
  user: {
    key: 'user-key',
  },
};

<LDProvider clientID="your-client-side-id" {context}>
  <MyLayout />
</LDProvider>

Let’s clarify the code above:

  1. Notice how I wrapped the MyLayout component with the LDProvider component. Usually, you will wrap a high-level component that encompasses most of your application with LDProvider, although it’s up to you and how you want to structure the app.
  2. You can also notice two parameters provided to our LDProvider. The "your-client-side-id" refers to the LaunchDarkly Client ID and the context object refers to the LaunchDarkly Context used to evaluate feature flags. This is necessary information we need to provide for the SDK to work correctly.

3. Evaluate a flag: The SDK provides the LDFlag component for evaluating your flag1. This component covers a common use case where you want to render different content based on the state of a feature flag. By default, LDFlag takes a boolean flag but can be extended to work with the other LaunchDarkly flag types as well.

<script lang="ts">
 import { LDFlag } from '@nosnibor89/svelte-client-sdk';
</script>

<LDFlag flag={'my-feature-flag'}>
  {#snippet on()}
    <p>renders if flag evaluates to true</p>
  {/snippet}
  {#snippet off()}
    <p>renders if flag evaluates to false</p>
  {/snippet}
</LDFlag>

In this example, the LDFlag component will render the content inside the on snippet2 if the feature flag my-feature-flag evaluates to true. If the flag evaluates to false, the content inside the off snippet will be rendered instead.

Building an application with SvelteKit

Now that we have seen the basics of how to use the LaunchDarkly Svelte SDK, let’s see how we can put everything together in a real application.

For the sake of brevity, I’ll be providing the key source code in this example, but if you are curious or need help, you can check out the full source code in Github.

How the app works

This is a simple ‘movies’ app where the main page displays a list of movies in a card format with a SearchBar component at the top. This search bar allows users to filter movies based on the text entered.

App Demo

The scenario we’re simulating is that Product Owners want to replace the traditional search bar with a new AI-powered assistant that helps users get information about specific movies. This creates a perfect use case for feature flags and can be described as follows:

Feature Flag Scenarios

  1. SearchBar vs AI Assistant: We’ll use a boolean feature flag to determine whether to display the classic SearchBar component or the new MoviesSmartAssistant3 component - simulating a simple all-at-once release.

  2. AI Model Selection: We’ll use a JSON feature flag to determine which AI model (GPT or Gemini) the MoviesSmartAssistant will use. This includes details about which model to use for specific users, along with display information like labels. This simulates a progressive rollout where Product Owners can gather insights on which model performs better.

Prerequisites

To follow along, you’ll need:

  1. A LaunchDarkly account
  2. A LaunchDarkly Client ID (Check this guide to get it)
  3. Two feature flags (see the creating new flags guide): a boolean flag (show-movie-smart-assistant) and a JSON flag (smart-assistant-config) looking like this:
    {
      "model": "gpt-4",
      "label": "Ask GPT-4 anything"
    }
  4. A SvelteKit4 application (create with npx sv create my-app)

Integrating the LaunchDarkly Svelte SDK

Creating the project scaffolds a SvelteKit application for you, meaning you should have a src directory where your application code resides. Inside this folder, you will find a routes directory, which is where SvelteKit handles routing. More specifically, there are two files, +layout.svelte and +page.svelte, which are the main files we are going to highlight in this post.

Setting up the layout

// src/routes/+layout.svelte
<script lang="ts">
  import "../app.css";
  import { LDProvider } from "@nosnibor89/svelte-client-sdk";
  import { PUBLIC_LD_CLIENT_ID } from '$env/static/public';
  import LoadingSpinner from "$lib/LoadingSpinner.svelte"; // Check source code in Github https://github.com/tweag/blog-resources/blob/master/launchdarkly-svelte-sdk-intro/src/lib/LoadingSpinner.svelte

  let { children } = $props();

  // randomly pick 0 or 1
  const orgId = Math.round(Math.random());

  const orgKey = `sdk-example-org-${orgId}`


  const ldContext = {
    kind: "org",
    key: orgKey,
  };

</script>

<LDProvider clientID={PUBLIC_LD_CLIENT_ID} context={ldContext}>
  {#snippet initializing()}
    <div class="...">
      <LoadingSpinner message={"Loading flags"}/>
    </div>
  {/snippet}

  {@render children()}
</LDProvider>

Let’s analyze this:

  1. We are importing the LDProvider component from the LaunchDarkly Svelte SDK and wrapping our layout with it. In SvelteKit, the layout will act as the entry point for our application, so it’s a good place for us to initialize the SDK allowing us to use other members of the SDK in pages or child components.
  2. We are also importing the PUBLIC_LD_CLIENT_ID variable from the environment variables. You can set this variable in your .env file at the root of the project (this is a SvelteKit feature).
  3. Another thing to notice is that we are using a LoadingSpinner component while the SDK is initializing. This is optional and is a good place to provide feedback to the user while the SDK is loading and feature flags are being evaluated for the first time. Also, don’t worry about the code for LoadingSpinner, you can find it in the source code on Github.

Add the movies page

At this point, we are ready to start evaluating flags, so let’s now go ahead and add our page where the SDK will help us accomplish scenarios 1 and 2.

Movies Page (SearchBar vs AI Assistant)

The movies page is the main and only page of our application. It displays a list of movies along with a search bar. This is the part where we will evaluate our first feature flag to switch between the SearchBar and the MoviesSmartAssistant components.

// src/routes/+page.svelte
<script lang="ts">
  // ...some imports hidden for brevity. Check source code on Github
  import SearchBar from "$lib/SearchBar.svelte";
  import MoviesSmartAssistant from "$lib/MoviesSmartAssistant.svelte";
  import { LD, LDFlag } from "@nosnibor89/svelte-client-sdk";

  let searchQuery = $state("");
  let prompt = $state("");
  const flagKey = "show-movie-smart-assistant";
  const flagValue = LD.watch(flagKey);
  flagValue.subscribe((value) => {
    // remove search query or prompt when flag changes
      searchQuery = "";
      prompt = "";
  });

  // ...rest of the code hidden for brevity. Check source code on Github
  // https://github.com/tweag/blog-resources/blob/master/launchdarkly-svelte-sdk-intro/src/routes/%2Bpage.svelte

</script>

<div class="...">
  <LDFlag flag={flagKey}>
    {#snippet on()}
      <MoviesSmartAssistant
        prompt={prompt}
        onChange={handlePromptChange}
        onSubmit={handleSendPrompt}
      />
    {/snippet}
    {#snippet off()}
      <SearchBar value={searchQuery} onSearch={handleSearch} />
    {/snippet}
  </LDFlag>

  <div
    class="..."
  >
    {#each filteredMovies as movie}
      <MovieCard {movie} />
    {/each}
  </div>
</div>

Again, let’s break this down:

  1. We are using the LDFlag component from the SDK. It will allow us to determine which component to render based on the state of the show-movie-smart-assistant feature flag. When the flag evaluates to true, the on snippet will run, meaning the MoviesSmartAssistant component will be rendered, and when the flag evaluates to false, the off snippet will run, meaning the SearchBar component will be rendered.
  2. We are also using the LD.watch function. This is useful when you need to get the state of a flag and keep track of it. In this case, we are simply resetting the search query or prompt so that the user can start fresh when the flag changes.
  3. The rest of the code you are not seeing is just functionality for the filtering mechanism and the rest of the presentational components. Remember you can find the code for those on Github.

MoviesSmartAssistant Component (AI Model Selection)

Whenever our MoviesSmartAssistant component is rendered, we want to check the value of the smart-assistant-config feature flag to determine which AI model to use for the assistant.

// src/lib/MoviesSmartAssistant.svelte
<script lang="ts">
  import { LD } from "@nosnibor89/svelte-client-sdk";
  import type { Readable } from "svelte/store";

  type MoviesSmartAssistantConfig = { model: string; label: string;};
  const smartAssistantConfig = LD.watch("smart-assistant-config") as Readable<MoviesSmartAssistantConfig>;
  // ... rest of the code hidden for brevity. Check source code on Github
  // https://github.com/tweag/blog-resources/blob/master/launchdarkly-svelte-sdk-intro/src/lib/MoviesSmartAssistant.svelte
</script>

<div class="...">
  <input
    type="text"
    placeholder={$smartAssistantConfig?.label ?? "Ask me anything..."}
    value={prompt}
    oninput={handleInput}
    class="..."
  />
  <button type="button" onclick={handleClick} aria-label="Submit">
    // ...svg code hidden for brevity
  </button>
</div>

As before, I’m hiding some code for brevity, but here are the key points:

  1. We are using the LD.watch method to watch for changes in the smart-assistant-config feature flag which contains information about the AI model. This will allow us to use the proper model for a given user based on the flag evaluation.
  2. Notice how the SDK understands it’s a JSON flag and returns a JavaScript object (with a little help5) as we defined in the LaunchDarkly dashboard.

Running the Application

Now that we have everything set up, let’s run the application. Here we are going to use the Client ID and set it as an environment variable.

PUBLIC_LD_CLIENT_ID={your_client_id} npm run dev

Open your browser and navigate to http://localhost:5173 (check your terminal as it may run at a different port). You should see the movies application with either the SearchBar or MoviesSmartAssistant component depending on your feature flag configuration.

Seeing Feature Flags in Action

If you were able to correctly set everything up, you should be able to interact with the application and LaunchDarkly Dashboard by toggling the feature flags and validating the behavior of the application.

I have included this demo video to show you how the application works and how the feature flags are being evaluated.

Conclusion

We just saw how to use the LaunchDarkly Svelte SDK and integrate it into a SvelteKit application using a realistic example. I hope this post gave you an understanding of the features the SDK provides, and also of what it still lacks in these early stages while awaiting the official release.

For now, my invitation to you is to try the SDK yourself and explore different use cases. For example, change the context with LD.identify to simulate users signing in to an application, or try a different flag type like a string or number flag. Also, stay tuned for updates on the official LaunchDarkly Svelte SDK release.


  1. LDFlag is a key component but there are other ways to evaluate a flag using the SDK.
  2. Snippets are a Svelte feature and can also be named slots. Check out https://svelte.dev/docs/svelte/snippet
  3. The MoviesSmartAssistant component is just a visual representation without actual AI functionality — my focus is on demonstrating how the LaunchDarkly Svelte SDK enables these feature flag implementations.
  4. SvelteKit is the official application framework for Svelte. It comes with out-of-the-box support for TypeScript, server-side rendering, and automatic routing through file-based organization.
  5. Ok, I’m also using TypeScript here to hint the type of the object returned by the LD.watch method. Maybe this is something to fix in the future.

July 24, 2025 12:00 AM

July 23, 2025

Well-Typed.Com

Pure parallelism (Haskell Unfolder #47)

Today, 2025-07-23, at 1830 UTC (11:30 am PDT, 2:30 pm EDT, 7:30 pm GMT, 20:30 CET, …) we are streaming the 47th episode of the Haskell Unfolder live on YouTube.

Pure parallelism (Haskell Unfolder #47)

“Pure parallelism” refers to the execution of pure Haskell functions on multiple CPU cores, (hopefully) speeding up the computation. Since we are still dealing with pure functions, however, we get none of the problems normally associated with concurrent execution: no non-determinism, no need for locks, etc. In this episode we will develop a pure but parallel implementation of linear regression. We will briefly recap how linear regression works, before discussing the two primitive functions that Haskell offers for pure parallelism: par and pseq.
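
As a small taste of the topic, here is a minimal sketch (not the episode’s code) that uses par and pseq from the parallel package to sum the two halves of a list in parallel:

import Control.Parallel (par, pseq)

-- Spark the first half's sum in parallel, force the second half's sum on
-- the current thread, then combine the two.
parSum :: [Int] -> Int
parSum xs = a `par` (b `pseq` (a + b))
  where
    (ys, zs) = splitAt (length xs `div` 2) xs
    a = sum ys
    b = sum zs

Compile with -threaded and run with +RTS -N to actually use multiple cores.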

About the Haskell Unfolder

The Haskell Unfolder is a YouTube series about all things Haskell hosted by Edsko de Vries and Andres Löh, with episodes appearing approximately every two weeks. All episodes are live-streamed, and we try to respond to audience questions. All episodes are also available as recordings afterwards.

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

There’s now also a web shop where you can buy t-shirts and mugs (and potentially in the future other items) with the Haskell Unfolder logo.

by andres, edsko at July 23, 2025 12:00 AM

July 21, 2025

Monday Morning Haskell

Image Rotation: Mutable Arrays in Haskell

In last week’s article, we took our first step into working with multi-dimensional arrays. Today, we’ll be working with another Matrix problem that involves in-place mutation. The Haskell solution uses the MArray interface, which takes us out of our usual purely functional territory.

The MArray interface is a little tricky to work with. If you want a full overview of the API, you should sign up for our Solve.hs course, where we cover mutable arrays in module 2!

The Problem

Today’s problem is Rotate Image. We’re going to take a 2D Matrix of integer values as our input and rotate the matrix 90 degrees clockwise. We must accomplish this in place, modifying the input value without allocating a new Matrix. The input matrix is always “square” (n x n).

Here are a few examples to illustrate the idea. We can start with a 2x2 matrix:

1  2   |   3  1
3  4   |   4  2

The 4x4 rotation makes it clearer that we’re not just moving numbers one space over. Each corner element will go to a new corner. You can also see how the inside of the matrix rotates:

1  2  3  4    |  13  9  5  1
5  6  7  8    |  14 10  6  2
9  10 11 12   |  15 11  7  3
13 14 15 16   |  16 12  8  4

The 3x3 version shows that with an odd number of rows and columns, the innermost number stays in place.

1  2  3   |   7  4  1
4  5  6   |   8  5  2
7  8  9   |   9  6  3

The Algorithm

While this problem might be a little intimidating at first, we just have to break it into sufficiently small and repeatable pieces. The core step is that we swap four numbers into each other’s positions. It’s easy to see, for example, that the four corners always trade places with one another (1, 4, 13, 16 in the 4x4 example).

What’s important is seeing the other sets of 4. We move clockwise to get the next 4 values:

  1. The value to the right of the top left corner
  2. The value below the top right corner
  3. The value to the left of the bottom right corner
  4. The value above the bottom left corner.

So in the 4x4 example, these would be 2, 8, 15, 9. Then another group is 3, 12, 14, 5.

Those 3 groups are all the rotations we need for the “outer layer”. Then we move to the next layer, where we have a single group of 4: 6, 7, 10, 11.

This should tell us that we have a 3-step process:

  1. Loop through each layer of the matrix
  2. Identify all groups of 4 in this layer
  3. Rotate each group of 4

It helps to put a count on the size of each of these loops. For an n x n matrix, the number of layers to rotate is n / 2, rounded down, because the innermost layer of an odd-sized matrix needs no rotation.

Then for a layer spanning from column c1 to c2, the number of groups in that layer is just c2 - c1. So for the first layer in a 4x4, we span columns 0 to 3, and there are 3 groups of 4. In the inner layer, we span columns 1 to 2, so there is only 1 group of 4.

Rust Solution

As is typical, we’ll see more of a loop structure in our Rust code, and a recursive version of this solution in Haskell. We’ll also start by defining various terms we’ll use. There are multiple ways to approach the details of this problem, but we’ll take an approach that maximizes the clarity of our inner loops.

We’ll define each “layer” using the four corner coordinates of that layer. So for an n x n matrix, these are (0,0), (0, n - 1), (n - 1, n - 1), (n - 1, 0). After we finish looping through a layer, we can simply increment/decrement each of these values as appropriate to get the corner coordinates of the next layer ((1,1), (1, n - 2), etc.).

So let’s start our solution by defining the 8 mutable values for these 4 corners. Each corner (top/left/bottom/right) has a row R and column C value.

pub fn rotate(matrix: &mut Vec<Vec<i32>>) {
    let n = matrix.len();
    let numLayers = n / 2;
    let mut topLeftR = 0;
    let mut topLeftC = 0;
    let mut topRightR = 0;
    let mut topRightC = n - 1;
    let mut bottomRightR = n - 1;
    let mut bottomRightC = n - 1;
    let mut bottomLeftR = n - 1;
    let mut bottomLeftC = 0;
    ...
}

It would be possible to solve the problem without these values, determining coordinates using the layer number. But I’ve found this to be somewhat more error prone, since we’re constantly adding and subtracting from different coordinates in different combinations. We get the number of layers from n / 2.

Now let’s frame the outer loop. We conclude the loop by modifying each coordinate point. Then at the beginning of the loop, we can determine the number of “groups” for the layer by taking the difference between the left and right column coordinates.

pub fn rotate(matrix: &mut Vec<Vec<i32>>) {
    ...
    for i in 0..numLayers {
        let numGroups = topRightC - topLeftC;

        for j in 0..numGroups {
            ...
        }

        topLeftR += 1;
        topLeftC += 1;
        topRightR += 1;
        topRightC -= 1;
        bottomRightR -= 1;
        bottomRightC -= 1;
        bottomLeftR -= 1;
        bottomLeftC += 1;
    }
}

Now we just need the logic for rotating a single group of 4 points. This is a 5-step process:

  1. Save top left value as temp
  2. Move bottom left to top left
  3. Move bottom right to bottom left
  4. Move top right to bottom right
  5. Move temp (original top left) to top right

Unlike the layer number, we’ll use the group variable j for arithmetic here. When you’re writing this yourself, it’s important to go slowly to make sure you’re using the right corner values and adding/subtracting j from the correct dimension.

pub fn rotate(matrix: &mut Vec<Vec<i32>>) {
    ...
    for i in 0..numLayers {
        let numGroups = topRightC - topLeftC;

        for j in 0..numGroups {
            let temp = matrix[topLeftR][topLeftC + j];
            matrix[topLeftR][topLeftC + j] = matrix[bottomLeftR - j][bottomLeftC];
            matrix[bottomLeftR - j][bottomLeftC] = matrix[bottomRightR][bottomRightC - j];
            matrix[bottomRightR][bottomRightC - j] = matrix[topRightR + j][topRightC];
            matrix[topRightR + j][topRightC] = temp;
        }

        ... // (update corners)
    }
}

And then we’re done! We don’t actually need to return a value since we’re just modifying the input in place. Here’s the full solution:

pub fn rotate(matrix: &mut Vec<Vec<i32>>) {
    let n = matrix.len();
    let numLayers = n / 2;
    let mut topLeftR = 0;
    let mut topLeftC = 0;
    let mut topRightR = 0;
    let mut topRightC = n - 1;
    let mut bottomRightR = n - 1;
    let mut bottomRightC = n - 1;
    let mut bottomLeftR = n - 1;
    let mut bottomLeftC = 0;
    for i in 0..numLayers {
        let numGroups = topRightC - topLeftC;

        for j in 0..numGroups {
            let temp = matrix[topLeftR][topLeftC + j];
            matrix[topLeftR][topLeftC + j] = matrix[bottomLeftR - j][bottomLeftC];
            matrix[bottomLeftR - j][bottomLeftC] = matrix[bottomRightR][bottomRightC - j];
            matrix[bottomRightR][bottomRightC - j] = matrix[topRightR + j][topRightC];
            matrix[topRightR + j][topRightC] = temp;
        }

        topLeftR += 1;
        topLeftC += 1;
        topRightR += 1;
        topRightC -= 1;
        bottomRightR -= 1;
        bottomRightC -= 1;
        bottomLeftR -= 1;
        bottomLeftC += 1;
    }
}

Haskell Solution

This is an interesting problem to solve in Haskell because Haskell is a generally immutable language. Unlike Rust, we can’t make values mutable just by putting the keyword mut in front of them.

With arrays, though, we can modify them in place using the MArray type class. We won’t go through all the details of the interface in this article (you can learn about all that in Solve.hs Module 2). But we’ll start with the type signature:

rotateImage :: (MArray array Int m) => array (Int, Int) Int -> m ()

This tells us we are taking a mutable array, where the array type is polymorphic but tied to the monad m. For example, IOArray would work with the IO monad. We don’t return anything, because we’re modifying our input.

We still begin our function by defining terms, but now we need to use monadic actions to retrieve even the bounds of our array.

rotateImage :: (MArray array Int m) => array (Int, Int) Int -> m ()
rotateImage arr = do
  ((minR, minC), (maxR, maxC)) <- getBounds arr
  let n = maxR - minR + 1
  let numLayers = n `quot` 2
  ...

Our algorithm has two loop levels. The outer loop goes through the different layers of the matrix. The inner loop goes through each group of 4 within the layer. In Haskell, both of these loops are recursive, monadic functions. Our Rust loops treat the four corner points of the layer as stateful values, so these need to be inputs to our recursive functions. In addition, each function will take the layer/group number as an input.

rotateImage :: (MArray array Int m) => array (Int, Int) Int -> m ()
rotateImage arr = do
  ((minR, minC), (maxR, maxC)) <- getBounds arr
  let n = maxR - minR + 1
  let numLayers = n `quot` 2
  ...
  where
    rotateLayer tl@(tlR, tlC) tr@(trR, trC) br@(brR, brC) bl@(blR, blC) n = ...
    
    rotateGroup (tlR, tlC) (trR, trC) (brR, brC) (blR, blC) j = ...

Now we just have to fill in these functions. For rotateLayer, we use the “layer number” parameter as a countdown. Once it reaches 0, we’ll be done. We just need to determine the number of groups in this layer using the column difference of left and right. Then we’ll call rotateGroup for each group.

We make the first call to rotateLayer with numLayers and the original corners, coming from our dimensions. When we recurse, we add/subtract 1 from the corner dimensions, and subtract 1 from the layer number.

rotateImage :: (MArray array Int m) => array (Int, Int) Int -> m ()
rotateImage arr = do
  ((minR, minC), (maxR, maxC)) <- getBounds arr
  let n = maxR - minR + 1
  let numLayers = n `quot` 2
  rotateLayer (minR, minC) (minR, maxC) (maxR, maxC) (maxR, minC) numLayers
  where
    rotateLayer _ _ _ _ 0 = return ()
    rotateLayer tl@(tlR, tlC) tr@(trR, trC) br@(brR, brC) bl@(blR, blC) n = do
      let numGroups = ([0..(trC - tlC - 1)] :: [Int])
      forM_ numGroups (rotateGroup tl tr br bl)
      rotateLayer (tlR + 1, tlC + 1) (trR + 1, trC - 1) (brR - 1, brC - 1) (blR - 1, blC + 1) (n - 1)
    
    rotateGroup (tlR, tlC) (trR, trC) (brR, brC) (blR, blC) j = ...

And how do we rotate a group? We use the same five steps we took in Rust. We save the top left as temp and then move the values around. We use the monadic functions readArray and writeArray to perform these actions in place on our Matrix.

rotateImage :: (MArray array Int m) => array (Int, Int) Int -> m ()
rotateImage arr = do
  ...
  where
    ...
    
    rotateGroup (tlR, tlC) (trR, trC) (brR, brC) (blR, blC) j = do
      temp <- readArray arr (tlR, tlC + j)
      readArray arr (blR - j, blC) >>= writeArray arr (tlR, tlC + j)
      readArray arr (brR, brC - j) >>= writeArray arr (blR - j, blC)
      readArray arr (trR + j, trC) >>= writeArray arr (brR, brC - j)
      writeArray arr (trR + j, trC) temp

Here’s the full implementation:

rotateImage :: (MArray array Int m) => array (Int, Int) Int -> m ()
rotateImage arr = do
  ((minR, minC), (maxR, maxC)) <- getBounds arr
  let n = maxR - minR + 1
  let numLayers = n `quot` 2
  rotateLayer (minR, minC) (minR, maxC) (maxR, maxC) (maxR, minC) numLayers
  where
    rotateLayer _ _ _ _ 0 = return ()
    rotateLayer tl@(tlR, tlC) tr@(trR, trC) br@(brR, brC) bl@(blR, blC) n = do
      let numGroups = ([0..(trC - tlC - 1)] :: [Int])
      forM_ numGroups (rotateGroup tl tr br bl)
      rotateLayer (tlR + 1, tlC + 1) (trR + 1, trC - 1) (brR - 1, brC - 1) (blR - 1, blC + 1) (n - 1)
    
    rotateGroup (tlR, tlC) (trR, trC) (brR, brC) (blR, blC) j = do
      temp <- readArray arr (tlR, tlC + j)
      readArray arr (blR - j, blC) >>= writeArray arr (tlR, tlC + j)
      readArray arr (brR, brC - j) >>= writeArray arr (blR - j, blC)
      readArray arr (trR + j, trC) >>= writeArray arr (brR, brC - j)
      writeArray arr (trR + j, trC) temp
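
To try it out, here is a small harness (our addition, not from the article) using IOArray, which satisfies the MArray constraint in the IO monad:

import Data.Array.IO (IOArray, getElems, newListArray)

-- Build a 3x3 matrix row by row, rotate it in place, and read it back in
-- row-major order: [1..9] should come back as [7,4,1,8,5,2,9,6,3].
tryRotate :: IO ()
tryRotate = do
  arr <- newListArray ((0, 0), (2, 2)) [1 .. 9] :: IO (IOArray (Int, Int) Int)
  rotateImage arr
  getElems arr >>= print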

Conclusion

We’ve got one more Matrix problem to solve next time, and then we’ll move on to some other data structures. To learn more about using Data Structures and Algorithms in Haskell, you can take our Solve.hs course. You’ll get the chance to write a number of data structures from scratch, and you’ll get plenty of practice working with them and using them in algorithms!

by James Bowen at July 21, 2025 08:30 AM

July 18, 2025

Brent Yorgey

Competitive programming in Haskell: sparse tables

Competitive programming in Haskell: sparse tables

Continuing a series of posts on techniques for calculating range queries, today I will present the sparse table data structure, for doing fast range queries on a static sequence with an idempotent combining operation.

Motivation

In my previous post, we saw that if we have a static sequence and a binary operation with a group structure (i.e. every element has an inverse), we can precompute a prefix sum table in \(O(n)\) time, and then use it to answer arbitrary range queries in \(O(1)\) time.

What if we don’t have inverses? We can’t use prefix sums, but can we do something else that still allows us to answer range queries in \(O(1)\)? One thing we could always do would be to construct an \(n \times n\) table storing the answer to every possible range query—that is, \(Q[i,j]\) would store the value of the range \(a_i \diamond \dots \diamond a_j\). Then we could just look up the answer to any range query in \(O(1)\). Naively computing the value of each \(Q[i,j]\) would take \(O(n)\) time, for a total of \(O(n^3)\) time to fill in all the entries in the table (we only have to fill in \(Q[i,j]\) where \(i < j\), but this is still about \(n^2/2\) entries), though it’s not too hard to fill in the table in \(O(n^2)\) total time, spending only \(O(1)\) to fill in each entry—I’ll leave this to you as an exercise.

However, \(O(n^2)\) is often too big. Can we do better? More generally, we are looking for a particular subset of range queries to precompute, such that the total number is asymptotically less than \(n^2\), but we can still compute the value of any arbitrary range query by combining some (constant number of) precomputed ranges. In the case of a group structure, we were able to compute the values for only prefix ranges of the form \(1 \dots k\), then compute the value of an arbitrary range using two prefixes, via subtraction.

A sparse table is exactly such a scheme for precomputing a subset of ranges. (In fact, I believe, but do not know for sure, that this is where the name “sparse table” comes from—it is “sparse” in the sense that it only stores a sparse subset of range values.) Rather than only a linear number of ranges, as with prefix sums, we have to compute \(O(n \lg n)\) of them, but that’s still way better than \(O(n^2)\). Note, however, that a sparse table only works when the combining operation is idempotent, that is, when \(x \diamond x = x\) for all \(x\). For example, we can use a sparse table with combining operations such as \(\max\) or \(\gcd\), but not with \(+\) or \(\times\). Let’s see how it works.
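
To see concretely why idempotence is the crucial property, here is a quick GHCi check (our own example): covering a length-5 range with two overlapping length-4 ranges gives the right answer for max, but double-counts the overlap for +.

λ> a = [3, 1, 4, 1, 5]
λ> max (maximum (take 4 a)) (maximum (drop 1 a))  -- overlapping ranges: fine for max
5
λ> sum (take 4 a) + sum (drop 1 a)                -- overlap double-counted; sum a is 14
20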

Sparse tables

The basic idea behind a sparse table is that we precompute a series of “levels”, where level \(i\) stores values for ranges of length \(2^i\). So level \(0\) stores “ranges of length \(1\)”—that is, the elements of the original sequence; level \(1\) stores ranges of length \(2\); level \(2\) stores ranges of length \(4\); and so on. Formally, \(T[i,j]\) stores the value of the range of length \(2^i\) starting at index \(j\). That is,

\[T[i,j] = a_j \diamond \dots \diamond a_{j+2^i-1}.\]

We can see that \(i\) only needs to go from \(0\) up to \(\lfloor \lg n \rfloor\); above that and the stored ranges would be larger than the entire sequence. So this table has size \(O(n \lg n)\).

Two important questions remain: how do we compute this table in the first place? And once we have it, how do we use it to answer arbitrary range queries in \(O(1)\)?

Computing the table is easy: each range on level \(i\), of length \(2^i\), is the combination of two length-\(2^{i-1}\) ranges from the previous level. That is,

\[T[i,j] = T[i-1, j] \diamond T[i-1, j+2^{i-1}]\]

The zeroth level just consists of the elements of the original sequence, and we can compute each subsequent level using values from the previous level, so we can fill in the entire table in \(O(n \lg n)\) time, doing just a single combining operation for each value in the table.

Once we have the table, we can compute the value of an arbitrary range \([l,r]\) as follows:

  • Compute the biggest power of two that fits within the range, that is, the largest \(k\) such that \(2^k \leq r - l + 1\). We can compute this simply as \(\lfloor \lg (r - l + 1) \rfloor\).

  • Look up two range values of length \(2^k\), one for the range which begins at \(l\) (that is, \(T[k, l]\)) and one for the range which ends at \(r\) (that is, \(T[k, r - 2^k + 1]\)). These two ranges overlap; but because the combining operation is idempotent, combining the values of the ranges yields the value for our desired range \([l,r]\).

    This is why we require the combining operation to be idempotent: otherwise the values in the overlap would be overrepresented in the final, combined value.

Haskell code

Let’s write some Haskell code! First, a little module for idempotent semigroups. Note that we couch everything in terms of semigroups, not monoids, because we have no particular need of an identity element; indeed, some of the most important examples like \(\min\) and \(\max\) don’t have an identity element. The IdempotentSemigroup class has no methods, since as compared to Semigroup it only adds a law. However, it’s still helpful to signal the requirement. You might like to convince yourself that all the instances listed below really are idempotent.

module IdempotentSemigroup where

import Data.Bits
import Data.Semigroup

-- | An idempotent semigroup is one where the binary operation
--   satisfies the law @x <> x = x@ for all @x@.
class Semigroup m => IdempotentSemigroup m

instance Ord a => IdempotentSemigroup (Min a)
instance Ord a => IdempotentSemigroup (Max a)
instance IdempotentSemigroup All
instance IdempotentSemigroup Any
instance IdempotentSemigroup Ordering
instance IdempotentSemigroup ()
instance IdempotentSemigroup (First a)
instance IdempotentSemigroup (Last a)
instance Bits a => IdempotentSemigroup (And a)
instance Bits a => IdempotentSemigroup (Ior a)
instance (IdempotentSemigroup a, IdempotentSemigroup b) => IdempotentSemigroup (a,b)
instance IdempotentSemigroup b => IdempotentSemigroup (a -> b)

Now, some code for sparse tables. First, a few imports.

{-# LANGUAGE TupleSections #-}

module SparseTable where

import Data.Array (Array, array, (!))
import Data.Bits (countLeadingZeros, finiteBitSize, (!<<.))
import IdempotentSemigroup

The sparse table data structure itself is just a 2D array over some idempotent semigroup m. Note that UArray would be more efficient, but (1) that would make the code for building the sparse table more annoying (more on this later), and (2) it would require a bunch of tedious additional constraints on m.

newtype SparseTable m = SparseTable (Array (Int, Int) m)
  deriving (Show)

We will frequently need to compute rounded-down base-two logarithms, so we define a function for it. A straightforward implementation would be to repeatedly shift right by one bit and count the number of shifts needed to reach zero; however, there is a better way, using Data.Bits.countLeadingZeros. It has a naive default implementation which counts right bit shifts, but in most cases it compiles down to much more efficient machine instructions.

-- | Logarithm base 2, rounded down to the nearest integer.  Computed
--   efficiently using primitive bitwise instructions, when available.
lg :: Int -> Int
lg n = finiteBitSize n - 1 - countLeadingZeros n
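
A quick GHCi sanity check (assuming a 64-bit Int):

λ> map lg [1, 2, 3, 8, 15, 16]
[0,1,1,3,3,4]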

Now let’s write a function to construct a sparse table, given a sequence of values. Notice how the sparse table array st is defined recursively. This works because the Array type is lazy in the stored values, with the added benefit that only the array values we end up actually needing will be computed. However, this comes with a decent amount of overhead. If we wanted to use an unboxed array instead, we wouldn’t be able to use the recursive definition trick; instead, we would have to use an STUArray and fill in the values in a specific order. The code for this would be longer and much more tedious, but could be faster if we end up needing all the values in the array anyway.

-- | Construct a sparse table which can answer range queries over the
--   given list in $O(1)$ time.  Constructing the sparse table takes
--   $O(n \lg n)$ time and space, where $n$ is the length of the list.
fromList :: IdempotentSemigroup m => [m] -> SparseTable m
fromList ms = SparseTable st
 where
  n = length ms
  lgn = lg n

  st =
    array ((0, 0), (lgn, n - 1)) $
      zip ((0,) <$> [0 ..]) ms
        ++ [ ((i, j), st ! (i - 1, j) <> st ! (i - 1, j + 1 !<<. (i - 1)))
           | i <- [1 .. lgn]
           , j <- [0 .. n - 1 !<<. i]
           ]

Finally, we can write a function to answer range queries.

-- | $O(1)$. @range st l r@ computes the range query which is the
--   @sconcat@ of all the elements from index @l@ to @r@ (inclusive).
range :: IdempotentSemigroup m => SparseTable m -> Int -> Int -> m
range (SparseTable st) l r = st ! (k, l) <> st ! (k, r - (1 !<<. k) + 1)
 where
  k = lg (r - l + 1)

Applications

Most commonly, we can use a sparse table to find the minimum or maximum values on a range, \(\min\) and \(\max\) being the quintessential idempotent operations. For example, this plays a key role in a solution to the (quite tricky) problem Ograda. (At first it seemed like that problem should be solvable with some kind of sliding window approach, but I couldn’t figure out how to make it work!)

What if we want to find the index of the minimum or maximum value in a given range (see, for example, Worst Weather)? We can easily accomplish this using the semigroup Min (Arg m i) (or Max (Arg m i)), where m is the type of the values and i is the index type. Arg, from Data.Semigroup, is just a pair which uses only the first value for its Eq and Ord instances, and carries along the second value (which is also exposed via Functor, Foldable, and Traversable instances). In the example below, we can see that the call to range st 0 3 returns both the max value on the range (4) and its index (2) which got carried along for the ride:

λ> :m +Data.Semigroup
λ> st = fromList (map Max (zipWith Arg [2, 3, 4, 2, 7, 4, 9] [0..]))
λ> range st 0 3
Max {getMax = Arg 4 2}

Finally, I will mention that being able to compute range minimum queries is one way to compute lowest common ancestors for a (static, rooted) tree. First, walk the tree via a depth-first search and record the depth of each node encountered in sequence, a so-called Euler tour (note that you must record every visit to a node—before visiting any of its children, in between each child, and after visiting all the children). Now the minimum depth recorded between visits to any two nodes will correspond to their lowest common ancestor.
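
Here is a rough sketch of that idea (our own illustration, not code from this post; the Tree representation, eulerTour, and the linear-scan firstVisit are all assumptions made for the example), reusing fromList and range from above:

import qualified Data.Map as M
import Data.Semigroup (Arg (..), Min (..))

type Tree = M.Map Int [Int]  -- node -> children

-- Record (depth, node) at every visit: before, between, and after children.
eulerTour :: Tree -> Int -> [(Int, Int)]
eulerTour t = go 0
 where
  go d v = (d, v) : concat [go (d + 1) c ++ [(d, v)] | c <- M.findWithDefault [] v t]

-- LCA of u and v: the node of minimum depth recorded between their first visits.
lca :: Tree -> Int -> Int -> Int -> Int
lca t root u v = w
 where
  tour = eulerTour t root
  firstVisit x = head [i | (i, (_, y)) <- zip [0 ..] tour, y == x]  -- O(n) scan; fine for a sketch
  (i, j) = (min (firstVisit u) (firstVisit v), max (firstVisit u) (firstVisit v))
  st = fromList [Min (Arg d node) | (d, node) <- tour]
  Min (Arg _ w) = range st i j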

Here are a few problems that involve computing lowest common ancestors in a tree, though note there are also other techniques for computing LCAs (such as binary jumping) which I plan to write about eventually.


by Brent Yorgey at July 18, 2025 12:00 AM

July 16, 2025

Stackage Blog

LTS 24 release for ghc-9.10 and Nightly now on ghc-9.12

Stackage LTS 24 has been released

The Stackage team is happy to announce that Stackage LTS version 24 was finally released a couple of days ago, based on GHC stable version 9.10.2.

LTS 24 includes many package changes, and over 3400 packages! Thank you for all your nightly contributions that made this release possible: the initial release was prepared by Mihai Maruseac. The closest nightly snapshot to lts-24.0 is nightly-2025-07-13.

If your package is missing from LTS 24 and can build there, you can easily have it added by opening a PR in lts-haskell to the build-constraints/lts-24-build-constraints.yaml file.

Stackage Nightly updated to ghc-9.12.2

At the same time we are excited to move Stackage Nightly to GHC 9.12.2: the initial snapshot release is nightly-2025-07-15. Current nightly has over 3100 packages, and we expect that number to grow over the coming weeks and months: we welcome your contributions and help with this. This initial release build was made by Jens Petersen (31 commits).

A number of packages have been disabled with the switch to a new GHC version. You can see all the changes made relative to the last 9.10 nightly snapshot. Apart from trying to build yourself, the easiest way to understand why particular packages are disabled is to look for their < 0 lines in build-constraints.yaml, particularly under the "Library and exe bounds failures" section. We also have some tracking issues still open related to 9.12 core boot libraries.

Thank you to all those who have already done work updating their packages for ghc-9.12.

Adding or enabling your package for Nightly is just a simple pull request to the large build-constraints.yaml file.

If you have questions, you can ask in the Stack and Stackage Matrix room (#haskell-stack:matrix.org) or Slack channel.

July 16, 2025 07:00 AM

July 10, 2025

Tweag I/O

Publish all your crates everywhere all at once

Cargo is the native package manager and build system for Rust, allowing you to easily bring in dependencies from the global crates.io registry,1 or to publish your own crates to crates.io. Tor Hovland and I recently contributed a long-requested feature to Cargo, allowing you to package many interdependent packages in one go. That might not sound like a big deal, but there were a few tricky parts; there’s a reason the original feature request was open for more than 10 years! In this post, I’ll walk you through the feature and — if you’re a Rust developer — tell you how you can try it out.

Workspaces

The Rust unit of packaging — like a gem in Ruby or a module in Go — is called a “crate”, and it’s pretty common for a medium-to-large Rust project to be divided into several of them. This division helps keep code modular and interfaces well-defined, and also allows you to build and test components individually. Cargo supports multi-crate workflows using “workspaces”: a workspace is just a bunch of crates that Cargo handles “together”, sharing a common dependency tree, a common build directory, and so on. A basic workspace might look like this:

.
├── Cargo.toml
├── Cargo.lock
├── taco
│   ├── Cargo.toml
│   └── src
│       ├── lib.rs
│       └── ... more source files
└── tortilla
    ├── Cargo.toml
    └── src
        ├── lib.rs
        └── ... more source files

The top-level Cargo.toml just tells Cargo where the crates in the workspace live.2

# ./Cargo.toml
workspace.members = ["taco", "tortilla"]

The crate-level Cargo.toml files tell us about the crates (surprise!). Here’s taco’s Cargo.toml:

# ./taco/Cargo.toml
[package]
name = "taco"
version = "2.0"
dependencies.tortilla = { path = "../tortilla", version = "1.3" }

The dependency specification is actually pretty interesting. First, it tells us that the tortilla package is located at ../tortilla (relative to taco). When you’re developing locally, Cargo uses this local path to find the tortilla crate. But when you publish the taco crate for public consumption, Cargo strips out the path = "../tortilla" setting because it’s only meaningful within your local workspace. Instead, the published taco crate will depend on version 1.3 of the published tortilla crate. This doubly-specified dependency gives you the benefits of a monorepo (for example, you get to work on tortilla and taco simultaneously and be sure that they stay compatible) without leaking that local setup to downstream users of your crates.

If you’ve been hurt by packaging incompatibilities before, the previous paragraph might have raised some red flags: allowing a dependency to come from one of two places could lead to problems if they get out-of-sync. Like, couldn’t you accidentally make a broken package by locally updating both your crates and then only publishing taco? You won’t see the breakage when building locally, but the published taco will be incompatible with the previously published tortilla. To deal with this issue, Cargo verifies packages before you publish them. When you type cargo publish --package taco, it packages up the taco crate (removing the local ../tortilla dependency) and then unpackages the new package in a temporary location and attempts to build it from scratch. This rebuild-from-scratch sees the taco crate exactly as a downstream user would, and so it will catch any incompatibilities between the existing, published tortilla and the about-to-be-published taco.

Cargo’s crate verification is not completely fool-proof because it only checks that the package compiles.3 In practice, I find that checking compilation is already pretty useful, but I also like to run other static checks.

Publish all my crates

Imagine you’ve been working in your workspace, updating your crates in backwards-incompatible ways. Now you want to bump tortilla to version 2.0 and taco to version 3.0 and publish them both. This isn’t too hard:

  1. Edit tortilla/Cargo.toml to increase the version to 2.0.
  2. Run cargo publish --package tortilla, and wait for it to appear on crates.io.
  3. Edit taco/Cargo.toml to increase its version to 3.0, and change its tortilla dependency to 2.0.
  4. Run cargo publish --package taco.

The ordering is important here. You can’t publish the new taco before tortilla 2.0 is publicly available: if you try, the verification step will fail.

This multi-crate workflow works, but it has two problems:

  1. It can get tedious. With two crates it’s manageable, but what about when the dependency graph gets complicated? I worked for a client whose CI had custom Python scripts for checking versions, bumping versions, publishing things in the right order, and so on. It worked, but it wasn’t pretty.4
  2. It’s non-atomic: if in the process of verifying and packaging dependent crates you discover some problems with the dependencies then you’re out of luck because you’ve already published them. crates.io doesn’t allow deleting packages, so you’ll just have to yank5 the broken packages, increase the version number some more, and start publishing again. This one can’t be solved by scripts or third-party tooling: verifying the dependent crate requires the dependencies to be published.

Starting in mid-2024, my colleague Tor Hovland and I began working on native support for this in Cargo. A few months and dozens of code-review comments later, our initial implementation landed in Cargo 1.83.0. By the way, the Cargo team are super supportive of new contributors — I highly recommend going to their office hours if you’re interested.

How it works

In our implementation, we use a sort of registry “overlay” to verify dependent crates before their dependencies are published. This overlay wraps an upstream registry (like crates.io), allowing us to add local crates to the overlay without actually publishing them upstream. This kind of registry overlay is an interesting topic on its own. The “virtualization” of package sources is an often-requested feature that hasn’t yet been implemented in general because it’s tricky to design without exposing users to dependency confusion attacks: the more flexible you are about where dependencies come from, the easier it is for an attacker to sneak their way into your dependency tree. Our registry overlay passed scrutiny because it’s only available to Cargo internally, and only gets used for workspace-local packages during workspace publishing.

The registry overlay was pretty simple to implement, since it’s just a composition of two existing Cargo features: local registries and abstract sources. A local registry in Cargo is just a registry (like crates.io) that lives on your local disk instead of in the cloud. Cargo has long supported them because they’re useful for offline builds and integration testing. When packaging a workspace we create a temporary, initially-empty local registry for storing the new local packages as we produce them.

Our second ingredient is Cargo’s Source trait: since Cargo can pull dependencies from many different kinds of places (crates.io, private registries, git repositories, etc.), they already have a nice abstraction that encapsulates how to query availability, download, and cache packages from different places. So our registry overlay is just a new implementation of the Source trait that wraps two other Sources: the upstream registry (like crates.io) that we want to publish to, and the local registry that we put our local packages in. When someone queries our overlay source for a package, we check in the local registry first, and fall back to the upstream registry.

A diagram showing crates.io and a local registry feeding into an overlay

Now that we have our local registry overlay, the workspace-publishing workflow looks like this:

  1. Gather all the to-be-published crates and figure out any inter-dependencies. Sort them in a “dependency-compatible” order, meaning that every crate will be processed after all its dependencies.
  2. In that dependency-compatible order, package and verify each crate. For each crate:
    • Package it up, removing any mention of local path dependencies.
    • Unpackage it in a temporary location and check that it builds. This build step uses the local registry overlay, so that it thinks all the local dependencies that were previously added to the local overlay are really published.
    • “Publish” the crate in the local registry overlay.
  3. In the dependency-compatible order, actually upload all the crates to crates.io. This is done in parallel as much as possible. For example, if tortilla and carnitas don’t depend on one another but taco depends on them both, then tortilla and carnitas can be uploaded simultaneously.

It’s possible for the final upload to fail (if your network goes down, for example) and for some crates to remain unpublished; in that sense, the new workspace publishing workflow is not truly atomic. But because all of the new crates have already been verified with one another, you can just retry publishing the ones that failed to upload.

How to try it

Cargo, as critical infrastructure for Rust development, is pretty conservative about introducing new features. Multi-package publishing has recently been stabilized, but for now the change is only available in nightly builds. If you’re using a recent nightly build of Cargo 1.90.0 or later, running cargo publish in a workspace will work as described in this blog post. If you don’t want to publish everything in your workspace, the usual package-selection arguments work as expected: cargo publish --package taco --package tortilla will publish just taco and tortilla, while correctly managing any dependencies between them. Or you can exclude packages with cargo publish --exclude onions.

If you’re using a stable Rust toolchain, workspace publishing will be available in Cargo 1.90 in September 2025.


  1. If you use Node.js, Cargo is like the npm command and crates.io is like the NPM registry. If you use Python, Cargo is like pip (or Poetry, or uv) and crates.io is like PyPI.
  2. It can also contain lots of other useful workspace-scoped information, like dependencies that are common between crates or global compiler settings.
  3. To be even more precise, it only checks that the package compiles against the dependencies that are locked in your Cargo.lock file, which gets included in the package. If you or someone in your dependency tree doesn’t correctly follow semantic versioning, downstream users could still experience compilation problems. In practice, we’ve seen this cause binary packages to break because cargo install ignores the lock file by default.
  4. There are also several third-party tools (for example, cargo-release, cargo-smart-release, and release-plz) to help automate multi-crate releases. If one of these meets your needs, it might be better than a custom script.
  5. “Yanking” is Cargo’s mechanism for marking packages as broken without actually deleting their contents and breaking everyone’s builds.

July 10, 2025 12:00 AM

July 09, 2025

Well-Typed.Com

Developing an application from scratch (Haskell Unfolder #46)

Today, 2025-07-09, at 1830 UTC (11:30 am PDT, 2:30 pm EDT, 7:30 pm GMT, 20:30 CET, …) we are streaming the 46th episode of the Haskell Unfolder live on YouTube.

Developing an application from scratch (Haskell Unfolder #46)

In this episode targeted at beginners, we show the end-to-end application development process, starting from an empty directory. We’ll consider package configuration, taking advantage of editor integration, how to deal with dependencies, organizing code into modules, and parsing command line arguments. We will use this to write a simple but useful application.

About the Haskell Unfolder

The Haskell Unfolder is a YouTube series about all things Haskell hosted by Edsko de Vries and Andres Löh, with episodes appearing approximately every two weeks. All episodes are live-streamed, and we try to respond to audience questions. All episodes are also available as recordings afterwards.

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

There’s now also a web shop where you can buy t-shirts and mugs (and potentially in the future other items) with the Haskell Unfolder logo.

by andres, edsko at July 09, 2025 12:00 AM

July 07, 2025

Haskell Interlude

67: Alex McLean

Mike and Andres speak to Alex McLean who created the TidalCycles system for electronic music - implemented in Haskell of course. We talk about how Alex got into Haskell coming from Perl, how types helped him think about the structure of music and patterns, the architecture and evolution of TidalCycles, about art, community and making space for new ideas, and lots of things in between.

by Haskell Podcast at July 07, 2025 02:00 PM

June 28, 2025

Magnus Therning

Reading Redis responses

When I began experimenting with writing a new Redis client package I decided to use lazy bytestrings, because:

  1. aeson seems to prefer it – the main encoding and decoding functions use lazy bytestrings, though there are strict variants too.
  2. the Builder type in bytestring produces lazy bytestrings.

At the time I was happy to see that attoparsec seemed to support strict and lazy bytestrings equally well.

To get on with things I also wrote the simplest function I could come up with for sending and receiving data over the network – I used send and recv from Network.Socket.ByteString.Lazy in network. The function was really simple

import Network.Socket.ByteString.Lazy qualified as SB

sendCmd :: Conn -> Command r -> IO (Result r)
sendCmd (Conn p) (Command k cmd) = withResource p $ \sock -> do
    _ <- SB.send sock $ toWireCmd cmd
    resp <- SB.recv sock 4096
    case decode resp of
        Left err -> pure $ Left $ RespError "decode" (TL.pack err)
        Right r -> pure $ k <$> fromWireResp cmd r

with decode defined like this

decode :: ByteString -> Either String Resp
decode = parseOnly resp

I knew I'd have to revisit this function; it was naïve to believe that a call to recv would always result in a single complete response. It was however good enough to get going. When I got to improving sendCmd I was a little surprised to find that I'd also have to switch to using strict bytestrings in the parser.

Interlude on the Redis serialisation protocol (RESP3)

The Redis protocol has some defining attributes

  • It's somewhat of a binary protocol. If you stick to keys and values that fall within the set of ASCII strings, then the protocol is humanly readable and you can rather easily use netcat or telnet as a client. However, you aren't limited to storing only readable strings.
  • It's somewhat of a request-response protocol. A notable exception is the publish-subscribe subset, but it's rather small and I reckon most Redis users don't use it.
  • It's somewhat of a type-length-value style protocol. Some of the data types include their length in bytes, e.g. bulk strings and verbatim strings. Other types include the number of elements, e.g. arrays and maps. A large number of them have no length at all, e.g. simple strings, integers, and doubles.

I suspect there are good reasons for these choices; I gather a lot of it has to do with speed. It does however cause one issue when writing a client: it's not possible to read a whole response without parsing it.
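
To make that concrete, here is what a command and a reply look like on the wire (our own illustration of the protocol, not code from the package):

-- A command is an array ('*') of bulk strings ('$', length-prefixed), while
-- simple strings ('+'), integers (':'), and doubles (',') carry no length.
getFoo :: String
getFoo = "*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"  -- the command GET foo

bulkReply :: String
bulkReply = "$5\r\nhello\r\n"  -- a bulk-string reply of 5 bytes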

Rewriting sendCmd

With that extra information about the RESP3 protocol the naïve implementation above falls short in a few ways

  • The read buffer may contain more than one full message and, given the definition of decode above, any remaining bytes are simply dropped.1
  • The read buffer may contain less than one full message and then decode will return an error.2

Surely this must be solvable, because in my mind running the parser results in one of three things:

  1. Parsing is done and the result is returned, together with any input that wasn't consumed.
  2. The parsing is not done due to lack of input, this is typically encoded as a continuation.
  3. The parsing failed so the error is returned, together with input that wasn't consumed.

So, I started looking in the documentation for the module Data.Attoparsec.ByteString.Lazy in attoparsec. I was a little surprised to find that the Result type lacked a way to feed more input to a parser – it only has two constructors, Done and Fail:

data Result r
    = Fail ByteString [String] String
    | Done ByteString r

I'm guessing the idea is that the function producing the lazy bytestring in the first place should be able to produce more chunks of data on demand. That's likely what the lazy variant of recv does, but at the same time it also requires choosing a maximum length and that doesn't rhyme with RESP3. The lazy recv isn't quite lazy in the way I needed it to be.

When looking at the parser for strict bytestrings I calmed down. This parser follows what I've learned about parsers (it's not defined exactly like this; it's parameterised in its input but for the sake of simplicity I show it with ByteString as input):

data Result r
    = Fail ByteString [String] String
    | Partial (ByteString -> Result r)
    | Done ByteString r

Then to my delight I found that there's already a function for handling exactly my problem

parseWith :: Monad m => (m ByteString) -> Parser a -> ByteString -> m (Result a)

I only needed to rewrite the existing parser to work with strict bytestrings and work out how to write a function using recv (for strict bytestrings) that fulfils the requirements to be used as the first argument to parseWith. The first part wasn't very difficult due to the similarity between attoparsec's APIs for lazy and strict bytestrings. The second only had one complication. It turns out recv is blocking, but of course that doesn't work well with parseWith. I wrapped it in timeout based on the idea that timing out means there's no more data and the parser should be given an empty string so it finishes. I also decided to pass the parser as an argument, so I could use the same function for receiving responses for individual commands as well as for pipelines. The full receiving function is

import Data.Attoparsec.ByteString (IResult (..), Parser, parseWith)
import Data.ByteString qualified as BS
import Data.Text (Text)
import Data.Text qualified as T
import Network.Socket qualified as S
import Network.Socket.ByteString qualified as SB
import System.Timeout (timeout)

recvParse :: S.Socket -> BS.ByteString -> Parser r -> IO (Either Text (BS.ByteString, r))
recvParse sock initial parser = do
    parseWith receive parser initial >>= \case
        Fail _ [] err -> pure $ Left (T.pack err)
        Fail _ ctxs err -> pure $ Left $ T.intercalate " > " (T.pack <$> ctxs) <> ": " <> T.pack err
        Partial _ -> pure $ Left "impossible error"
        Done rem result -> pure $ Right (rem, result)
  where
    receive =
        timeout 100_000 (SB.recv sock 4096) >>= \case
            Nothing -> pure BS.empty
            Just bs -> pure bs

Then I only needed to rewrite sendCmd, and I wanted to do it in such a way that any remaining input data could be used by the next call to sendCmd.3 I settled for modifying the Conn type to hold an IORef ByteString together with the socket, and then the function ended up looking like this

sendCmd :: Conn -> Command r -> IO (Result r)
sendCmd (Conn p) (Command k cmd) = withResource p $ \(sock, remRef) -> do
    _ <- SBL.send sock $ toWireCmd cmd
    rem <- readIORef remRef
    recvParse sock rem resp >>= \case
        Left err -> pure $ Left $ RespError "recv/parse" err
        Right (newRem, r) -> do
            writeIORef remRef newRem
            pure $ k <$> fromWireResp cmd r

What's next?

I've started looking into pub/sub, and basically all of the work described in this post is a prerequisite for that. It's not very difficult on the protocol level, but I think it's difficult to come up with a design that allows maximal flexibility. I'm not even sure it's worth the complexity.

Footnotes:

  1. This isn't that much of a problem when sticking to the request-response commands, I think. It most certainly becomes a problem with pub/sub though.

  2. I'm sure that whatever size of buffer I choose to use there'll be someone out there who's storing values that are larger. Then there's pipelining that makes it even more of an issue.

  3. To be honest I'm not totally convinced there'll ever be any remaining input. Unless a single Conn is used by several threads – which would lead to much pain with the current implementation – or pub/sub is used – which isn't supported yet.

June 28, 2025 10:41 AM

June 27, 2025

Brent Yorgey

Competitive programming in Haskell: prefix sums

Competitive programming in Haskell: prefix sums

Posted on June 27, 2025
Tagged , , , , , ,

In a previous blog post I categorized a number of different techniques for calculating range queries. Today, I will discuss one of those techniques which is simple but frequently useful.

Precomputing prefix sums

Suppose we have a static sequence of values \(a_1, a_2, a_3, \dots, a_n\) drawn from some group (that is, there is an associative binary operation with an identity element, and every element has an inverse), and want to be able to compute the total value (according to the group operation) of any contiguous subrange. That is, given a range \([i,j]\), we want to compute \(a_i \diamond a_{i+1} \diamond \dots \diamond a_j\) (where \(\diamond\) is the group operation). For example, we might have a sequence of integers and want to compute the sum, or perhaps the bitwise xor (but not the maximum) of all the values in any particular subrange.

Of course, we could simply compute \(a_i \diamond \dots \diamond a_j\) directly, but that takes \(O(n)\) time. With some simple preprocessing, it’s possible to compute the value of any range in constant time.

The key idea is to precompute an array \(P\) of prefix sums, so \(P_i = a_1 \diamond \dots \diamond a_i\). This can be computed in linear time via a scan; for example:

import Data.Array
import Data.List (scanl')

prefix :: Monoid a => [a] -> Array Int a
prefix a = listArray (0, length a) $ scanl' (<>) mempty a

Actually, I would typically use an unboxed array, which is faster but slightly more limited in its uses: import Data.Array.Unboxed, use UArray instead of Array, and add an IArray UArray a constraint.

Note that we set \(P_0 = 0\) (or whatever the identity element is for the group); this is why I had the sequence of values indexed starting from \(1\), so \(P_0\) corresponds to the empty sum, \(P_1 = a_1\), \(P_2 = a_1 \diamond a_2\), and so on.

Now, for the value of the range \([i,j]\), just compute \(P_j \diamond P_{i-1}^{-1}\)—that is, we start with a prefix that ends at the right place, then cancel or “subtract” the prefix that ends right before the range we want. For example, to find the sum of the integers \(a_5 + \dots + a_{10}\), we can compute \(P_{10} - P_4\).

range :: Group a => Array Int a -> Int -> Int -> a
range p i j = p!j <> inv (p!(i-1))

That’s why this only works for groups but not for general monoids: only in a group can we cancel unwanted values. So, for example, this works for finding the sum of any range, but not the maximum.
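
The snippet assumes a Group class providing inv, which base does not offer; a minimal version consistent with the code above might look like this (the groups package provides the same idea under the name invert):

import Data.Monoid (Sum (..))

-- A group is a monoid in which every element has an inverse.
class Monoid a => Group a where
  inv :: a -> a

instance Num a => Group (Sum a) where
  inv = Sum . negate . getSum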

Practice problems

Want to practice? Here are a few problems that can be solved using techniques discussed in this post:

It is possible to generalize this scheme to 2D—that is, to compute the value of any subrectangle of a 2D grid of values from some group in only \(O(1)\) time. I will leave you the fun of figuring out the details.

If you’re looking for an extra challenge, here are a few harder problems which use techniques from this post as an important component, but require some additional nontrivial ingredients:


by Brent Yorgey at June 27, 2025 12:00 AM

June 25, 2025

Well-Typed.Com

Haskell records in 2025 (Haskell Unfolder #45)

Today, 2025-06-25, at 1830 UTC (11:30 am PDT, 2:30 pm EDT, 7:30 pm GMT, 20:30 CET, …) we are streaming the 45th episode of the Haskell Unfolder live on YouTube.

Haskell records in 2025 (Haskell Unfolder #45)

Haskell records as originally designed have had a reputation of being somewhat weird or, at worst, useless. A lot of features and modifications have been proposed over the years to improve the situation. But not all of these got implemented, or widespread adoption. The result is that the situation now is quite different from what it was in the old days, and additional changes are in the works. But the current state can be a bit confusing. Therefore, in this episode, we are going to look at how to make best use of Haskell records right now, discussing extensions such as DuplicateRecordFields, NoFieldSelectors, OverloadedRecordDot and OverloadedRecordUpdate, and we’ll get take a brief look at optics.

About the Haskell Unfolder

The Haskell Unfolder is a YouTube series about all things Haskell hosted by Edsko de Vries and Andres Löh, with episodes appearing approximately every two weeks. All episodes are live-streamed, and we try to respond to audience questions. All episodes are also available as recordings afterwards.

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

There’s now also a web shop where you can buy t-shirts and mugs (and potentially in the future other items) with the Haskell Unfolder logo.

by andres, edsko at June 25, 2025 12:00 AM

June 24, 2025

Haskell Interlude

66: Daniele Micciancio

Niki and Mike talked to Daniele Micciancio who is a professor at UC San Diego. He's been using Haskell for 20 years, and works in lattice cryptography. We talked to him about how he got into Haskell, using Haskell for teaching theoretical computer science and of course for his research and the role type systems and comonads could play in the design of cryptographic algorithms. Along the way, he gave an accessible introduction to post-quantum cryptography which we really enjoyed. We hope you do, too. 

by Haskell Podcast at June 24, 2025 02:00 PM

June 23, 2025

Brent Yorgey

Competitive programming in Haskell: range queries, classified

Competitive programming in Haskell: range queries, classified

Posted on June 23, 2025
Tagged , , , , ,

Static range queries

Suppose we have a sequence of values, which is static in the sense that the values in the sequence will never change, and we want to perform range queries, that is, for various ranges we want to compute the total of all consecutive values in the range, according to some binary combining operation. For example, we might want to compute the maximum, sum, or product of all the consecutive values in a certain subrange. We have various options depending on the kind of ranges we want and the algebraic properties of the operation.

  • If we want ranges corresponding to a sliding window, we can use an amortized queue structure to find the total of each range in \(O(1)\), for an arbitrary monoid.

  • If we want arbitrary ranges but the operation is a group, the solution is relatively straightforward: we can precompute all prefix sums, and subtract to find the result for an arbitrary range in \(O(1)\).

  • If the operation is an idempotent semigroup (that is, it has the property that \(x \diamond x = x\) for all \(x\)), we can use a sparse table, which takes \(O(n \lg n)\) time and space for precomputation, and then allows us to answer arbitrary range queries in \(O(1)\).

  • If the operation is an arbitrary monoid, we can use a sqrt tree, which uses \(O(n \lg \lg n)\) precomputed time and space, and allows answering arbitrary range queries in \(O(\lg \lg n)\). I will write about this in a future post.

Dynamic range queries

What if we want dynamic range queries, that is, we want to be able to interleave range queries with arbitrary updates to the values of the sequence?

  • If the operation is an arbitrary monoid, we can use a segment tree.
  • If the operation is a group, we can use a Fenwick tree.

I published a paper about Fenwick trees, which also discusses segment trees, but I should write more about them here!

Table

Here’s a table summarizing the above classification scheme. I plan to fill in links as I write blog posts about each row.

Sequence Ranges Operation Solution Precomputation Queries
Static Sliding window Monoid Amortized queue \(O(1)\) \(O(1)\)
Static Arbitrary Group Prefix sum table \(O(n)\) \(O(1)\)
Static Arbitrary Idempotent semigroup Sparse table \(O(n \lg n)\) \(O(1)\)
Static Arbitrary Monoid Sqrt tree \(O(n \lg \lg n)\) \(O(\lg \lg n)\)
Dynamic Arbitrary Group Fenwick tree \(O(n)\) \(O(\lg n)\)
Dynamic Arbitrary Monoid Segment tree \(O(n)\) \(O(\lg n)\)

by Brent Yorgey at June 23, 2025 12:00 AM

June 22, 2025

Philip Wadler

How to market Haskell to a mainstream programmer

An intriguing talk by Gabriella Gonzalez, delivered at Haskell Love 2020. Based largely on the famous marketing book, Crossing the Chasm. Gonzalez argues that marketing is not about hype, it is about setting priorities: what features and markets are you going to ignore? The key to adoption is to be able to solve a problem that people need solved today and where existing mainstream tools are inadequate. Joe Armstrong will tell you that the key to getting Erlang used was to approach failing projects and ask "Would you like us to build you a prototype?" Gonzalez makes a strong case that Haskell should first aim to capture the interpreters market. She points out that the finance/blockchain market may be another possibility. Recommended to me at Lambda Days by Pedro Abreu, host of the Type Theory Forall podcast.



by Philip Wadler (noreply@blogger.com) at June 22, 2025 07:07 PM

What is happening in Gaza is an injury to our collective conscience. We must be allowed to speak out


A powerful op-ed by Gabor Maté in the Toronto Star.

Just as nothing justifies the atrocities of October 7, nothing about October 7 justifies Israeli atrocities against the Palestinians, either before or since October 7. Recently, I listened to orthopedic surgeon Dr. Deirdre Nunan, like me a graduate of UBC’s Faculty of Medicine, recount her harrowing experiences serving in a Gaza hospital under the siege that followed Israel’s breaking of the ceasefire in March. Her depictions of unspeakable horror, enacted as policy by one of the world’s most sophisticated militaries, were soul shattering. Many other physicians — Canadian, American, Jewish, Muslim, Christian — who have worked in Gaza speak in similar terms. British doctors describe witnessing “a slaughterhouse.” All their testimonies are widely accessible. The leading medical journal Lancet editorialized that in its assault on health care facilities and personnel in Gaza, “the Israeli Government has acted with impunity … Many medical academies and health professional organizations that claim a commitment to social justice have failed to speak out.” ...

It may be true that antisemitic animus can lurk behind critiques of Zionism. But in my decades of advocacy for Palestinian rights including medical visits to Gaza and the West Bank, I have rarely witnessed it. When present, it has a certain tone that one can feel is directed at Jewishness itself, rather than at the theory and practice of Zionism or at Israel’s actions. What is far more common and genuinely confusing for many is that Israel and its supporters, Jews and non-Jews, habitually confound opposition to Israeli policy with antisemitism. This is akin to Vietnam War protesters being accused of anti-Americanism. How is opposing the napalming of human beings anti-American or, say, deploring Israel’s use of mass starvation as a weapon of war in any sense anti-Jewish? ...

People deserve the right to experience as much liberty to publicly mourn, question, oppose, deplore, denounce what they perceive as the perpetration of injustice and inhumanity as they are, in this country, to advocate for the aims and actions of the Israeli government and its Canadian abettors amongst our political leadership, academia, and media.

Even if we feel powerless to stop the first genocide we have ever watched on our screens in real time, allow at least our hearts to be broken openly, as mine is. And more, let us be free to take democratic, non-hateful action without fear of incurring the calumny of racism.

Thanks to a colleague in the Scottish Universities Jewish Staff Network for bringing it to my attention.

by Philip Wadler (noreply@blogger.com) at June 22, 2025 05:03 PM

The Provocateurs: Brave New Bullshit

[Reposting with update.]

Following two sell-out shows at the Fringe last year, I'm on at the Fringe again:

11.25 Monday 4 August, Stand 2 w/Lucy Remnant and Susan Morrison
17.40 Sunday 17 August, Stand 4 w/Smita Kheria and Sarah-Jane Judge
17.40 Tuesday 19 August, Stand 4 w/Cameron Wyatt and Susan Morrison

Shows are under the banner of The Provocateurs (formerly Cabaret of Dangerous Ideas). Tickets go on sale Wednesday 7 May, around noon. The official blurb is brief:

Professor Philip Wadler (The University of Edinburgh) separates the hopes and threats of AI from the chatbot bullshit.

Here is a longer blurb, from my upcoming appearance at Curious, run by the RSE, in September.
Brave New Bullshit
In an AI era, who wins and who loses?

Your future workday might look like this: 
  • You write bullet points.
  • You ask a chatbot to expand them into a report.
  • You send it to your boss ...
  • Who asks a chatbot to summarise it to bullet points.
Will AI help you to do your job or take it from you? Is it fair for AI to be trained on copyrighted material? Will any productivity gains benefit everyone or only a select few?
 
Join Professor Philip Wadler’s talk as he looks at the hopes and threats of AI, exploring who wins and who loses.

by Philip Wadler (noreply@blogger.com) at June 22, 2025 04:40 PM

June 20, 2025

Magnus Therning

Finding a type for Redis commands

Arriving at a type for Redis commands required a bit of exploration. I had some ideas early on that I for various reasons ended up dropping on the way. This is a post about my travels; hopefully someone finds it worthwhile reading.

The protocol

The Redis Serialization Protocol (RESP) initially reminded me of JSON and I thought that following the pattern of aeson might be a good idea. I decided up-front that I'd only support the latest version of RESP, i.e. version 3. So, I thought of a data type, Resp with a constructor for each RESP3 data type, and a pair of type classes, FromResp and ToResp for converting between Haskell types and RESP3. Then after some more reflection I realised that converting to RESP is largely pointless. The main reason to convert anything to RESP3 is to assemble a command, with its arguments, to send to Redis, but all commands are arrays of bulk strings so it's unlikely that anyone will actually use ToResp.1 So I scrapped the idea of ToResp. FromResp looked like this

class FromResp a where
    fromResp :: Resp -> Either FromRespError a
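The Resp type itself isn't shown in the post; based on the constructors that appear later (BulkString, Array, Null, Number, SimpleError), a rough sketch might look like this, where the exact constructor set and field types are guesses:

data Resp
    = SimpleString ByteString
    | SimpleError ByteString ByteString -- error code and description
    | Number Int64                      -- Int64 from Data.Int (assumed)
    | BulkString ByteString
    | Array [Resp]
    | Null
    -- ... plus the remaining RESP3 types (doubles, maps, sets, ...)
    deriving (Show, Eq)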

When I started defining commands I didn't like the number of ByteString arguments that resulted in, so I defined a data type, Arg, and an accompanying type class for arguments, ToArg:

newtype Arg = Arg {unArg :: [ByteString]}
    deriving (Show, Semigroup, Monoid)

class ToArg a where
    toArg :: a -> Arg

Later on I saw that it might also be nice to have a type class specifically for keys, ToKey, though that's a wrapper for a single ByteString.
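The post doesn't show ToKey, but given how Key and toKey are used further down, it presumably looks roughly like this (the instances are my guesses):

newtype Key = Key {unKey :: ByteString}

class ToKey a where
    toKey :: a -> Key

-- Assumed instances, mirroring the spirit of ToArg:
instance ToKey ByteString where
    toKey = Key

instance ToKey Text where
    toKey = Key . encodeUtf8 -- encodeUtf8 from Data.Text.Encoding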

Implementing the functions to encode/decode the protocol was a straightforward application of attoparsec and bytestring (using its Builder).

A command is a function in need of a sender

Even though supporting pipelining was one of the goals, I felt a need to make sure I'd understood the protocol, so I started off with single commands. The protocol is, at its core, a simple request/response protocol, so I settled on this type for commands

type Cmd a = forall m. (Monad m) => (ByteString -> m ByteString) -> m (Either FromRespError a)

that is, a command is a function that accepts a sender and returns an Either FromRespError a.

I wrote a helper function for defining commands, sendCmd

sendCmd :: (Monad m, FromResp a) => [ByteString] -> (ByteString -> m ByteString) -> m (Either FromRespError a)
sendCmd cmdArgs send = do
    let cmd = encode $ Array $ map BulkString cmdArgs
    send cmd <&> decode >>= \case
        Left desc -> pure $ Left $ FromRespError "Decode" (Text.pack desc)
        Right v -> pure $ fromResp v

which made it easy to define commands. Here are two examples, append and mget:

append :: (ToArg a, ToArg b) => a -> b -> Cmd Int
append key val = sendCmd $ ["APPEND"] <> unArg (toArg key <> toArg val)

-- | https://redis.io/docs/latest/commands/mget/
mget :: (ToArg a, FromResp b) => NE.NonEmpty a -> Cmd (NE.NonEmpty b)
mget ks = sendCmd $ ["MGET"] <> unArg (foldMap1 toArg ks)

The function to send off a command and receive its response, sendAndRecieve, was just a call to send followed by a call to recv in network (the variants for lazy bytestrings).
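Running a single command is then just a matter of instantiating the sender; a small sketch, assuming a Connection type and sendAndRecieve :: Connection -> ByteString -> IO ByteString:

-- Run one command over a connection by picking m = IO.
runCmd :: Connection -> Cmd a -> IO (Either FromRespError a)
runCmd conn cmd = cmd (sendAndRecieve conn)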

I sort of liked this representation – there's always something pleasant about finding a way to represent something as a function. There's a very big problem with it though: it's difficult to implement pipelining!

Yes, Cmd is a functor since (->) r is a functor, and thus it's possible to make it an Applicative, e.g. using free. However, to implement pipelining it's necessary to

  1. encode all commands, then
  2. concatenate them all into a single bytestring and send it
  3. read the response, which is a concatenation of the individual commands' responses, and
  4. convert each separate response from RESP3.

That isn't easy when each command contains its own encoding and decoding. The sender function would have to relinquish control after encoding each command, and resume again later to decode the response. I suspect it's doable using continuations, or monad-coroutine, but it felt complicated and rather than travelling down that road I asked for ideas on the Haskell Discourse. The replies led me to a paper, Free delivery, and a bit later a package, monad-batcher. When I got the pointer to the package I'd already read the paper and started implementing the ideas in it, so I decided to save exploring monad-batcher for later.

A command for free delivery

The paper Free delivery is a perfect match for pipelining in Redis, and my understanding is that it proposes a solution where

  1. Commands are defined as a GADT, Command a.
  2. Two functions are defined to serialise and deserialise a Command a. In the paper they use String as the serialisation, so show and read is used.
  3. A type, ActionA a, is defined that combines a command with a modification of its result (of type a). It implements Functor.
  4. A free type, FreeA f a is defined, and made into an Applicative with the constraint that f is a Functor.
  5. A function, serializeA, is defined that traverses a FreeA ActionA a serialising each command.
  6. A function, deserializeA, is defined that traverses a FreeA ActionA a deserialising the response for each command.

I defined a command type, Command a, with only three commands in it, echo, hello, and ping. I then followed the recipe above to verify that I could get it working at all. The Haskell used in the paper is showing its age, and there seems to be a Functor instance missing, but it was still straightforward and I could verify that it worked against a locally running Redis.

Then I made a few changes…

I renamed the command type to Cmd so I could use Command for what the paper calls ActionA.

data Cmd r where
    Echo :: Text -> Cmd Text
    Hello :: Maybe Int -> Cmd ()
    Ping :: Maybe Text -> Cmd Text

data Command a = forall r. Command !(r -> a) !(Cmd r)

instance Functor Command where
    fmap f (Command k c) = Command (f . k) c

toWireCmd :: Cmd r -> ByteString
toWireCmd (Echo msg) = _
toWireCmd (Hello ver) = _
toWireCmd (Ping msg) = _

fromWireResp :: Cmd r -> Resp -> Either RespError r
fromWireResp (Echo _) = fromResp
fromWireResp (Hello _) = fromResp
fromWireResp (Ping _) = fromResp

(At this point I was still using FromResp.)
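The toWireCmd bodies are elided above. Since every command is an array of bulk strings, a plausible completion looks like this; the exact argument encodings are my assumptions, with OverloadedStrings in effect, encodeUtf8 from Data.Text.Encoding, and C8 standing for Data.ByteString.Char8:

-- A guessed completion of the elided bodies, reusing encode,
-- Array, and BulkString from the protocol section.
toWireCmd :: Cmd r -> ByteString
toWireCmd (Echo msg) =
    encode . Array $ [BulkString "ECHO", BulkString (encodeUtf8 msg)]
toWireCmd (Hello ver) =
    encode . Array $ BulkString "HELLO" : maybe [] (pure . BulkString . C8.pack . show) ver
toWireCmd (Ping msg) =
    encode . Array $ BulkString "PING" : maybe [] (pure . BulkString . encodeUtf8) msg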

I also replaced the free applicative defined in the paper and started using free. A couple of type aliases make it a little easier to write nice signatures

type Pipeline a = Ap Command a

type PipelineResult a = Validation [RespError] a

and defining individual pipeline commands turned into something rather mechanical. (I also swapped the order of the arguments to build a Command so I can use point-free style here.)

liftPipe :: (FromResp r) => Cmd r -> Pipeline r
liftPipe = liftAp . Command id

echo :: Text -> Pipeline Text
echo = liftPipe . Echo

hello :: Maybe Int -> Pipeline ()
hello = liftPipe . Hello

ping :: Maybe Text -> Pipeline Text
ping = liftPipe . Ping
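With these wrappers, assembling a pipeline is ordinary Applicative code; a small example of my own (assuming OverloadedStrings for the Text literal):

-- Switch to RESP3, echo a message, and ping, all in one round-trip.
helloEchoPing :: Pipeline (Text, Text)
helloEchoPing = (,) <$> (hello (Just 3) *> echo "hi") <*> ping Nothing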

One nice thing with switching to free was that serialisation became very simple

toWirePipeline :: Pipeline a -> ByteString
toWirePipeline = runAp_ $ \(Command _ c) -> toWireCmd c

On the other hand deserialisation became a little more involved, but it's not too bad

fromWirePipelineResp :: Pipeline a -> [Resp] -> PipelineResult a
fromWirePipelineResp (Pure a) _ = pure a
fromWirePipelineResp (Ap (Command k c) p) (r : rs) = fromWirePipelineResp p rs <*> (k <$> liftError singleton (fromWireResp c r))
fromWirePipelineResp _ _ = Failure [RespError "fromWirePipelineResp" "Unexpected wire result"]
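Tying serialisation and deserialisation together, a full pipeline round-trip is then roughly the following (exchange, a hypothetical helper that sends the bytes and splits the concatenated responses, stands in for the network round-trip):

runPipeline :: Functor m => (ByteString -> m [Resp]) -> Pipeline a -> m (PipelineResult a)
runPipeline exchange p = fromWirePipelineResp p <$> exchange (toWirePipeline p)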

Everything was working nicely and I started adding support for more commands. I used the small service from work to guide my choice of what commands to add. First out was del, then get and set. After adding lpush I was pretty much ready to try to replace hedis in the service from work.

data Cmd r where
    -- echo, hello, ping
    Del :: (ToKey k) => NonEmpty k -> Cmd Int
    Get :: (ToKey k, FromResp r) => k -> Cmd r
    Set :: (ToKey k, ToArg v) => k -> v -> Cmd Bool
    Lpush :: (ToKey k, ToArg v) => k -> NonEmpty v -> Cmd Int

However, looking at the above definition got me thinking.

  • Was it really a good idea to litter Cmd with constraints like that?
  • Would it make sense to keep the Cmd type a bit closer to the actual Redis commands?
  • Also, maybe FromResp wasn't such a good idea after all; what if I removed it?

That brought me to the third version of the type for Redis commands.

Converging and simplifying

While adding new commands and writing instances of FromResp I slowly realised that my initial thinking of RESP3 as somewhat similar to JSON didn't really pan out. I had quickly dropped ToResp and now the instances of FromResp didn't sit right with me. They obviously had to "follow the commands", so to speak, but at the same time allow users to bring their own types. For instance, LPUSH returns the number of pushed messages, but at the same time GET should be able to return an Int too. This led to Int's FromResp looking like this

instance FromResp Int where
    fromResp (BulkString bs) =
        case parseOnly (AC8.signed AC8.decimal) bs of
            Left s -> Left $ RespError "FromResp" (TL.pack s)
            Right n -> Right n
    fromResp (Number n) = Right $ fromEnum n
    fromResp _ = Left $ RespError "FromResp" "Unexpected value"

I could see this becoming worse. Take the instance for Bool: I'd have to consider that

  • for MOVE Integer 1 means True and Integer 0 means False
  • for SET SimpleString "OK" means True
  • users would justifiably expect a bunch of bytestrings to be True, e.g. BulkString "true", BulkString "TRUE", BulkString "1", etc

However, it's impossible to cover all ways users can encode a Bool in a ByteString so no matter what I do users will end up having to wrap their Bool with newtype and implement a fitting FromResp. On top of that, even though I haven't found any example of it yet, I fully expect there to be, somewhere in the large set of Redis commands, at least two commands each wanting an instance of a basic type that simply can't be combined into a single instance, meaning that the client library would need to do some newtype wrapping too.
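Such a newtype workaround would look something like this (a sketch of what a user might write, assuming OverloadedStrings; not anything the library provides):

newtype Truthy = Truthy {unTruthy :: Bool}

instance FromResp Truthy where
    fromResp (BulkString "true") = Right (Truthy True)
    fromResp (BulkString "1") = Right (Truthy True)
    fromResp (BulkString _) = Right (Truthy False)
    fromResp _ = Left $ RespError "FromResp" "Unexpected value"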

No, I really didn't like it! So, could I get rid of FromResp and still offer users an API where they can use their own types as the result of commands?

To be concrete I wanted this

data Cmd r where
    -- other commands
    Get :: (ToKey k) => k -> Cmd (Maybe ByteString)

and I wanted the user to be able to conveniently turn a Cmd r into a Cmd s. In other words, I wanted a Functor instance. Making Cmd itself a functor isn't necessary and I just happened to already have a functor type that wraps Cmd, the Command type I used for pipelining. If I were to use that I'd need to write wrapper functions for each command though, but if I did that then I could also remove the ToKey/ToArg constraints from the constructors of Cmd r and put them on the wrapper instead. I'd get

data Cmd r where
    -- other commands
    Get :: Key -> Cmd (Maybe ByteString)

get :: (ToKey k) => k -> Command (Maybe ByteString)
get = Command id . Get . toKey
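Other commands would get the same treatment; for instance, assuming Set is similarly de-constrained to Set :: Key -> Arg -> Cmd Bool (my guess), its wrapper would be:

set :: (ToKey k, ToArg v) => k -> v -> Command Bool
set k v = Command id $ Set (toKey k) (toArg v)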

I'd also have to rewrite fromWireResp so it's more specific for each command. Instead of

fromWireResp :: Cmd r -> Resp -> Either RespError r
fromWireResp (Get _) = fromResp
...

I had to match up exactly on the possible replies to GET

fromWireResp :: Cmd r -> Resp -> Either RespError r
fromWireResp _ (SimpleError err desc) = Left $ RespError (T.decodeUtf8 err) (T.decodeUtf8 desc)
fromWireResp (Get _) (BulkString bs) = Right $ Just bs
fromWireResp (Get _) Null = Right Nothing
...
fromWireResp _ _ = Left $ RespError "fromWireResp" "Unexpected value"

Even though it was more code I liked it better than before, and I think it's slightly simpler code. I also hope it makes use of the API a bit simpler and clearer.

Here's an example from the code for the service I wrote for work. It reads a UTC timestamp stored at timeKey; the timestamp is a JSON string, so it needs to be decoded.

readUTCTime :: Connection -> IO (Maybe UTCTime)
readUTCTime conn =
    sendCmd conn (maybe Nothing decode <$> get timeKey) >>= \case
        Left _ -> pure Nothing
        Right datum -> pure datum

What's next?

I'm pretty happy with the command type for now, though I have a feeling I'll have to revisit Arg and ToArg at some point.

I've just turned the Connection type into a pool using resource-pool, and I started looking at pub/sub. The latter thing, pub/sub, will require some thought and experimentation I think. Quite possibly it'll end up in a post here too.

I also have a lot of commands to add.

Footnotes:

1

Of course one could use RESP3 as the serialisation format for storing values in Redis. Personally I think I'd prefer using something more widely used, and easier to read, such as JSON or BSON.

June 20, 2025 09:40 PM

Well-Typed.Com

GHC activities report: March–May 2025

This is the twenty-seventh edition of our GHC activities report, which describes the work Well-Typed are doing on GHC, Cabal, HLS and other parts of the core Haskell toolchain. The current edition covers roughly the months of March 2025 to May 2025. You can find the previous editions collected under the ghc-activities-report tag.

Sponsorship

We offer Haskell Ecosystem Support Packages to provide commercial users with support from Well-Typed’s experts, while investing in the Haskell community and its technical ecosystem including through the work described in this report. To find out more, read our recent announcement of these packages in partnership with the Haskell Foundation. We need funding to continue this essential maintenance work!

Many thanks to our Haskell Ecosystem Supporters: Channable and QBayLogic; to our existing clients who also contribute to making this work possible: Anduril, Juspay and Mercury; and to the HLS Open Collective for supporting HLS release management.

Team

The Haskell toolchain team at Well-Typed currently includes:

In addition, many others within Well-Typed contribute to GHC, Cabal and HLS occasionally, or contribute to other open source Haskell libraries and tools.

GHC

Highlights

Explicit level imports

Following on from our best paper prize at TFP 2025, Matthew implemented Explicit Level Imports (GHC proposal #682, !14241).

This feature allows one to specify whether imports are needed for running Template Haskell splices, or for generating Template Haskell quotes. This cleanly separates which modules are required at compile-time vs those that are required at runtime. For example, the pandoc package uses the Template Haskell deriveJSON function from the aeson package. This function can be imported using a splice import:

{-# LANGUAGE ExplicitLevelImports #-}
{-# LANGUAGE TemplateHaskell #-}
module Text.Pandoc.App.Opt where
import splice Data.Aeson.TH (deriveJSON, defaultOptions)
-- + many other non-splice imports

data XYZ = ...
$(deriveJSON defaultOptions ''XYZ)

Declaring the Data.Aeson.TH import as a splice import informs GHC that this module is required only at compile-time, and (crucially) that other, non-splice, imports are not needed at compile time. This hugely improves the performance of tools that use -fno-code (such as HLS), as GHC is no longer required to pessimistically assume that all modules imported in a module enabling TemplateHaskell are required at compile-time.

GHCi support for primops

Andreas significantly improved GHCi performance by implementing certain GHC primops (such as integer arithmetic operations) directly in the bytecode interpreter (!13978).

Reductions in runtime of up to 50% have been observed, with GHC-in-GHCi speeding up by about 15%.

Improvements to the debugger

Rodrigo has made numerous improvements to the GHCi debugger, which had accumulated many bugs over the years due to lack of maintenance (!14246, !14195, !14160, !14106, !14196, !13997). Usability is improved across the board, ranging from quality-of-life fixes, such as adding breakpoints to all statements in a do block to make debugging more predictable (#25932), to significant performance improvements to :steplocal (#25779).

Rodrigo also published the ghc-debugger package including an executable ghc-debug-adapter. This implements the Debug Adapter Protocol, enabling Haskell programs to be stepped-through and debugged from editors such as Visual Studio Code. ghc-debug-adapter depends on many recent changes to GHC, so it is compatible only with the upcoming GHC 9.14.

Expressions in SPECIALISE pragmas

Sam worked with Simon Peyton Jones to finalise MR !12319 “Expressions in SPECIALISE pragmas”. This change means that a SPECIALISE pragma is no longer required to be simply a type signature; it can be an arbitrary expression. For full details, see GHC proposal #493, but two particular idioms are worth noting. Firstly, the type at which to specialise can now be specified by a type application, e.g.

myFunction :: forall a. Num a => a -> Maybe a -> (a, a)
myFunction = ...
{-# SPECIALISE myFunction @Int #-}

This specialise pragma is much more concise than:

{-# SPECIALISE myFunction :: Int -> Maybe Int -> (Int, Int) #-}

and less prone to breakage when the type of myFunction changes.

Secondly, the syntax enables value specialisation, for example:

mainFunction :: Bool -> ...
mainFunction debug = if debug then ... else ...
{-# SPECIALISE mainFunction False #-}

This tells GHC to optimise the non-debug code path, without the debug logic potentially getting in the way.

Multiple Home Units support in GHCi

GHC 9.14 is fully compatible with multiple home units, including all GHCi commands and the GHCi debugger, thanks to work by Hannes about which we recently published a blog post (!14231). Our new design generalises the architecture of GHCi so that multi-unit and single-unit sessions are handled in the same way. The uniform handling will make sure that multi-unit sessions work correctly as GHCi evolves.

GHC Releases

Frontend

  • Sam fixed a regression in the implementation of QuickLook in GHC 9.12 that would cause valid programs to be rejected (#26030, #25950, !14235).

  • Sam fixed a problem in which HasCallStack evidence was incorrectly cached in GHC, causing GHC to bogusly report identical call stacks (#25529, !14084).

  • Sam rectified several oversights in the initial implementation of the NamedDefaults language extension laid out in GHC proposal #409:

    • an issue with exporting named defaults (#25857, !14142),
    • lack of support for named default declarations for poly-kinded typeclasses such as Typeable (#25882, !14143),
    • an oversight in which NamedDefaults changed the behaviour of existing programs (#25775, !14075, ghc-proposals#694).
  • Sam fixed duplicate record fields sometimes being reported as unused when they are actually used (#24035, !14066).

  • Sam improved the error message emitted by GHC when one attempts to write a non-class at the head of a typeclass instance (#22688, !14105).

  • Sam fixed several issues with the renaming of export lists:

    • one issue involved the TypeData extension (#24027, !14119),
    • another was to do with bundled pattern synonyms (#25892, !14154).
  • Sam made “illegal term-level use” error messages more user friendly (#23982, !14122). That MR also improved the way GHC reports name qualification to the user, preferring to display the user-written qualification in error messages.

  • Sam fixed GHC creating unnecessary cycle-breaker variables, which could cause problems for type-checking plugins that weren’t expecting them (#25933, !14206).

  • Sam implemented the deprecation described in GHC proposal #448: the combination of ScopedTypeVariables and TypeApplications no longer enables the use of type applications in constructor patterns, requiring instead the TypeAbstractions extension (!13551).

  • Sam fixed an issue in which equal types compared non-equal under TypeRep-equality by implementing a suggestion by Krzysztof Gogolewski (#25998, !14281).

  • Sam improved the documentation surrounding defaulting in the user’s guide, providing a high-level overview of the different mechanisms in GHC for defaulting ambiguous type variables (#25807, !14057).

Backend

  • Ben and Sam investigated testsuite failures in the LLVM backend (#25769). They identified many different issues:

    • #25730 concerned incorrect type annotations in the generated LLVM, fixed in !13936.
    • #25770, #25773 were symptoms of a serious bug in the implementation of floating-point register padding (fixed in !14134),
    • !14129 fixed incorrect type annotations in the LLVM for atomic operations, adding new tests to Cmm Lint to avoid similar bugs in the future.
    • Most of the other bugs involved initializers/finalizers, which were due to incorrect linkage annotation for builtin arrays (fixed in !14157).
  • Rodrigo worked with Simon Peyton Jones to fix an issue in which the presence or absence of unrelated RULES could affect compilation, leading to non-deterministic compilation (#25170, !13884).

  • Andreas fixed a bug in which GHC would construct over-saturated constructor applications, which caused a panic when building the xmonad-contrib package (#23865, !14036).

  • Andreas made GHC constant-fold away invalid tagToEnum# calls to a particular error expression, which unlocks dead-code elimination opportunities and makes it easier to debug issues that arise from invalid use of tagToEnum# (#25976, !14254)

  • Andreas added -fhuge-code-sections, an off-by-default flag that provides a workaround for AArch64 users running into bug #24648.

  • Matthew overhauled the driver to bring one-shot compilation and make mode in line with each other, by consistently using the module graph to answer queries related to the module import structure (!14198, !14209). This was partly motivated by implementation requirements of the “Explicit Splice Imports” proposal, for which module graph queries are a central component.

  • Matthew added support for “fixed” nodes in the module graph, which can be used for modules without corresponding source-files that are e.g. generated via the GHC API (#25920, !14187).

  • Rodrigo moved some DynFlags consistency checks in order to consolidate the logic into the core makeDynFlagsConsistent function.

  • Ben changed how GHC prints Uniques to the user to avoid NULL characters (#25989, !14265).

Compiler performance

  • Matthew improved the performance of the bytecode assembler by ensuring the code is properly specialised (!13983).

  • Matthew made sure that forceModIface properly forced all fields of ModIface in order to avoid space leaks (!14078).

  • Matthew removed unused mi_used_th and mi_hpc fields from interfaces, which were needlessly bloating interface files (!14073).

  • Matthew avoided allocation of intermediate ByteStrings when serialising FastStrings (#25861, !14107).

Recompilation checking

  • Matthew overhauled the ModIface datatype, splitting it up in a more logical way which makes it easier to identify which parts contribute to recompilation checking (!14102). This allowed fixing several issues with recompilation checking in !14118, such as:

    • it ignored changes in exported named default declarations (#25855),
    • it did not take into account changes to COMPLETE pragmas (#25854).
  • Matthew added the -fwrite-if-self-recomp flag which controls whether to include self-recompilation information, which avoids writing recompilation information in cases such as producing binary distributions for which recompilation is not a concern (#10424, #22188, !8604).

  • Matthew refactored the implementation of recompilation-checking to ensure that all flags that influence recompilations are correctly taken into account (#25837, !14085).

  • Sam improved recompilation checking for export lists in !14178 (#25881). In practice, this means that modules with explicit import lists will no longer always trigger the recompilation of a module they depend on when that module’s export list changes, as long as the explicitly imported items are preserved.

  • Matthew improved the output of -dump-hi-diff to properly display the precise change in flags which caused recompilation (#25571, !13792).

Runtime system

  • Ben fixed a bug in which the WinIO I/O manager was being inconsistently selected (#25838, !14088).

  • Ben diagnosed and fixed a linking issue affecting global offset table usage on macOS that manifested in incorrect runtime results when using the GHC API (#25577, !13991).

  • Ben fixed an issue in which GHC’s RTS linker was too eager to load shared objects which refer to undefined symbols (#25943, !14290).

  • Ben significantly improved the performance of the RTS linker, culminating in a reduction in GHCi startup time from 2.5s to 250ms on Windows (#26052, #26009, !14339).

GHCi & bytecode interpreter

  • Andreas fixed several endianness issues in the interpreter (#25791, !14172).

  • Matthew implemented a fix for the mishandling of stack underflow frames (#25750, !13957). A remaining issue was subsequently identified (#25865) and fixed by Andreas’ work on the interpreter (!13978).

  • Matthew ensured that all top-level functions are visible when loading a module in the interpreter, not only exported functions (!14032).

  • Matthew fixed a bug in the simplifier that caused Core Lint failures when compiling certain programs (#25790, !14019).

  • Matthew fixed a regression in the way that GHCi would import modules that involved Cabal mixins stanzas (#25951, !14222).

Libraries

  • Ben exposed the constructors and fields of the Backtrace datatype in base (#26049, !14351).

  • Ben brought base changelog entries up to date in !14320.

Build system & packaging

  • Sam fixed GHC not working properly if the installation path contains spaces on Windows (#25204, !14137).

  • Ben fixed a couple of issues relating to the llvm-as flag:

    • the value of the field was incorrectly set (#25856, !14104),
    • the information in the field was passed incorrectly to clang (#25793, !14025).

Testsuite

  • Andreas fixed a bug in which tests requiring the interpreter would be run even if the compiler didn’t support it (#25533, !14201).

  • Matthew fixed an issue with tests that used Template Haskell in the profiled dynamic way (#25947, !14215).

Cabal

  • Mikolaj prepared the 3.14.2.0 bugfix release to the Cabal package suite (including the Cabal library and cabal-install).

  • Matthew fixed all known regressions in the 3.14.1.0 release of cabal-install:

    • Issue #10759 to do with picking up unwanted environment files #10828.
    • Duplication of environment variables (#10718, #10827).
    • Interaction of multi-repl with internal dependencies (#10775, #10841).
    • A working directory oversight (#10772, #10800).
    • The pkgname_datadir environment variable incorrectly using a relative path (#10717, #10830).
  • Matthew updated the outdated and gen-bounds commands to work with the v2- project infrastructure (#10878, #10840).

  • Matthew ensured that C++ environment variables are passed to configure scripts (#10797, #10844).

  • Matthew added a module name validity check to the cabal check command (#10295, #10816).

  • Matthew updated the Cabal CI to use GHC 9.12.2 and GHC 9.6.7 (#10893).

  • Matthew improved the testsuite output to make it more readable (#8419, #10837).

  • Matthew fixed an issue in which changes to the PATH environment variable would incorrectly not trigger recompilation (#2015, #10817).

HLS

  • Hannes prepared the HLS release 2.10.0.0 (#4448)

  • Zubin prepared the HLS release 2.11.0.0 (#4585)

  • Zubin added support for GHC 9.12.2 in HLS (#4527)

  • Zubin reworked the HLS release CI infrastructure (#4481)

Haskell.org infrastructure

Ben worked to refactor and migrate a variety of core haskell.org services from Equinix Metal to new infrastructure at OpenCape:

  • hoogle.haskell.org has been Nixified and now periodically reindexes automatically.

  • Haskell.org’s primary mail server, mail.haskell.org, has been Nixified and updated.

  • Haskell.org’s many mailing lists have been migrated to Mailman 3.

  • gitlab.haskell.org has been migrated to OpenCape and updated.

  • The Hackage documentation builder has been completely revamped with a more maintainable deployment strategy and a broader set of native packages available, enabling more Hackage packages to benefit from automatically-built documentation.

With these maintainability improvements we hope that haskell.org’s core infrastructure team can be more easily grown in the future.

by adam, andreask, ben, hannes, matthew, mikolaj, rodrigo, sam, zubin at June 20, 2025 12:00 AM

June 17, 2025

Magnus Therning

Why I'm writing a Redis client package

A couple of weeks ago I needed a small, hopefully temporary, service at work. It bridges a gap between the functionality provided by a legacy system and the functionality desired by a new system. The legacy system is cumbersome to work with, so we tend to prefer building anti-corruption layers rather than changing it directly, and sometimes we implement it as separate services.

This time it was good enough to run the service as a cronjob, but it did need to keep track of when it ran the last time. It felt silly to spin up a separate DB just to keep a timestamp, and using another service's DB is something I really dislike and avoid.1 So, I ended up using the Redis instance that's used as a cache by an OSS service we host.

The last time I had a look at the options for writing a Redis client in Haskell I found two candidates, hedis and redis-io. At the time I wrote a short note about them. This time around I found nothing much has changed: they are still the only two contenders and they still suffer from the same issues

  • hedis still has the same API, and I still find it awkward.
  • redis-io still requires a logger.

I once again decided to use hedis and wrote the service for work in a couple of days, but this time I thought I'd see what it would take to remove the requirement on tinylog from redis-io. I spent a few evenings on it, though I spent most time on "modernising" the dev setup, using Nix to build, re-format using fourmolu, etc. I did the same for redis-resp, the main dependency of redis-io. The result of that can be found on my gitlab account:

At the moment I won't take that particular experiment any further and given that the most recent change to redis-io was in 2020 (according to its git repo) I don't think there's much interest upstream either.

Making the changes to redis-io and redis-resp made me a little curious about the Redis protocol so I started reading about it. It made me start thinking about implementing a client lib myself. How hard could it be?

I'd also asked a question about Redis client libs on r/haskell and a response led me to redis-schema. It has a very good README, and its section on transactions makes the observation that Redis transactions are a perfect match for Applicative. This pushed me even closer to starting to write a client lib. What pushed me over the edge was the realisation that pipelining also is a perfect match for Applicative.

For the last few weeks I've spent some of my free time reading and experimenting and I'm enjoying it very much. We'll see where it leads, but hopefully I'll at least have a bit more to write about it.

Footnotes:

1

One definition of a microservice I find very useful is "a service that owns its own DB schema."

June 17, 2025 08:43 PM

June 16, 2025

Brent Yorgey

Monads are not like burritos


Posted on June 16, 2025

In January 2009, while just a baby first-year PhD student, I wrote a blog post titled Abstraction, intuition, and the “monad tutorial fallacy”. In it, I made the argument that humans tend to learn best by first grappling with concrete examples, and only later proceeding to higher-level intuition and analogies; hence, it’s a mistake to think that clearly presenting your intuition for a topic will help other people understand it. Analogies and intuition can help, but only when accompanied by concrete examples and active engagement. To illustrate the point, I made up a fictitious programmer with a fictitious analogy.

But now Joe goes and writes a monad tutorial called “Monads are Burritos,” under the well-intentioned but mistaken assumption that if other people read his magical insight, learning about monads will be a snap for them. “Monads are easy,” Joe writes. “Think of them as burritos.” Joe hides all the actual details about types and such because those are scary, and people will learn better if they can avoid all that difficult and confusing stuff. Of course, exactly the opposite is true, and all Joe has done is make it harder for people to learn about monads…

My intention was to choose a fictitious analogy which was obviously ridiculous and silly, as a parody of many of the monad tutorials which existed at the time (and still do). Mark Jason Dominus then wrote a blog post, Monads are like burritos, pointing out that actually, monads are kinda like burritos. It’s really funny, though I don’t think it’s actually a very good analogy, and my guess is that Mark would agree: it was clearly written as a silly joke and not as a real way to explain monads.

In any case, from that point the “monads are burritos” meme took on a life of its own. For example:

I even joined in the fun and made this meme image about bad monad tutorials:

Of course there are lots of people who still understand that it was all just a silly joke. Recently, however, I’ve seen several instances where people apparently believe “monads are burritos” is a real, helpful thing and not just a joke meme. For example, see this thread on lobste.rs, or this Mastodon post.

So, to set the record straight: “monads are burritos” is not a helpful analogy! (Yes, I am writing a blog post because People Are Wrong On The Internet, and I know it probably won’t make any difference, but here we are.)

Why not, you ask? To expand on my reasons from a 10-year-old Reddit comment:

  • The burrito analogy strongly implies that a value of type m a somehow “contains” a value (or values) of type a. But that is not true for all monads (e.g. there is no sense in which a value of type IO String contains a String).
  • Relatedly, the analogy also implies that a value of type m a can be “unwrapped” to get an a, but this is impossible for many monads.
  • It is not actually very easy to take a burrito containing a burrito and merge it into a single-level burrito. At least this is not in any sense a natural operation on burritos. Perhaps you could argue that it is always easy to remove outer tortilla layers (but not the innermost one since the food will all fall out), but this is a bad analogy, since in general join does not just “remove” an outer layer, but somehow merges the effects of two layers into one.

Actually, burritos are a great analogy for the Identity monad! …but not much beyond that.

On a more positive note, my sense is that the average pedagogical quality of Haskell materials, and monad tutorials in particular, has indeed gone up significantly since 2009. I’d love to think this can be at least partially attributed to my original blog post, though of course it’s impossible to know that for sure.


by Brent Yorgey at June 16, 2025 12:00 AM

June 15, 2025

Chris Reade

PenroseKiteDart User Guide

Introduction

(Updated June 2025 for PenroseKiteDart version 1.4)

PenroseKiteDart is a Haskell package with tools to experiment with finite tilings of Penrose’s Kites and Darts. It uses the Haskell Diagrams package for drawing tilings. As well as providing drawing tools, this package introduces tile graphs (Tgraphs) for describing finite tilings. (I would like to thank Stephen Huggett for suggesting planar graphs as a way to represent the tilings).

This document summarises the design and use of the PenroseKiteDart package.

The PenroseKiteDart package is now available on Hackage.

The source files are available on GitHub at https://github.com/chrisreade/PenroseKiteDart.

There is a small art gallery of examples created with PenroseKiteDart here.

Index

  1. About Penrose’s Kites and Darts
  2. Using the PenroseKiteDart Package (initial set up).
  3. Overview of Types and Operations
  4. Drawing in more detail
  5. Forcing in more detail
  6. Advanced Operations
  7. Other Reading

1. About Penrose’s Kites and Darts

The Tiles

In figure 1 we show a dart and a kite. All angles are multiples of 36^{\circ} (a tenth of a full turn). If the shorter edges are of length 1, then the longer edges are of length \phi, where \phi = (1+\sqrt{5})/2 is the golden ratio.

Figure 1: The Dart and Kite Tiles

Aperiodic Infinite Tilings

What is interesting about these tiles is:

It is possible to tile the entire plane with kites and darts in an aperiodic way.

Such a tiling is non-periodic and does not contain arbitrarily large periodic regions or patches.

The possibility of aperiodic tilings with kites and darts was discovered by Sir Roger Penrose in 1974. There are other shapes with this property, including a chiral aperiodic monotile discovered in 2023 by Smith, Myers, Kaplan, Goodman-Strauss. (See the Penrose Tiling Wikipedia page for the history of aperiodic tilings)

This package is entirely concerned with Penrose’s kite and dart tilings also known as P2 tilings.

In figure 2 we add a temporary green line marking purely to illustrate a rule for making legal tilings. The purpose of the rule is to exclude the possibility of periodic tilings.

If all tiles are marked as shown, then whenever tiles come together at a point, they must all be marked or must all be unmarked at that meeting point. So, for example, each long edge of a kite can be placed legally on only one of the two long edges of a dart. The kite wing vertex (which is marked) has to go next to the dart tip vertex (which is marked) and cannot go next to the dart wing vertex (which is unmarked) for a legal tiling.

Figure 2: Marked Dart and Kite

Correct Tilings

Unfortunately, having a finite legal tiling is not enough to guarantee you can continue the tiling without getting stuck. Finite legal tilings which can be continued to cover the entire plane are called correct and the others (which are doomed to get stuck) are called incorrect. This means that decomposition and forcing (described later) become important tools for constructing correct finite tilings.

2. Using the PenroseKiteDart Package

You will need the Haskell Diagrams package (See Haskell Diagrams) as well as this package (PenroseKiteDart). When these are installed, you can produce diagrams with a Main.hs module. This should import a chosen backend for diagrams such as the default (SVG) along with Diagrams.Prelude.

    module Main (main) where
    
    import Diagrams.Backend.SVG.CmdLine
    import Diagrams.Prelude

For Penrose’s Kite and Dart tilings, you also need to import the PKD module and (optionally) the TgraphExamples module.

    import PKD
    import TgraphExamples

Then to output the someExample figure

    fig::Diagram B
    fig = someExample

    main :: IO ()
    main = mainWith fig

Note that the token B is used in the diagrams package to represent the chosen backend for output. So a diagram has type Diagram B. In this case B is bound to SVG by the import of the SVG backend. When the compiled module is executed it will generate an SVG file. (See Haskell Diagrams for more details on producing diagrams and using alternative backends).

3. Overview of Types and Operations

Half-Tiles

In order to implement operations on tilings (decompose in particular), we work with half-tiles. These are illustrated in figure 3 and labelled RD (right dart), LD (left dart), LK (left kite), RK (right kite). The join edges where left and right halves come together are shown with dotted lines, leaving one short edge and one long edge on each half-tile (excluding the join edge). We have shown a red dot at the vertex we regard as the origin of each half-tile (the tip of a half-dart and the base of a half-kite).

Figure 3: Half-Tile pieces showing join edges (dashed) and origin vertices (red dots)

The labels are actually data constructors introduced with the parameterised data type HalfTile, which has an argument type (rep) to allow for more than one representation of the half-tiles.

    data HalfTile rep 
      = LD rep -- Left Dart
      | RD rep -- Right Dart
      | LK rep -- Left Kite
      | RK rep -- Right Kite
      deriving (Show,Eq)

Tgraphs

We introduce tile graphs (Tgraphs) which provide a simple planar graph representation for finite patches of tiles. For Tgraphs we first specialise HalfTile with a triple of vertices (positive integers) to make a TileFace such as RD(1,2,3), where the vertices go clockwise round the half-tile triangle starting with the origin.

    type TileFace  = HalfTile (Vertex,Vertex,Vertex)
    type Vertex    = Int  -- must be positive

The function

    makeTgraph :: [TileFace] -> Tgraph

then constructs a Tgraph from a TileFace list after checking the TileFaces satisfy certain properties (described below). We also have

    faces :: Tgraph -> [TileFace]

to retrieve the TileFace list from a Tgraph.

As an example, the fool (short for fool’s kite and also called an ace in the literature) consists of two kites and a dart (= 4 half-kites and 2 half-darts):

    fool :: Tgraph
    fool = makeTgraph [RD (1,2,3), LD (1,3,4)   -- right and left dart
                      ,LK (5,3,2), RK (5,2,7)   -- left and right kite
                      ,RK (5,4,3), LK (5,6,4)   -- right and left kite
                      ]

To produce a diagram, we simply draw the Tgraph

    foolFigure :: Diagram B
    foolFigure = draw fool

which will produce the diagram on the left in figure 4.

Alternatively,

    foolFigure :: Diagram B
    foolFigure = labelled drawj fool

will produce the diagram on the right in figure 4 (showing vertex labels and dashed join edges).

Figure 4: Diagram of fool without labels and join edges (left), and with (right)

When any (non-empty) Tgraph is drawn, a default orientation and scale are chosen based on the lowest numbered join edge. This is aligned on the positive x-axis with length 1 (for darts) or length \phi (for kites).

Tgraph Properties

Tgraphs are actually implemented as

    newtype Tgraph = Tgraph [TileFace]
                     deriving (Show)

but the data constructor Tgraph is not exported to avoid accidentally by-passing checks for the required properties. The properties checked by makeTgraph ensure the Tgraph represents a legal tiling as a planar graph with positive vertex numbers, and that the collection of half-tile faces are both connected and have no crossing boundaries (see note below). Finally, there is a check to ensure two or more distinct vertex numbers are not used to represent the same vertex of the graph (a touching vertex check). An error is raised if there is a problem.

Note: If the TileFaces are faces of a planar graph there will also be exterior (untiled) regions, and in graph theory these would also be called faces of the graph. To avoid confusion, we will refer to these only as exterior regions, and unless otherwise stated, face will mean a TileFace. We can then define the boundary of a list of TileFaces as the edges of the exterior regions. There is a crossing boundary if the boundary crosses itself at a vertex. We exclude crossing boundaries from Tgraphs because they prevent us from calculating relative positions of tiles locally and create touching vertex problems.

For convenience, in addition to makeTgraph, we also have

    makeUncheckedTgraph :: [TileFace] -> Tgraph
    checkedTgraph   :: [TileFace] -> Tgraph

The first of these (performing no checks) is useful when you know the required properties hold. The second performs the same checks as makeTgraph except that it omits the touching vertex check. This could be used, for example, when making a Tgraph from a sub-collection of TileFaces of another Tgraph.

Main Tiling Operations

There are three key operations on finite tilings, namely

    decompose :: Tgraph -> Tgraph
    force     :: Tgraph -> Tgraph
    compose   :: Tgraph -> Tgraph

Decompose

Decomposition (also called deflation) works by splitting each half-tile into either 2 or 3 new (smaller scale) half-tiles, to produce a new tiling. The fact that this is possible is used to establish the existence of infinite aperiodic tilings with kites and darts. Since our Tgraphs have abstracted away from scale, the result of decomposing a Tgraph is just another Tgraph. However if we wish to compare before and after with a drawing, the latter should be scaled by a factor 1/{\phi} = \phi - 1 times the scale of the former, to reflect the change in scale.

Figure 5: fool (left) and decompose fool (right)

We can, of course, iterate decompose to produce an infinite list of finer and finer decompositions of a Tgraph

    decompositions :: Tgraph -> [Tgraph]
    decompositions = iterate decompose
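For example, the following small sketch draws the third decomposition of fool (using fool from TgraphExamples):

    fool3Fig :: Diagram B
    fool3Fig = draw (decompositions fool !! 3)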

Force

Force works by adding any TileFaces on the boundary edges of a Tgraph which are forced. That is, where there is only one legal choice of TileFace addition consistent with the seven possible vertex types. Such additions are continued until either (i) there are no more forced cases, in which case a final (forced) Tgraph is returned, or (ii) the process finds the tiling is stuck, in which case an error is raised indicating an incorrect tiling. [In the latter case, the argument to force must have been an incorrect tiling, because the forced additions cannot produce an incorrect tiling starting from a correct tiling.]

An example is shown in figure 6. When forced, the Tgraph on the left produces the result on the right. The original is highlighted in red in the result to show what has been added.

Figure 6: A Tgraph (left) and its forced result (right) with the original shown red

Compose

Composition (also called inflation) is an opposite to decompose but this has complications for finite tilings, so it is not simply an inverse. (See Graphs, Kites and Darts and Theorems for more discussion of the problems). Figure 7 shows a Tgraph (left) with the result of composing (right) where we have also shown (in pale green) the faces of the original that are not included in the composition – the remainder faces.

Figure 7: A Tgraph (left) and its (part) composed result (right) with the remainder faces shown pale green

Under some circumstances composing can fail to produce a Tgraph because there are crossing boundaries in the resulting TileFaces. However, we have established that

  • If g is a forced Tgraph, then compose g is defined and it is also a forced Tgraph.

Try Results

It is convenient to use types of the form Try a for results where we know there can be a failure. For example, compose can fail if the result does not pass the connected and no crossing boundary check, and force can fail if its argument is an incorrect Tgraph. In situations when you would like to continue some computation rather than raise an error when there is a failure, use a try version of a function.

    tryCompose :: Tgraph -> Try Tgraph
    tryForce   :: Tgraph -> Try Tgraph

We define Try as a synonym for Either ShowS (which is a monad) in module Tgraph.Try.

    type Try a = Either ShowS a

(Note ShowS is String -> String). Successful results have the form Right r (for some correct result r) and failure results have the form Left (s<>) (where s is a String describing the problem as a failure report).

The function

    runTry:: Try a -> a
    runTry = either error id

will retrieve a correct result but raise an error for failure cases. This means we can always derive an error raising version from a try version of a function by composing with runTry.

    force = runTry . tryForce
    compose = runTry . tryCompose
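Going the other way, a caller can handle the failure case directly; for example, this small sketch falls back to the original Tgraph when forcing fails:

    forceOrKeep :: Tgraph -> Tgraph
    forceOrKeep g = either (const g) id (tryForce g)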

Elementary Tgraph and TileFace Operations

The module Tgraph.Prelude defines elementary operations on Tgraphs relating vertices, directed edges, and faces. We describe a few of them here.

When we need to refer to particular vertices of a TileFace we use

    originV :: TileFace -> Vertex -- the first vertex - red dot in figure 2
    oppV    :: TileFace -> Vertex -- the vertex at the opposite end of the join edge from the origin
    wingV   :: TileFace -> Vertex -- the vertex not on the join edge

A directed edge is represented as a pair of vertices.

    type Dedge = (Vertex,Vertex)

So (a,b) is regarded as a directed edge from a to b.

When we need to refer to particular edges of a TileFace we use

    joinE  :: TileFace -> Dedge  -- shown dotted in figure 2
    shortE :: TileFace -> Dedge  -- the non-join short edge
    longE  :: TileFace -> Dedge  -- the non-join long edge

which are all directed clockwise round the TileFace. In contrast, joinOfTile is always directed away from the origin vertex, so is not clockwise for right darts or for left kites:

    joinOfTile:: TileFace -> Dedge
    joinOfTile face = (originV face, oppV face)

In the special case that a list of directed edges is symmetrically closed [(b,a) is in the list whenever (a,b) is in the list] we can think of this as an edge list rather than just a directed edge list.

For example,

    internalEdges :: Tgraph -> [Dedge]

produces an edge list, whereas

    boundary :: Tgraph -> [Dedge]

produces single directions. Each directed edge in the resulting boundary will have a TileFace on the left and an exterior region on the right. The function

    dedges :: Tgraph -> [Dedge]

produces all the directed edges obtained by going clockwise round each TileFace so not every edge in the list has an inverse in the list.

Note: There is now a class HasFaces (introduced in version 1.4) which has instances for Tgraph and [TileFace], among others. This allows some generalisations. In particular, the more general types of the above three functions are now

    internalEdges :: HasFaces a => a -> [Dedge]
    boundary      :: HasFaces a => a -> [Dedge] 
    dedges        :: HasFaces a => a -> [Dedge]   

Patches (Scaled and Positioned Tilings)

Behind the scenes, when a Tgraph is drawn, each TileFace is converted to a Piece. A Piece is another specialisation of HalfTile using a two dimensional vector to indicate the length and direction of the join edge of the half-tile (from the originV to the oppV), thus fixing its scale and orientation. The whole Tgraph then becomes a list of located Pieces called a Patch.

    type Piece = HalfTile (V2 Double)
    type Patch = [Located Piece]

Piece drawing functions derive vectors for other edges of a half-tile piece from its join edge vector. In particular (in the TileLib module) we have

    drawPiece :: Piece -> Diagram B
    dashjPiece :: Piece -> Diagram B
    fillPieceDK :: Colour Double -> Colour Double -> Piece -> Diagram B

where the first draws the non-join edges of a Piece, the second does the same but adds a dashed line for the join edge, and the third takes two colours – one for darts and one for kites, which are used to fill the piece as well as using drawPiece.

Patch is an instance of class Transformable so a Patch can be scaled, rotated, and translated.

Vertex Patches

It is useful to have an intermediate form between Tgraphs and Patches, that contains information about both the location of vertices (as 2D points), and the abstract TileFaces. This allows us to introduce labelled drawing functions (to show the vertex labels) which we then extend to Tgraphs. We call the intermediate form a VPatch (short for Vertex Patch).

    type VertexLocMap = IntMap.IntMap (Point V2 Double)
    data VPatch = VPatch {vLocs :: VertexLocMap,  vpFaces::[TileFace]} deriving Show

and

    makeVP :: Tgraph -> VPatch

calculates vertex locations using a default orientation and scale.

VPatch is made an instance of class Transformable so a VPatch can also be scaled and rotated.

One essential use of this intermediate form is to be able to draw a Tgraph with labels, rotated but without the labels themselves being rotated. We can simply convert the Tgraph to a VPatch, and rotate that before drawing with labels.

    labelled draw (rotate someAngle (makeVP g))

We can also align a VPatch using vertex labels.

    alignXaxis :: (Vertex, Vertex) -> VPatch -> VPatch 

So if g is a Tgraph with vertex labels a and b we can align it on the x-axis with a at the origin and b on the positive x-axis (after converting to a VPatch), instead of accepting the default orientation.

    labelled draw (alignXaxis (a,b) (makeVP g))

Another use of VPatches is to share the vertex location map when drawing only subsets of the faces (see Overlaid examples in the next section).

4. Drawing in More Detail

Class Drawable

There is a class Drawable with instances Tgraph, VPatch, Patch. When the token B is in scope, standing for a fixed backend, we can assume

    draw   :: Drawable a => a -> Diagram B  -- draws non-join edges
    drawj  :: Drawable a => a -> Diagram B  -- as with draw but also draws dashed join edges
    fillDK :: Drawable a => Colour Double -> Colour Double -> a -> Diagram B -- fills with colours

where fillDK clr1 clr2 will fill darts with colour clr1 and kites with colour clr2 as well as drawing non-join edges.

These are the main drawing tools. However they are actually defined for any suitable backend b so have more general types.

(Update Sept 2024) As of version 1.1 of PenroseKiteDart, these will be

    draw ::   (Drawable a, OKBackend b) =>
              a -> Diagram b
    drawj ::  (Drawable a, OKBackend b) =>
              a -> Diagram b
    fillDK :: (Drawable a, OKBackend b) =>
              Colour Double -> Colour Double -> a -> Diagram b

where the class OKBackend is a check to ensure a backend is suitable for drawing 2D tilings with or without labels.

In these notes we will generally use the simpler description of types using B for a fixed chosen backend for the sake of clarity.

The drawing tools are each defined via the class function drawWith using Piece drawing functions.

    class Drawable a where
        drawWith :: (Piece -> Diagram B) -> a -> Diagram B
    
    draw = drawWith drawPiece
    drawj = drawWith dashjPiece
    fillDK clr1 clr2 = drawWith (fillPieceDK clr1 clr2)

To design a new drawing function, you only need to implement a function to draw a Piece (let us call it newPieceDraw)

    newPieceDraw :: Piece -> Diagram B

This can then be elevated to draw any Drawable (including Tgraphs, VPatches, and Patches) by applying the Drawable class function drawWith:

    newDraw :: Drawable a => a -> Diagram B
    newDraw = drawWith newPieceDraw

Class DrawableLabelled

Class DrawableLabelled is defined with instances Tgraph and VPatch, but Patch is not an instance (because a Patch does not retain vertex label information).

    class DrawableLabelled a where
        labelColourSize :: Colour Double -> Measure Double -> (Patch -> Diagram B) -> a -> Diagram B

So labelColourSize c m modifies a Patch drawing function to add labels (of colour c and size measure m). Measure is defined in Diagrams.Prelude with pre-defined measures tiny, verySmall, small, normal, large, veryLarge, huge. For most of our diagrams of Tgraphs, we use red labels and we also find small is a good default size choice, so we define

    labelSize :: DrawableLabelled a => Measure Double -> (Patch -> Diagram B) -> a -> Diagram B
    labelSize = labelColourSize red

    labelled :: DrawableLabelled a => (Patch -> Diagram B) -> a -> Diagram B
    labelled = labelSize small

and then labelled draw, labelled drawj, labelled (fillDK clr1 clr2) can all be used on both Tgraphs and VPatches, as well as (for example) labelSize tiny draw, or labelColourSize blue normal drawj.

Further drawing functions

There are a few extra drawing functions built on top of the above ones. The function smart is a modifier to add dashed join edges only when they occur on the boundary of a Tgraph

    smart :: (VPatch -> Diagram B) -> Tgraph -> Diagram B

So smart vpdraw g will draw dashed join edges on the boundary of g before applying the drawing function vpdraw to the VPatch for g. For example, the following all draw dashed join edges only on the boundary for a Tgraph g

    smart draw g
    smart (labelled draw) g
    smart (labelSize normal draw) g

When using labels, the function rotateBefore allows a Tgraph to be drawn rotated without rotating the labels.

    rotateBefore :: (VPatch -> a) -> Angle Double -> Tgraph -> a
    rotateBefore vpdraw angle = vpdraw . rotate angle . makeVP

So for example,

    rotateBefore (labelled draw) (90@@deg) g

makes sense for a Tgraph g. Of course if there are no labels we can simply use

    rotate (90@@deg) (draw g)

Similarly alignBefore allows a Tgraph to be aligned on the X-axis using a pair of vertex numbers before drawing.

    alignBefore :: (VPatch -> a) -> (Vertex,Vertex) -> Tgraph -> a
    alignBefore vpdraw (a,b) = vpdraw . alignXaxis (a,b) . makeVP

So, for example, if Tgraph g has vertices a and b, both

    alignBefore draw (a,b) g
    alignBefore (labelled draw) (a,b) g

make sense. Note that the following examples are wrong. Even though they type check, they re-orient g without repositioning the boundary joins.

    smart (labelled draw . rotate angle) g      -- WRONG
    smart (labelled draw . alignXaxis (a,b)) g  -- WRONG

Instead use

    smartRotateBefore (labelled draw) angle g
    smartAlignBefore (labelled draw) (a,b) g

where

    smartRotateBefore :: (VPatch -> Diagram B) -> Angle Double -> Tgraph -> Diagram B
    smartAlignBefore  :: (VPatch -> Diagram B) -> (Vertex,Vertex) -> Tgraph -> Diagram B

are defined using

    restrictSmart :: Tgraph -> (VPatch -> Diagram B) -> VPatch -> Diagram B

Here, restrictSmart g vpdraw vp uses the given vp for drawing boundary joins and drawing faces of g (with vpdraw) rather than converting g to a new VPatch. This assumes vp has locations for vertices in g.
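
For instance, a plausible definition of the first of these in terms of rotateBefore and restrictSmart (a sketch, not necessarily the library's exact code) is

    smartRotateBefore :: (VPatch -> Diagram B) -> Angle Double -> Tgraph -> Diagram B
    smartRotateBefore vpdraw angle g = rotateBefore (restrictSmart g vpdraw) angle g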

Overlaid examples (location map sharing)

The function

    drawForce :: Tgraph -> Diagram B

will (smart) draw a Tgraph g in red overlaid (using <>) on the result of force g as in figure 6. Similarly

    drawPCompose  :: Tgraph -> Diagram B

applied to a Tgraph g will draw the result of a partial composition of g as in figure 7. That is a drawing of compose g but overlaid with a drawing of the remainder faces of g shown in pale green.

Both these functions make use of sharing a vertex location map to get correct alignments of overlaid diagrams. In the case of drawForce g, we know that a VPatch for force g will contain all the vertex locations for g, since force only adds to a Tgraph (when it succeeds). So when constructing the diagram for g we can use the VPatch created for force g instead of starting afresh. Similarly, for drawPCompose g, the VPatch for g contains locations for all the vertices of compose g, so compose g is drawn using the VPatch for g instead of starting afresh.

The location map sharing is done with

    subVP :: VPatch -> [TileFace] -> VPatch

so that subVP vp fcs is a VPatch with the same vertex locations as vp, but replacing the faces of vp with fcs. [Of course, this can go wrong if the new faces have vertices not in the domain of the vertex location map so this needs to be used with care. Any errors would only be discovered when a diagram is created.]
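
As an illustration of this sharing, a sketch of drawForce along these lines (assuming an accessor faces :: Tgraph -> [TileFace] and diagrams' lc for line colour; the library's actual definition may differ) could be

    drawForce :: Tgraph -> Diagram B
    drawForce g = lc red (draw (subVP vpF (faces g))) <> draw vpF
      where vpF = makeVP (force g)  -- one location map shared by both drawings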

For cases where labels are only going to be drawn for certain faces, we need a version of subVP which also gets rid of vertex locations that are not relevant to the faces. For this situation we have

    restrictVP:: VPatch -> [TileFace] -> VPatch

which filters out un-needed vertex locations from the vertex location map. Unlike subVP, restrictVP checks for missing vertex locations, so restrictVP vp fcs raises an error if a vertex in fcs is missing from the keys of the vertex location map of vp.

5. Forcing in More Detail

The force rules

The rules used by our force algorithm are local and derived from the fact that there are seven possible vertex types as depicted in figure 8.

Figure 8: Seven vertex types

Our rules are shown in figure 9 (omitting mirror symmetric versions). In each case the TileFace shown yellow needs to be added in the presence of the other TileFaces shown.

Figure 9: Rules for forcing

Main Forcing Operations

To make forcing efficient we convert a Tgraph to a BoundaryState to keep track of boundary information of the Tgraph, and then calculate a ForceState which combines the BoundaryState with a record of awaiting boundary edge updates (an update map). Then each face addition is carried out on a ForceState, converting back when all the face additions are complete. It makes sense to apply force (and related functions) to a Tgraph, a BoundaryState, or a ForceState, so we define a class Forcible with instances Tgraph, BoundaryState, and ForceState.

This allows us to define

    force :: Forcible a => a -> a
    tryForce :: Forcible a => a -> Try a

The first will raise an error if a stuck tiling is encountered. The second uses a Try result, which produces a Left string for failures and a Right a for a successful result a.
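
So, treating Try a as an Either String a (as described; a hedged sketch), a caller can recover from a stuck tiling explicitly:

    -- keep the original Tgraph (with a warning) if forcing fails
    forceOrKeep :: Tgraph -> IO Tgraph
    forceOrKeep g = case tryForce g of
      Left report -> do putStrLn ("force failed: " ++ report); return g
      Right g'    -> return g'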

There are several other operations related to forcing including

    stepForce :: Forcible a => Int -> a -> a
    tryStepForce  :: Forcible a => Int -> a -> Try a

    addHalfDart, addHalfKite :: Forcible a => Dedge -> a -> a
    tryAddHalfDart, tryAddHalfKite :: Forcible a => Dedge -> a -> Try a

The first two force (up to) a given number of steps (i.e. face additions), and the other four add a half dart/kite on a given boundary edge.
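
For example, assuming Dedge is a directed vertex pair and (1,2) is a boundary edge of g (hypothetical values for illustration):

    -- add a half kite on boundary edge (1,2), then make up to 10 more face additions
    extend :: Tgraph -> Tgraph
    extend g = stepForce 10 (addHalfKite (1,2) g)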

Update Generators

An update generator is used to calculate which boundary edges can have a certain update. There is an update generator for each force rule, but also a combined (all update) generator. The force operations mentioned above all use the default all update generator (defaultAllUGen) but there are more general (with) versions that can be passed an update generator of choice. For example

    forceWith :: Forcible a => UpdateGenerator -> a -> a
    tryForceWith :: Forcible a => UpdateGenerator -> a -> Try a

In fact we defined

    force = forceWith defaultAllUGen
    tryForce = tryForceWith defaultAllUGen

We can also define

    wholeTiles :: Forcible a => a -> a
    wholeTiles = forceWith wholeTileUpdates

where wholeTileUpdates is an update generator that just finds boundary join edges to complete whole tiles.

In addition to defaultAllUGen there is also allUGenerator which does the same thing apart from how failures are reported. The reason for keeping both is that they were constructed differently and so are useful for testing.

In fact UpdateGenerators are functions that take a BoundaryState and a focus (list of boundary directed edges) to produce an update map. Each Update is calculated as either a SafeUpdate (where two of the new face edges are on the existing boundary and no new vertex is needed) or an UnsafeUpdate (where only one edge of the new face is on the boundary and a new vertex needs to be created for a new face).

    type UpdateGenerator = BoundaryState -> [Dedge] -> Try UpdateMap
    type UpdateMap = Map.Map Dedge Update
    data Update = SafeUpdate TileFace 
                | UnsafeUpdate (Vertex -> TileFace)

Completing (executing) an UnsafeUpdate requires a touching vertex check to ensure that the new vertex does not clash with an existing boundary vertex. Using an existing (touching) vertex would create a crossing boundary so such an update has to be blocked.

Forcible Class Operations

The Forcible class operations are higher order and designed to allow for easy additions of further generic operations. They take care of conversions between Tgraphs, BoundaryStates and ForceStates.

    class Forcible a where
      tryFSOpWith :: UpdateGenerator -> (ForceState -> Try ForceState) -> a -> Try a
      tryChangeBoundaryWith :: UpdateGenerator -> (BoundaryState -> Try BoundaryChange) -> a -> Try a
      tryInitFSWith :: UpdateGenerator -> a -> Try ForceState

For example, given an update generator ugen and any f :: ForceState -> Try ForceState, f can be generalised to work on any Forcible using tryFSOpWith ugen f. This is used to define both tryForceWith and tryStepForceWith.

We also specialize tryFSOpWith to use the default update generator

    tryFSOp :: Forcible a => (ForceState -> Try ForceState) -> a -> Try a
    tryFSOp = tryFSOpWith defaultAllUGen

Similarly, given an update generator ugen and any f :: BoundaryState -> Try BoundaryChange, f can be generalised to work on any Forcible using tryChangeBoundaryWith ugen f. This is used to define tryAddHalfDart and tryAddHalfKite.

We also specialize tryChangeBoundaryWith to use the default update generator

    tryChangeBoundary :: Forcible a => (BoundaryState -> Try BoundaryChange) -> a -> Try a
    tryChangeBoundary = tryChangeBoundaryWith defaultAllUGen

Note that the type BoundaryChange contains a resulting BoundaryState, the single TileFace that has been added, a list of edges removed from the boundary (of the BoundaryState prior to the face addition), and a list of the (3 or 4) boundary edges affected around the change that require checking or re-checking for updates.
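
A record along the following lines would fit that description (a sketch only: the field names here are illustrative assumptions, not the library's actual ones):

    data BoundaryChange = BoundaryChange
      { changedBState :: BoundaryState -- the resulting boundary state
      , addedFace     :: TileFace      -- the single face that was added
      , removedEdges  :: [Dedge]       -- edges removed from the previous boundary
      , recheckEdges  :: [Dedge]       -- the 3 or 4 affected edges needing (re)checking
      }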

The class function tryInitFSWith will use an update generator to create an initial ForceState for any Forcible. If the Forcible is already a ForceState it will do nothing. Otherwise it will calculate updates for the whole boundary. We also have the special case

    tryInitFS :: Forcible a => a -> Try ForceState
    tryInitFS = tryInitFSWith defaultAllUGen

Efficient chains of forcing operations

Note that (force . force) does the same as force, but we might want to chain other force related steps in a calculation.

For example, consider the following combination which, after decomposing a Tgraph, forces, then adds a half dart on a given boundary edge (d) and then forces again.

    combo :: Dedge -> Tgraph -> Tgraph
    combo d = force . addHalfDart d . force . decompose

Since decompose :: Tgraph -> Tgraph, the instances of force and addHalfDart d will have type Tgraph -> Tgraph, so each of these operations will begin and end with conversions between Tgraph and ForceState. We would do better to avoid these wasted intermediate conversions by working only with ForceStates, keeping the necessary conversions only at the beginning and end of the whole sequence.

This can be done using tryFSOp. To see this, let us first re-express the forcing sequence using the Try monad, so

    force . addHalfDart d . force

becomes

    tryForce <=< tryAddHalfDart d <=< tryForce

Note that (<=<) is the Kleisli arrow which replaces composition for Monads (defined in Control.Monad). (We could also have expressed this right-to-left sequence with a left-to-right version tryForce >=> tryAddHalfDart d >=> tryForce.) The definition of combo becomes

    combo :: Dedge -> Tgraph -> Tgraph
    combo d = runTry . (tryForce <=< tryAddHalfDart d <=< tryForce) . decompose

This has no performance improvement, but now we can pass the sequence to tryFSOp to remove the unnecessary conversions between steps.

    combo :: Dedge -> Tgraph -> Tgraph
    combo d = runTry . tryFSOp (tryForce <=< tryAddHalfDart d <=< tryForce) . decompose

The sequence actually has type Forcible a => a -> Try a, but when passed to tryFSOp it specialises to type ForceState -> Try ForceState. This ensures the sequence works on a ForceState, and any conversions are confined to the beginning and end of the sequence, avoiding unnecessary intermediate conversions.

A limitation of forcing

To avoid creating touching vertices (or crossing boundaries), a BoundaryState keeps track of the locations of boundary vertices. At around 35,000 face additions in a single force operation, the calculated positions of boundary vertices can become too inaccurate to prevent touching vertex problems. In such cases it is better to use

    recalibratingForce :: Forcible a => a -> a
    tryRecalibratingForce :: Forcible a => a -> Try a

These work by recalculating all vertex positions at 20,000-step intervals to get more accurate boundary vertex positions. For example, the Tgraph produced by 6 decompositions of the kingGraph has 2,906 faces. Applying force to this should result in 53,574 faces but will go wrong before it reaches that number. This can be fixed by calculating either

    recalibratingForce (decompositions kingGraph !!6)

or using an extra force before the decompositions

    force (decompositions (force kingGraph) !!6)

In the latter case, the final force only needs to add 17,864 faces to the 35,710 produced by decompositions (force kingGraph) !!6.

6. Advanced Operations

Guided comparison of Tgraphs

Asking if two Tgraphs are equivalent (the same apart from choice of vertex numbers) is an NP-complete problem. However, we do have an efficient guided way of comparing Tgraphs. In the module Tgraph.Rellabelling we have

    sameGraph :: (Tgraph,Dedge) -> (Tgraph,Dedge) -> Bool

The expression sameGraph (g1,d1) (g2,d2) asks if g2 can be relabelled to match g1 assuming that the directed edge d2 in g2 is identified with d1 in g1. Hence the comparison is guided by the assumption that d2 corresponds to d1.

It is implemented using

    tryRelabelToMatch :: (Tgraph,Dedge) -> (Tgraph,Dedge) -> Try Tgraph

where tryRelabelToMatch (g1,d1) (g2,d2) will either fail with a Left report if a mismatch is found when relabelling g2 to match g1 or will succeed with Right g3 where g3 is a relabelled version of g2. The successful result g3 will match g1 in a maximal tile-connected collection of faces containing the face with edge d1 and have vertices disjoint from those of g1 elsewhere. The comparison tries to grow a suitable relabelling by comparing faces one at a time starting from the face with edge d1 in g1 and the face with edge d2 in g2. (This relies on the fact that Tgraphs are connected with no crossing boundaries, and hence tile-connected.)
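
For example (again treating Try as an Either String, with the report describing the mismatch; a hedged sketch):

    checkMatch :: (Tgraph, Dedge) -> (Tgraph, Dedge) -> IO ()
    checkMatch gd1 gd2 = case tryRelabelToMatch gd1 gd2 of
      Left report -> putStrLn ("mismatch: " ++ report)
      Right _g3   -> putStrLn "matched: second Tgraph relabelled to agree with the first"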

The above function is also used to implement

    tryFullUnion:: (Tgraph,Dedge) -> (Tgraph,Dedge) -> Try Tgraph

which tries to find the union of two Tgraphs guided by a directed edge identification. However, there is an extra complexity arising from the fact that Tgraphs might overlap in more than one tile-connected region. After calculating one overlapping region, the full union uses some geometry (calculating vertex locations) to detect further overlaps.

Finally we have

    commonFaces:: (Tgraph,Dedge) -> (Tgraph,Dedge) -> [TileFace]

which will find common regions of overlapping faces of two Tgraphs guided by a directed edge identification. The resulting common faces will be a sub-collection of faces from the first Tgraph. These are returned as a list as they may not be a connected collection of faces and therefore not necessarily a Tgraph.

Empires and SuperForce

In Empires and SuperForce we discussed forced boundary coverings which were used to implement both a superForce operation

    superForce:: Forcible a => a -> a

and operations to calculate empires.

We will not repeat the descriptions here other than to note that

    forcedBoundaryECovering:: Tgraph -> [Tgraph]

finds boundary edge coverings after forcing a Tgraph. That is, forcedBoundaryECovering g will first force g, then (if that succeeds) find a collection of (forced) extensions to force g such that

  • each extension has the whole boundary of force g as internal edges.
  • each possible addition to a boundary edge of force g (kite or dart) has been included in the collection.

(Possible here means: not leading to a stuck Tgraph when forced.) There is also

    forcedBoundaryVCovering:: Tgraph -> [Tgraph]

which does the same except that the extensions have all boundary vertices internal rather than just the boundary edges.

Combinations and Explicitly Forced

We introduced a new type Forced (in v 1.3) to enable a Forcible to be explicitly labelled as being forced. For example

    forceF    :: Forcible a => a -> Forced a 
    tryForceF :: Forcible a => a -> Try (Forced a)
    forgetF   :: Forced a -> a

This allows us to restrict certain functions which expect a forced argument by making this explicit.

    composeF :: Forced Tgraph -> Forced Tgraph

The definition makes use of theorems established in Graphs, Kites and Darts and Theorems: composing a forced Tgraph does not require a check (for connectedness and no crossing boundaries), and the result is also forced. This can then be used to define efficient combinations such as

    compForce:: Tgraph -> Forced Tgraph      -- compose after forcing
    compForce = composeF . forceF

    allCompForce:: Tgraph -> [Forced Tgraph] -- iterated (compose after force) while not emptyTgraph
    maxCompForce:: Tgraph -> Forced Tgraph   -- last item in allCompForce (or emptyTgraph)

Tracked Tgraphs

The type

    data TrackedTgraph = TrackedTgraph
       { tgraph  :: Tgraph
       , tracked :: [[TileFace]] 
       } deriving Show

has proven useful in experimentation as well as in producing artwork with darts and kites. The idea is to keep a record of sub-collections of faces of a Tgraph when doing both force operations and decompositions. A list of the sub-collections forms the tracked list associated with the Tgraph. We make TrackedTgraph an instance of class Forcible by having force operations only affect the Tgraph and not the tracked list. The significant idea is the implementation of

    decomposeTracked :: TrackedTgraph -> TrackedTgraph

Decomposition of a Tgraph involves introducing a new vertex for each long edge and each kite join. These are then used to construct the decomposed faces. For decomposeTracked we do the same for the Tgraph, but when it comes to the tracked collections, we decompose them re-using the same new vertex numbers calculated for the edges in the Tgraph. This keeps a consistent numbering between the Tgraph and tracked faces, so each item in the tracked list remains a sub-collection of faces in the Tgraph.
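
For example, using the record constructor shown above, we can track all the original faces of a Tgraph through a decomposition (a sketch assuming an accessor faces :: Tgraph -> [TileFace]):

    -- track the original faces of g and decompose the whole thing consistently
    trackOriginal :: Tgraph -> TrackedTgraph
    trackOriginal g = decomposeTracked (TrackedTgraph { tgraph = g, tracked = [faces g] })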

The function

    drawTrackedTgraph :: [VPatch -> Diagram B] -> TrackedTgraph -> Diagram B

is used to draw a TrackedTgraph. It uses a list of functions to draw VPatches. The first drawing function is applied to a VPatch for any untracked faces. Subsequent functions are applied to VPatches for the tracked list in order. Each diagram is beneath later ones in the list, with the diagram for the untracked faces at the bottom. The VPatches used are all restrictions of a single VPatch for the Tgraph, so will be consistent in vertex locations. When labels are used, there are also drawTrackedTgraphRotated and drawTrackedTgraphAligned for rotating or aligning the VPatch prior to applying the drawing functions.
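
For example, to draw the untracked faces plainly with the first tracked sub-collection overlaid in red (a sketch using diagrams' lc for line colour):

    drawRedTracked :: TrackedTgraph -> Diagram B
    drawRedTracked = drawTrackedTgraph [draw, lc red . draw]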

Note that the result of calculating empires (see Empires and SuperForce) is represented as a TrackedTgraph. The result is actually the common faces of a forced boundary covering, but a particular element of the covering (the first one) is chosen as the background Tgraph with the common faces as a tracked sub-collection of faces. Hence we have

    empire1, empire2 :: Tgraph -> TrackedTgraph
    
    drawEmpire :: TrackedTgraph -> Diagram B

Figure 10 was also created using TrackedTgraphs.

Figure 10: Using a TrackedTgraph for drawing

7. Other Reading

Previous related blogs are:

  • Diagrams for Penrose Tiles – the first blog introduced drawing Pieces and Patches (without using Tgraphs) and provided a version of decomposing for Patches (decompPatch).
  • Graphs, Kites and Darts introduced Tgraphs. This gave more details of implementation and results of early explorations. (The class Forcible was introduced subsequently).
  • Empires and SuperForce – these new operations were based on observing properties of boundaries of forced Tgraphs.
  • Graphs, Kites and Darts and Theorems established some important results relating force, compose, decompose.

by readerunner at June 15, 2025 03:32 PM

June 11, 2025

Simon Marlow

Browsing Stackage with VS Code and Glean

Have you ever wished you could browse all the Haskell packages together in your IDE, with full navigation using go-to-definition and find-references? Here’s a demo of something I hacked together while at ZuriHac 2025 over the weekend:

In the previous post I talked about how to index all of Hackage (actually Stackage, strictly speaking, because it’s not in general possible to build all of Hackage together) using Glean. Since that post I made some more progress on the indexer:

  • The indexer now indexes types. You can see type-on-hover working in the demo. The types are similar to what you see in the Haddock-generated hyperlinked source, except that here it’s always using the type of the definition and not the type at the usage site, which might be more specific. That’s a TODO for later.

  • Fixed a bunch of things, enriched the index with details about constructors, fields and class methods, and made indexing more efficient.

The DB size including types is now about 850MB, and it takes just under 8 minutes on my 9-year-old laptop to index the nearly 3000 packages in my stackage LTS 21.21 snapshot. (Note: the figures here were updated on 12-06-2025 when I redid the measurements).

Hooking it up to VS Code

The architecture looks like this:

The LSP server is a modified version of static-ls, which is already designed to provide an LSP service based on static information. I just reimplemented a few of its handlers to make calls to Glass instead of the existing hie/hiedb implementations. You can see the changes on my fork of static-ls. Of course, these changes are still quite hacky and not suitable for upstreaming.

Glass is a “Language-agnostic Symbol Server”. Essentially it provides an API abstraction over Glean with operations that are useful for code navigation and search.

Where to next?

There remain a few issues to solve before this can be useful.

  • Make Glean more easily installable. There's a general consensus that cabal install glean would lower the barrier to entry significantly; in order to do this we need to build the folly dependency using Cabal.

  • Clean up and ship the LSP server, somehow. Once Glean is cabal-installable, we can depend on it from an LSP server package.

  • Think about continuous integration to build the Glean DB. Perhaps this can piggyback off the stackage CI infra? If we can already build a complete stackage snapshot, and Glean is easily installable, then indexing would be fairly straightforward. I’d love to hear suggestions on how best to do this.

And looking forwards a bit further:

  • Think about how to handle multiple package versions. There's no fundamental problem with indexing multiple package versions, except that Glass's SymbolID format currently doesn't include the package version, but that's easily fixable. We could for example build multiple stackage LTS instances and index them all in a single Glean DB. There would be advantages to doing this: if, for instance, there were packages in common between two Stackage instances, then the Glean DB would only contain a single copy. A lot of the type structure would be shared too.

  • Provide search functionality in the LSP. Glean can provide simple textual search for names, and with some work could also provide Hoogle-like type search.

  • Think about how to index local projects and local changes. Glean supports stacked and incremental DBs, so we could build a DB for a local project stacked on top of the full Stackage DB. You would be able to go-to-definition directly from a file in your project to the packages it depends on in Stackage. We could re-index new .hie files as they are generated, rather like how static-ls currently handles changes.

  • Integrate with HLS? Perhaps Glean could be used to handle references outside of the current project, switching seamlessly from GHC-based navigation to Glean-based navigation if you jump into a non-local package.

More use cases?

I talked with a few people at ZuriHac about potential use cases for Glean within the Haskell ecosystem. Using it in haskell.org came up a few times, as a way to power search, navigation and analysis. Also mentioned was the possibility of using it as a Hoogle backend. Potentially we could replace the Haddock-generated hyperlinked sources on haskell.org with a Glean-based browser, which would allow navigating links between packages and find-references.

Another use case that came up was the possibility of doing impact analysis for core library changes (or any API changes, really). Some of this is already possible using find-references, but more complex cases, such as finding instances that override certain methods, aren't possible yet until we extend the indexer to capture richer information.

If you’re interested in using Glean for something, why not jump on the Glean discord server and tell us about it!

June 11, 2025 12:00 AM

June 02, 2025

Edward Z. Yang

Vibe coding case study: ScubaDuck

A lot of strong engineers that I know haven't really taken a serious look at AI coding; they've used LLMs to ask questions or write simple scripts and appreciate that it is a useful tool, but haven't actually tried building a nontrivial application entirely from scratch in vibe coding style (here, I use the term in its original meaning: when you do AI coding without carefully reviewing the output). This is understandable: if you're not working on a green field project, there aren't that many opportunities to write code in this style--standard practice for established projects is that someone else needs to review all of the code you write: this is a bad match for vibe coding! So in this post, I want to give a concrete case study of a nontrivial system that was entirely vibe coded (ScubaDuck), to argue the following claims:

  1. AI coding can be done on a manager's schedule: you don't need continuous blocks of coding time and context-switching is considerably less harmful. ScubaDuck was implemented in three days of part time work, where all of the work happened when the baby was napping.
  2. AI coding substantially lowers the cost of doing projects in tech stacks you are less familiar with. ScubaDuck is mostly JavaScript UI code, which is not something I write on a day-to-day basis.
  3. AI coding is an unlock for "sidequests": support software that's ancillary to your main task that is nice to have, but not essential. If previously you would have decided the cost outweighed the benefit, AI coding reducing the cost means you should redo these calculations.
  4. Vibe coding works and can produce working software. ScubaDuck is an existence proof that vibe coding is a viable strategy for generating JavaScript UI code (NB: I don't claim vibe coding will work for all domains, nor do I claim this is the only domain for which it works. Hopefully you can also build some intuition for where it is more or less likely to work). You will not one-shot it (ScubaDuck was 150 prompts in the end), but if you are prompting the LLM to also generate tests, you can reliably fix issues without causing regressions to existing code.
  5. Vibe coding is good for situations where buggy software is low impact; be on the lookout for ways to engineer this sort of situation. ScubaDuck is a read-only interface, where the only downside to being buggy is you can't issue the queries you want to issue.

Update: You can see all of my prompts and the resulting agent trajectories at scubaduck-prompts.

What is ScubaDuck?

ScubaDuck is a discount implementation of Meta's internal Scuba realtime database system. You can read more about what exactly this is on GitHub, but it's not so important for the purposes of this post: the key details you need to know about ScubaDuck is that it consists of a Python server that exposes an API to perform queries against a DuckDB database, and an HTML and JavaScript frontend application which implements the forms for building these queries and rendering of the output data. Both the forms and output data rendering have nontrivial JavaScript enhancements: some form inputs are chip inputs and support autocomplete, and the time series view is an SVG chart. All of these components were coded from scratch, so the project has no third-party JavaScript dependencies.

So on the one hand, this project is pretty simple. There are no stringent performance or uptime requirements, and it's a pretty standard server-client program that the LLM has seen millions of times before (this is good!). On the other hand, the exact behavior of the frontend UI is quite intricate and would be very difficult to one-shot in a single prompt. Indeed, as I was coding and testing the application, I frequently ran into situations that I didn't anticipate in my original specification, and that I had to ask Codex to refine. Another way to put it is that ScubaDuck is a relatively simple functional specification (although this too was not one-shot), but I did a lot of polishing of small behaviors so that the interface behaved in the way that I expected Scuba to behave. Here, it was helpful that I had a very clear idea of what I wanted (since I've used Scuba quite a lot at work).

Going into ScubaDuck, I had a pretty good sense that this project should be a good fit for LLMs. HTML, JavaScript and Python are all extremely high resource languages, and I'd heard lots of people raving about how good LLMs were at transforming wireframes and mockups into fully functional websites. It is also fully self contained and straightforward-ish to test (only "ish" because you do have to use something like Playwright to actually test the frontend UI, which honestly is a slog. But fortunately, the LLM can write the tests for you!) One design decision I made, which I didn't originally anticipate but worked out in the end, was the decision to not use any third-party JavaScript libraries. This was by accident: Python has no native way of bundling third-party JavaScript, but I wanted the tool to work offline. I wasn't sure if you could vibe code an SVG charting library from scratch, but apparently you can and it's quite easy!

Agent setup

ScubaDuck was implemented with OpenAI Codex in the cloud (not the CLI tool). Codex's cloud offering requires you to initialize a hermetic environment which the coding agent can execute commands in. It's pretty well known now that AI coding agents work much better if they are able to run the code they write and see if it worked or not, so this is quite an important part of the process. Unfortunately, setting this up took some time-consuming trial and error. I had a fairly detailed initial prompt, and what I would do was submit it to Codex, watch it fail, read over the trajectory (the agent logs) to see what happened (Codex wanted to use npm! Codex couldn't download something from the internet! Codex tried to use a package that wasn't available!) and then fixed whatever environment misconfiguration had caused it to fail, or edited AGENTS.md to instruct it to not do some behavior. According to my history, the first day of the project was spent unsuccessfully trying to get the project set up, and my first successful Codex PR only happened on May 19.

At the end of setup, I had the following:

  1. A pyproject.toml with exactly the dependencies I wanted to be used (duckdb, flask and python-dateutil), a lockfile for it (since I was using uv) and my preferred configuration for various tools (pytest, ruff). I'm a big fan of pytest-xdist for vibe coded projects, since you can prompt the LLM to write tests that will work when run in parallel and it does a pretty good job at this. Later I'd also add a pyright configuration, though initially I left it out because I saw Codex doing some strange things on account of duckdb being untyped, and I didn't want to debug it at the time (the fix, by the way, is instructing the LLM to define stubs as necessary in this case.)
  2. An AGENTS.md file with some basic instructions to try to get Codex to stop doing things I saw it doing in the initial trajectories that I didn't want it to do. Nothing fancy, just if you see Codex do something bad, tell it not to do it in AGENTS.md. A good example of this is the "There are no nested AGENTS.md files, this is the only agents file": Codex is post-trained to look for nested AGENTS.md files, but you can save a few tool calls if you tell it there aren't any. (Note: folklore for Claude 3.7 is that instruction following for this sort of rules following was not great. Word on the street is that both Codex and Claude 4 are substantially better at this. Extra note: For uv users, another notable instruction in AGENTS.md is how to activate the venv, since at time of writing I couldn't get Codex to make this happen automatically.)
  3. A setup script for the environment. This took the most debugging, because Codex runs all Internet access through a proxy and sometimes it works imperfectly.

After I got my initial prompt to generate a first draft of the application, I was able to begin vibe coding in earnest.

The Human-Agent loop

The basic vibe coding loop works like this:

  1. Interact with the application and find things that are broken
  2. Prompt the LLM to fix them
  3. Repeat

For example, after the very first PR, some very mild poking around immediately revealed the bugs fixed in #2:

There's a race condition in the current test logic for matching against table contents in run_query. Specifically, if there were previously valid results in lastResults, and for some reason Dive doesn't do anything, then we will still see the old results. The testing framework should explicitly clear lastResults before attempting an interaction.

...and #3:

Filter functionality does not work. We will first add a failing test, and then fix it. The failing test should click "Add Filter", then select "user" as the field, and then add an "alice" chip (by typing alice in the text box and pressing ENTER). Then when we dive, we should see two alice rows. Right now, NO request is issued at all when we click Dive. Diagnose and then fix the problem.

Prompt the agent to write tests. It's very helpful to prompt the agent to generate tests for whatever bugs its fixing. For frontend code, I decided to use playwright to write these tests. An example in #11:

def test_header_and_tabs(page: Any, server_url: str) -> None:
    page.goto(server_url)
    page.wait_for_selector("#order_by option", state="attached")

    header = page.text_content("#header")
    assert "sample.csv" in header
    assert "events" in header

    assert page.is_visible("#settings")
    assert page.is_hidden("#columns")
    page.click("text=Columns")
    assert page.is_visible("#columns")
    cols = page.locator("#column_list li").all_inner_texts()
    assert "timestamp" in cols
    assert "event" in cols
    page.click("text=View Settings")
    assert page.is_visible("#settings")

    btn_color = page.evaluate("getComputedStyle(document.querySelector('#dive')).backgroundColor")
    assert "rgb(0, 128, 0)" == btn_color

    sidebar_overflow = page.evaluate("getComputedStyle(document.querySelector('#sidebar')).overflowY")
    view_overflow = page.evaluate("getComputedStyle(document.querySelector('#view')).overflowY")
    assert sidebar_overflow == 'auto'
    assert view_overflow == 'auto'

This is kind of a very manual and somewhat brittle test. I doubt I would have had the patience to manually write and debug this. But the LLM will happily do it. There is also a fear that the test might not actually be testing anything for real. But because these tests are all generated from bug reports, you can prompt the model to write the test first, check that it's failing, and then fix the problem. I didn't instruct this in the above prompt, but I did instruct it for #155, for example. The trajectory shows that the LLM demonstrates it can repro the bug before fixing the issue:

I’ll add a case-insensitive test for "samples" to check server behavior with lowercase input.

(scubaduck) root@5b69743466e4:/workspace/scubaduck# python - <<'EOF'
> import json
> from scubaduck import server
>
> app = server.app
> client = app.test_client()
> payload = {
>     "table": "events",
>     "start": "2024-01-01 00:00:00",
>     "end": "2024-01-03 00:00:00",
>     "graph_type": "table",
>     "order_by": "samples",  # lowercase
>     "order_dir": "DESC",
>     "limit": 10,
>     "columns": [],
>     "group_by": ["user"],
> }
> rv = client.post("/api/query", data=json.dumps(payload), content_type="application/json")
> print('status', rv.status_code)
> print('data', rv.get_json())
> EOF
status 400
data {'error': 'Unknown column: samples'}
(scubaduck) root@5b69743466e4:/workspace/scubaduck#

I’ve found the bug happens when "Samples" is lowercase. I’ll adjust the query to handle case insensitivity and add a test for it. Time to modify server.py to fix the issue!

Beware BC-breaking changes. The LLM generated tests worked well when I didn't need to change behavior in a BC-breaking way. Example prompt:

Change time column to default to none unconditionally. (I want to default it to a time column if we find a good candidate but this threshold needs tuning and I don't have the tuning right now.)

This hit a Codex timeout:

This attempt to update the default time column led to extensive frontend test failures that require deeper changes. The environment made it difficult to fully diagnose the Playwright test behaviors within the time available, so the work could not be completed.

In standard software engineering practice, when this happens, decouple the BC compatible and BC breaking changes!

Make it so that Time Column can be set to (none). When it is set this way, the Start/End fields are hidden and we don't apply a filter on time range. (#115)

and then later, instead of defaulting the time column to none, I added a heuristic to pick a column that looked like time, which happened to pick the same column that all of the existing tests already expected.

Refactors have to be split up. Codex's timeout means that you can't ask it to do too much in one go. Here's a prompt that timed out:

scubaduck/index.html has gotten a bit long. Let's split out some of the JS code into dedicated JS files for their functionality. Also setup the necessary Flask scaffolding to serve these JS files. I think splitting out these specific components would be good:

  • Dropdown implementation
  • Sidebar resizing
  • JS controlling the View Settings (e.g., updateDisplayTypeUI, as well as one off interactions on form elements, columns handling, filter handling, the actual Dive implementation (including query updating), reading in defaults from query string)
  • Table rendering (e.g., formatNumber, sorting)
  • Chip input implementation
  • Chart rendering (showTimeSeries)

Make changes to AGENTS.md or README.md describing the structure so you can quickly find where the components you need are

I eventually did manage the refactor by prompting Codex to individually move out the pieces I wanted to extract one-by-one. This is a place where I think Claude Code probably would have performed better.

Parallelizing tasks. As you can see from the lengths of my prompts, it does take a while to write a good prompt; you're basically writing a bug report with enough detail that the LLM can repro it and then fix it. So sometimes I would be bottlenecked on prompt writing. However, sometimes the prompts were quite short. In those cases, Codex encourages you to submit more tasks that can run in parallel. I found this worked well, and I'd sometimes have as many as five instances going (once again, rate limited by discovering problems, making designs and typing prompts!) One irritation is when the tasks end up conflicting with each other. Sometimes the conflicts are easy to fix, but if it feels nontrivial, it's often better to just ask Codex to redo one of the PRs on latest main after the other has landed. To avoid merge conflicts, it helps to have only one "main feature" agent going at any time, and then ask the agent to do random bugfixes in parallel with it. Once you have no more tasks to get running, you can go do something else while you wait for the agents to finish (manager schedule!)

Prompting

As a reminder, I've posted all of my prompts (including the ones that failed) at scubaduck-prompts, and I think it's helpful to skim through them to get a flavor of what I was asking the LLM. But to summarize, what did I spend most of my time on prompting Codex to do? My general vibe (ahem) is that I spent most of my time doing minor enhancements, where I instructed Codex to make some part of the program work slightly differently, in a way that was previously unspecified from the previous prompt. The metaphor I had in my head while I was working on the project was like that of a sculptor chiseling away marble: in the beginning, anything is possible, but as I kept prompting, I continuously narrowed down the space of possible programs I had until I had exactly the one I wanted. One big thing I want to note is that Codex rarely needed to make updates to my tests; for the most part, tests that were added never got taken away, because I never "changed my mind". I suspect that the vibe coding process would have been rockier if I was having to change behavior frequently.

One of the things that surprised me the most about the process was how easy it was to implement a line chart in SVG with Codex. My first prompt resulted in a chart that looked broken on the test data:

We're going to add a new View type, to go along with Samples and Table: Time Series. Time Series supports all the fields that Table supports, and a few more:

  • X-axis: Main group by dimension, e.g., the x-axis on time series view. This is our custom dropdown selector, but only time columns are populated here. It should prefer a default setting from the following list, most preferred first: "time", "timestamp"
  • Granularity: Choose the time interval between data points on the chart. For example, a granularity of 1 hour means there will be a data point every 60 minutes that is aggregated with the chosen Aggregate function over the data for the granularity period before point. This is a plain drop down. The valid values are: Auto, Fine, 1 second, 5 seconds, 10 seconds, 30 seconds, 1 minute, 4 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, 1 day, 1 week, 30 days. The semantics of the Auto setting is that it sets the interval to whatever would result in maximum 100 buckets (if there are not enough data points for that many buckets, it just picks the finest time interval that makes sense), and Fine which sets the interval to 500 buckets.
  • Fill Missing Buckets: This is a dropdown. For now, it has the settings "Fill with 0 (Per Series)" (default), "Connect (Per Series)" and "Leave blank".

Additionally, the default setting of Limit is 7, as it controls how many elements from group by will be plotted (the actual number of lines plotted could be a multiple of this, as we will plot every selected Column).

Unlike Samples and Table, we will instead display a line chart in the right panel. To plot the line chart, we will implement it by hand with JS and SVG, similar to how highcharts implements it. We will not use any third party dependencies. Lines will be plotted as paths, no smoothing, no dots for individual data points. Each series (as generated by group by) should be plotted with a different color, assigned using a best practices color palette for graph design. There should be a rendering of x-axis and y-axis; the x-axis should have slanted labels to aid readability. When we mouse over the chart, a vertical line should snap to the center of the time bucket that we are closest to. We should also display a crosshair on all of the series showing us their values at that data point, and highlight the closest point we are on, and increase the thickness of the series that point is on. To the left of the graph (still in the right panel), there should be a legend. The legend looks like this:

[GROUP BY VALUE] [AGGREGATE]
[First Column name, with series color]
[Number of samples for the first column]
[Second Column name, with series color]
[Number of samples for the second column]
... for all columns
----
... for all group by values (up to the limit)

So for example, if I group by user, I might see:

Alice AVG
value
4 (samples)

The highlighted series (which has a thicker line) should also be highlighted in the legend).

This was kind of terrifying, because I initially thought I didn't have a good way to test the SVG outputs. But after doing some regular old-fashioned debugging and reading the code (yes, this part not vibe coded), I figured out the problem, and also realized that Playwright can test that an SVG path is not just entirely straight. After the initial bugs were fixed, I mostly had to add missing features like x-axis/y-axis and interactivity features (amusingly, Codex ignored most of the instructions in the latter half of the prompt, giving only the barest bones legend. I suspect this was because I had some files which were too long). My general take after this was that JS chart libraries are going to become obsolete: it's much easier to vibe code a bespoke implementation and then customize the heck out of it.

Conclusion

ScubaDuck was implemented in about 150 Codex prompts. As you can see from the sample prompts above, the prompts are recognizably programming, they just happen to be in plain English language. This is a big help, because I never had to keep track of the nest of callbacks and state machines for implementing complex UI elements in JavaScript. I had to be fluent in what I wanted my program to do, and a good QA tester for the application to discover new problems that needed to be fixed, but I did not have to worry at all about the vagaries of SVG DOM elements or pixel position computation minutiae. It's hard to say how long it would have taken to code this by hand, but I think reproducing a UI that's been in production for years at Meta in three (part-time) days is pretty good!

Despite having done a bit of AI coding before, I also learned a bit from working on Codex. Codex made it blindingly clear that the parallel modality (and subsequent conflict resolution) is important. It made me adjust up my estimation of the capability of LLMs to write raw HTML/JS and evoked a future where people vibe code components in place of taking on a third party dependency. I was very appreciative of no rate limit Codex (though I doubt it's going to last.) It also reminded me how difficult it will be to setup agent environments for "real" projects (like PyTorch).

Hopefully, this case study has given you some ideas for things to try. Go forth and vibe code, responsibly!

by Edward Z. Yang at June 02, 2025 04:31 AM

Chris Penner

Building Industrial Strength Software without Unit Tests

I don't know about you, but testing isn't my favourite part of software development.

It's usually the last thing standing between me and shipping a shiny new feature, and writing tests is often an annoying process with a lot of boilerplate and fighting against your system to get your app into a good starting state for the test, or mocking out whichever services your app depends on.

Much ink has been spilled about how to organize your code in order to make this easier, but the fact that so many blog posts and frameworks exist for this express purpose suggests to me that we as a community of software developers haven't quite solved this issue yet.

Keep reading to see how I've solved this problem for myself by simply avoiding unit testing altogether.

An alternative testing method

When I first started at Unison Computing, I was submitting my first feature when I learned there were precious few unit tests. I found it rather surprising for the codebase of a compiler for a programming language! How do you prevent regressions without unit tests?

The answer is what the Unison team has dubbed transcript tests. These are a variation on the concept of golden-file tests.

A Unison transcript is a markdown file which explains in prose what behaviour it is going to test, then intersperses code-blocks which outline the steps involved in testing that feature using a mix of Unison code and UCM commands (UCM is Unison's CLI tool). After that comes the magic trick: UCM itself can understand and run these transcript files directly and record the results of each block.

When running a transcript file with the ucm transcript command UCM produces a deterministic output file containing the result of processing each code block. Unless the behaviour of UCM has changed since the last time it was run the resulting file will always be the same.

Each block in the markdown file is either a command, which is sent to the UCM shell tool, or it represents an update to a file on the (virtual) file-system, in which case it will be typechecked against the state of the codebase.

Here's a quick example of a transcript for testing UCM's view command so you can get a feel for it.

# Testing the `view` command

First, let's write a simple definition to view:

``` unison
isZero = cases
  0 -> true
  _ -> false
```

Now we add the definition to the codebase, and view it.

``` ucm
scratch/main> update
scratch/main> view isZero
```

We run this transcript file with ucm transcript my-transcript.md which produces the my-transcript.output.md file.

Notice how compiler output is added inline. Ignore the hashed names; they appear because I'm skipping the step which adds names for Unison's builtins.

# Testing the `view` command

First, let's write a simple definition to view:

``` unison
isZero = cases
  0 -> true
  _ -> false
```

``` ucm :added-by-ucm
  Loading changes detected in scratch.u.

  I found and typechecked these definitions in scratch.u. If you
  do an `add` or `update`, here's how your codebase would
  change:

    ⍟ These new definitions are ok to `add`:
    
      isZero : ##Nat -> ##Boolean
```

Now we add the definition to the codebase, and view it.

``` ucm
scratch/main> update

  Done.

scratch/main> view isZero

  isZero : ##Nat -> ##Boolean
  isZero = cases
    0 -> true
    _ -> false
```

Feel free to browse through the collection of transcripts we test in CI to keep UCM working as expected.

Testing in CI

Running transcript tests in CI is pretty trivial; we discover all markdown files within our transcript directory and run them all. After the outputs have been written we can use git diff --exit-code, which will fail with a non-zero code if any of the outputs have changed from what was committed. Conveniently, git will also report exactly what changed, and what the old output was.

This failure method allows the developer to know exactly which file has unexpected behaviour so they can easily re-run that file or recreate the state in their own codebase if they desire.

Transcript tests in other domains

I liked the transcript tests in UCM so much that when I was tasked with building out the Unison Share webapp I decided to use transcript-style testing for that too. Fast forward a few years and Unison Share is now a fully-featured package repository and code collaboration platform running in production without a single unit test.

If you're interested in how I've adapted transcript tests to work well for a webapp, I'll leave a few notes at the end of the post.

Benefits of transcript tests

Here's a shortlist of benefits I've found working with transcript tests over alternatives like unit tests.

You write a transcript using the same syntax as you'd interact with UCM itself.

This allows all your users to codify any buggy behaviour they've encountered into a deterministic transcript. Knowing exactly how to reproduce the behaviour your users are seeing is a huge boon, and having a single standardized format for accepting bug reports helps reduce a lot of the mental work that usually goes into reproducing bug reports from a variety of sources. This also means that the bug report itself can go directly into the test suite if we so desire.

All tests are written against the tool's external interface.

The tests use the same interface that the users of your software will employ, which means that internal refactors won't ever break tests unless there's a change in behaviour that's externally observable.

This has been a huge benefit for me personally. I'd often find myself hesitant to re-work code because I knew that at the end I'd be rewriting thousands of lines of tests. If you always have to rewrite your tests at the same time you've rewritten your code, how do you have any confidence that the tests still work as intended?

Updating tests is trivial

In the common case where transcripts are mismatched because some help message was altered, or perhaps the behaviour has changed but the change is intended, you don't need to rewrite any complex assertions, or mock out any new dependencies. You can simply look at the new output, and if it's reasonable you commit the changed transcript output files.

It can't be overstated how convenient this is when making sweeping changes, e.g. making changes to Unison's pretty printer. We don't need to manually update test-cases; we just run the transcripts locally and commit the output if it all looks good!

Transcript changes appear in PR reviews

Since all transcript outputs are committed, any change in behaviour will show up in the PR diff in an easy-to-read form. This allows reviewers to trivially see the old and new behaviour for each relevant feature.

Transcript tests are documentation

Each transcript shows how a feature is intended to be used by end-users.

Transcripts as a collaboration tool

When I'm implementing new features in Unison Share I need to communicate the shape of a JSON API with our Frontend designer Simon. Typically I'll just write a transcript test which exercises all possible variants of the new feature, then I can just point at the transcript output as the interface for those APIs.

It's beneficial for both of us since I don't need to keep an example up-to-date for him, and he knows that the output is actually accurate since it's generated from an execution of the service itself.

Transcript testing for Webapps

I've adapted transcript testing a bit for the Unison Share webapp. I run the standard Share executable locally with its dependencies mocked out via docker-compose. I've got a SQL file which resets the database with a known set of test fixtures, then I use a zsh script to reset my application state in between running each transcript.

Each transcript file is just a zsh script that interacts with the running server using a few bash functions which wrap curl commands and save the output to json files, which serve as the transcript output.

I've also got helpers for capturing specific fields from an API call into local variables which I can then interpolate into future queries. This is handy if you need to, for example, create a project, then switch it from private to public, then fetch that project via API.

Here's a small snippet from one of my transcripts for testing Unison Share's project APIs:

#!/usr/bin/env zsh

# Fail the transcript if any command fails
set -e

# Load utility functions and variables for user credentials
source "../../transcript_helpers.sh"

# Run a UCM transcript to upload some code to load in projects.
transcript_ucm transcript prelude.md

# I should be able to see the fixture project as an unauthenticated user.
fetch "$unauthenticated_user" GET project-get-simple '/users/test/projects/publictestproject'

# I should be able to create a new project as an authenticated user.
fetch "$transcripts_user" POST project-create '/users/transcripts/projects/containers' '{
    "summary": "This is my project",
    "visibility": "private",
    "tags": []
}'

fetch "$transcripts_user" GET project-list '/users/transcripts/projects'

You can see the output files generated by the full transcript in this directory.

Requirements of a good transcript testing tool

After working with two different transcript testing tools across two different apps I've got a few criteria for what makes a good transcript testing tool. If you're thinking of adding transcript tests to your app, consider the following:

Transcripts should be deterministic

This is critical. Transcripts are only useful if they produce the same result on every run, on every operating system, at every time of day.

You may need to make a few changes in your app to adapt or remove randomness, at least when in the context of a transcript test.

In Share there were a lot of timestamps, random IDs, and JWTs (which contain a timestamp). The actual values of these weren't important for the tests themselves, so I solved the issue by piping the curl output through a sed script before writing to disk. The script matches timestamps, UUIDs, and JWTs and replaces them with placeholders like <TIMESTAMP>, <UUID>, and <JWT> accordingly.
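
As a toy illustration of that scrubbing step, here is a minimal Haskell sketch (the real pipeline is a sed script; the regexes below are simplified and the JWT pattern is omitted):

import Text.Regex (mkRegex, subRegex)  -- from the regex-compat package

-- Replace ISO-8601 timestamps and UUIDs with stable placeholders
-- before the transcript output is written to disk.
scrub :: String -> String
scrub = replaceAll uuidRe "<UUID>" . replaceAll timeRe "<TIMESTAMP>"
  where
    replaceAll pat sub s = subRegex (mkRegex pat) s sub
    timeRe = "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?Z?"
    uuidRe = "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"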

A special mode in your app for transcript testing which avoids randomness can be useful, but use custom modes sparingly, lest your app's behaviour differ so much during transcripts that you're no longer testing the real thing.

I also make sure that the data returned by APIs is always sorted by something other than randomized IDs. It's a small price to pay, and it reduces randomness and heisenbugs in the app as a helpful byproduct.

Transcripts should be isolated

Each individual transcript should be run in its own pristine environment. Databases should be reset to a known state; if the file-system is used, it should be cleared or, even better, a virtual file-system should be used.

Transcripts should be self-contained

Everything that pertains to a given test-case's state or configuration should be evident from within the transcript file itself. I've found that changes in behaviour from the file's location or name can just end up being confusing.

Difficulties working with Transcripts

Transcripts often require custom tooling

In UCM's case the transcript tooling has evolved slowly over many years; it has its own parser, and you can even test UCM's API server by using special code blocks.

Share has a variety of zsh utility scripts which provide helpers for fetching endpoints using curl, and filtering output to capture data for future calls. It also has a few tools for making database calls and assertions.

Don't shy away from investing a bit of time into making transcript testing sustainable and pleasant; it will pay dividends down the road.

Intensive Setup

As opposed to unit tests, which are generally pretty lightweight, transcript tests are full integration tests; they require setting up data, and sometimes executing entire flows, so that we can get the system into a good state for testing each feature.

You can mitigate the setup time by testing multiple features with each transcript.

I haven't personally found transcript tests to take too much time in CI, largely because I think transcript testing tends to produce fewer tests, but of higher value than unit testing. I've seen many unit test suites bogged down by particular unit tests which generate hundreds of test cases that aren't actually providing real value. Also, any setup/teardown is going to be more costly on thousands of unit-tests as compared to dozens or hundreds of transcript tests.

Service Mocking

Since transcript tests run against the system-under-test's external interface, you won't have traditional mocking/stubbing frameworks available to you. Instead, you'll mock out the system's dependencies by specifying custom services using environment variables, or wiring things up in docker-compose.

Most systems have a setup for local development anyways, so integrating transcript tests against it has the added benefit that they'll ensure your local development setup is tested in CI, is consistent for all members of your team, and continues to work as expected.

In Summary

Hopefully this post has helped you to consider your relationship with unit tests and perhaps think about whether other testing techniques may work better for your app.

Transcript tests surely aren't ideal for all possible apps or teams, but my last few years at Unison have proven to me that tests can be more helpful, efficient, and readable than I'd previously thought possible.

Let me know how it works out for you!

Hopefully you learned something! Did you know I'm currently writing a book? It's all about Lenses and Optics! It takes you all the way from beginner to optics-wizard and it's currently in early access! Consider supporting it, and more posts like this one, by pledging on my Patreon page! It takes quite a bit of work to put these things together. If I managed to teach you something, or even just entertain you for a minute or two, maybe send a few bucks my way for a coffee? Cheers!


June 02, 2025 12:00 AM

May 30, 2025

Haskell Interlude

65: Andy Gordon

Andy Gordon from Cogna is interviewed by Sam and Matti. We learn about Andy’s influential work including the origins of the bind symbol in Haskell, and the introduction of lambdas in Excel. We go on to discuss his current work at Cogna on using AI to allow non-programmers to write apps using natural language. We delve deeper into the ethics of AI and consider the most likely AI apocalypse.

by Haskell Podcast at May 30, 2025 02:00 PM

May 28, 2025

Chris Smith 2

Threshold Strategy in Approval and Range Voting

How to turn polling insight into an optimal ballot — and why anything else is wasted.

“approve of”? What does that mean anyway?

I have written previously about how approval and range voting methods are intrinsically tactical. This doesn’t mean that they are more tactical than other election systems (nearly all of which are shown to sometimes be tactical by Gibbard’s Theorem when there are three or more options). Rather, it means that tactical voting is unavoidable. Voting in such a system requires answering the question of where to set your approval threshold or how to map your preferences to a ranged voting scale. These questions don’t have more or less “honest” answers. They are always tactical choices.

But I haven’t dug deeper into what these tactics look like. Here, I’ll do the mathematical analysis to show what effective voting looks like in these systems, and make some surprising observations along the way.

Mathematical formalism for approval voting

We’ll start by assuming an approval election, so the question is where to put your threshold. At what level of approval do you switch from voting not to approve a candidate to approving them?

We’ll keep the notation minimal:

  • As is standard in probability, I’ll write ℙ[X] for the probability of an event X, and 𝔼[X] for the expected value of a (numerical) random variable X.
  • I will use B to refer to a random collection (multiset) of ballots, drawn from some probability distribution reflecting what we know from polling and other information sources on other voters. B will usually not include the approval vote that you’re considering casting, and to include that approval, we’ll write B ∪ {c}, where c is the candidate you contemplate approving.
  • I’ll write W(·) to indicate the winner of an election with a given set of ballots. This is the candidate with the most approvals. We’ll assume some tiebreaker is in place that’s independent of individual voting decisions; for instance, candidates could be shuffled into a random order before votes are cast, and in the event of a tie for number of approvals, we’ll pick the candidate who comes first in that shuffled order.
  • U(·) will be your utility function, so U(c) is the utility (i.e., happiness, satisfaction, or perceived social welfare) that you personally will get from candidate c winning the election. This doesn’t mean you have to be selfish, per se, as accomplishing some altruistic goal is still a form of utility, but we evaluate that utility from your point of view even though other voters may disagree.

With this notation established, we can clearly state, almost tautologically, when you should approve of a candidate c. You should approve of c whenever:

𝔼[U(W(B ∪ {c}))] > 𝔼[U(W(B))]

That’s just saying you should approve of c if your expected utility from the election with your approval of c is more than your utility without it.

The role of pivotal votes and exact strategy

This inequality can be made more useful by isolating the circumstances in which your vote makes a difference in the outcome. That is, W(B ∪ {c}) ≠ W(B). Non-pivotal votes contribute zero to the net expectation, and can be ignored.

In approval voting, approving a candidate can only change the outcome by making that candidate the winner. This means a pivotal vote is equivalent to both of:

  • W(B ∪ {c}) = c
  • W(B) ≠ c

It’s useful to have notation for this, so we’ll define V(B, c) to mean that W(B ∪ {c}) ≠ W(B), or equivalently, that W(B ∪ {c}) = c and W(B) ≠ c. To remember this notation, recall that V is the pivotal letter in the word “pivot�, and also visually resembles a pivot.

With this in mind, the expected gain in utility from approving c is:

  • 𝔼[U(W(B ∪ {c}))] - 𝔼[U(W(B))]. But since the utility gain is zero except for pivotal votes, this is the same as
  • ℙ[V(B, c)] · (𝔼[U(W(B ∪ {c})) | V(B, c)] - 𝔼[U(W(B)) | V(B, c)]). And since V(B, c) implies that W(B ∪ {c}) = c, this simplifies to
  • ℙ[V(B, c)] · (U(c) - 𝔼[U(W(B)) | V(B, c)])

Therefore, you ought to approve of a candidate c whenever

U(c) > 𝔼[U(W(B)) | V(B, c)]

This is much easier to interpret. You should approve of a candidate c precisely when the utility you obtain from c winning is greater than the expected utility in cases where c is right on the verge of winning (but someone else wins instead).

There are a few observations worth making about this:

  • The expectation clarifies why the threshold setting part of approval voting is intrinsically tactical. It involves evaluating how likely each other candidate is to win, and using that information to compute an expectation. That means advice to vote only based on internal feelings like whether you consider a candidate acceptable is always wrong. An effective vote takes into account external information about how others are likely to vote, including polling and understanding of public opinion and mood.
  • The conditional expectation, assuming V(B, c), tells us that the optimal strategy for whether to approve of some candidate c depends on the very specific situation where c is right on the verge of winning the election. If c is a frontrunner in the election, this scenario isn’t likely to be too different from the general case, and the conditional probability doesn’t change much. However, if c is a long-shot candidate from some minor party, but somehow nearly ties for a win, we’re in a strange situation indeed: perhaps a major last-minute scandal, a drastic polling error, or a fundamental misunderstanding of the public mood. Here, the conditional expected utility of an alternate winner might be quite different from your unconditional expectation. If, say, voters prove to have an unexpected appetite for extremism, this can affect the runners-up, as well.
  • Counter-intuitively, an optimal strategy might even involve approving some candidates that you like less than others that you leave unapproved! This can happen because different candidates are evaluated against different thresholds. Therefore, a single voter’s best approval ballot isn’t necessarily monotonic in their utility rankings. This adds a level of strategic complexity I hadn’t anticipated in my earlier writings on strategy in approval voting.

Approximate strategy

The strategy described above is rigorously optimal, but not at all easy to apply. Imagining the bizarre scenarios in which each candidate, no matter how minor, might tie for a win, is challenging to do well. We’re fortunate, then, that there’s a good approximation. Remember that the utility gain from approving a candidate was equal to

ℙ[V(B, c)] · (U(c) - 𝔼[U(W(B)) | V(B, c)])

In precisely the cases where V(B, c) is a bizarre assumption that’s difficult to imagine, we’re also multiplying by ℙ[V(B, c)], which is vanishingly small, so this vote is very unlikely to make a difference in the outcome. For front-runners, who are relatively much more likely to be in a tie for the win, the conditional probability changes a lot less: scenarios that end in a near-tie are not too different from the baseline expectation.

This happens because ℙ[V(B, c)] falls off quite quickly indeed as the popularity of c decreases, especially for large numbers of voters. For a national scale election (say, about 10 million voters), if c expects around 45% of approvals, then ℙ[V(B, c)] is around one in a million. That’s a small number, telling us that very large elections aren’t likely to be decided by a one-vote margin anyway. But it’s gargantuan compared to the number if c expects only 5% of approvals. Then ℙ[V(B, c)] is around one in 10^70. That’s about one in a quadrillion-vigintillion, if you want to know, and near the scale of possibly picking one atom at random from the entire universe! The probability of casting a pivotal vote drops off exponentially, and by this point it’s effectively zero.

With that in mind, we can drop the condition on the probability in the second term, giving us a new rule: Approve of a candidate c any time that:

U(c) > 𝔼[U(W(B))]

That is, approve of any candidate whose win you would like better than you expect to like the outcome of the election. In other words, imagine you have no other information on election night, and hear that this candidate has won. If this would be good news, approve of the candidate on your ballot. If it would be bad news, don’t.

  • This rule is still tactical. To determine how much you expect to like the outcome of the election, you need to have beliefs about who else is likely to win, which still requires an understanding of polling and public opinion and mood.
  • However, there is one threshold, derived from real polling data in realistic scenarios, and you can cast your approval ballot monotonically based on that single threshold.

This is no longer a true optimal strategy, but with enough voters, the exponential falloff in ℙ[V(B, c)] as c becomes less popular is a pretty good assurance that the incorrect votes you might cast by using this strategy instead of the optimal ones are extremely unlikely to matter. In practice, this is probably the best rule to communicate to voters in an approval election with moderate to large numbers of voters.

We can get closer with the following hypothetical: Imagine that on election night, you have no information on the results except for a headline that proclaims: Election Too Close To Call. With that as your prior, you ask of each candidate: is it good or bad news to hear now that this candidate has won? If it would be good news, then you approve of them. This still leaves one threshold, but we’re no longer making the leap that the pivotal condition for front-runners is unnecessary; we’re imagining a world in which at least some candidates, almost surely the front-runners, are tied. If this changes your decision (which it likely would only in very marginal cases), you can use this more accurate approximation.
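
To make the shortcut rule concrete, here is a minimal Haskell sketch (not from the original analysis; the candidates, utilities, and win probabilities below are invented, with the probabilities standing in for what you would estimate from polling):

-- All names and numbers here are hypothetical.
data Candidate = Candidate
  { name    :: String
  , utility :: Double  -- U(c): your utility if this candidate wins
  , winProb :: Double  -- your polling-based estimate that c wins
  }

-- 𝔼[U(W(B))]: how much you expect to like the outcome of the election.
expectedOutcome :: [Candidate] -> Double
expectedOutcome cs = sum [utility c * winProb c | c <- cs]

-- Approve exactly the candidates whose victory would be good news.
approvals :: [Candidate] -> [String]
approvals cs = [name c | c <- cs, utility c > expectedOutcome cs]

main :: IO ()
main = print (approvals field)  -- prints ["A","C"]
  where
    field =
      [ Candidate "A" 1.0 0.45
      , Candidate "B" 0.4 0.45
      , Candidate "C" 0.9 0.10
      ]

Here 𝔼[U(W(B))] = 0.45 + 0.18 + 0.09 = 0.72, so you approve the frontrunner A and the long-shot C, whose wins would both be good news, but not B.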

Reducing range to approval voting

I promised to look at strategy for range voting, as well. Armed with an appreciation of approval strategy, it’s easy to extend this to an optimal range strategy, as well, for large-scale elections.

The key is to recognize that a range voting election with options 0, 1, 2, …, n is mathematically equivalent to an approval election where everyone is just allowed to vote n times. The number you mark on the range ballot can be interpreted as saying how many of your approval ballots you want to mark as approving that candidate.

Looking at it this way presents the obvious question: why would you vote differently on some ballots than others? In what situation could that possibly be the right choice?

  • For small elections, say if you’re voting on places to go out and eat with your friends or coworkers, it’s possible that adding in a handful of approvals materially changes the election so that the optimal vote is different. Then it may well be optimal to cast a range ballot using some intermediate number.
  • For large elections, though, you’re presented with pretty much exactly the same question each time, and you may as well give the same answer. Therefore, in large-scale elections, the optimal way to vote with a range ballot is always to rate everyone either the minimum or maximum possible score. This reduces a range election exactly to an approval election. The additional expressiveness of a range ballot is a siren call: by using it, you always vote less effectively than you would have by ignoring it and using only the two extreme choices.

Since we’re discussing political elections, which have relatively large numbers of voters, this answers the question for range elections, as well: Rate a candidate the maximum score if you like them better than you expect to like the outcome of the election. Otherwise, rate them the minimum score.
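
Continuing the sketch above (and reusing its Candidate type and expectedOutcome), the optimal large-election range ballot is then all-or-nothing, where maxScore is the hypothetical ballot's top score, e.g. 5:

-- Give the maximum score to every candidate you would approve,
-- and the minimum score to everyone else.
rangeScore :: Int -> [Candidate] -> Candidate -> Int
rangeScore maxScore cs c
  | utility c > expectedOutcome cs = maxScore
  | otherwise                      = 0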

Summing it up

What we’ve learned, then, is that optimal voting in approval or range systems boils down to two nested rules.

  • Exact rule (for the mathematically fearless): approve c iff U(c) > 𝔼[ U(W(B)) | your extra vote for c is pivotal ]. This Bayesian test weighs each candidate against the expected utility in the razor-thin worlds where they tie for first.
  • Large-electorate shortcut (for everyone else): because those pivotal worlds become astronomically rare as the field grows, the condition shrinks to a single cutoff: approve (or give a maximum score) to every candidate whose victory you expect to enjoy more than you expected to like the result. (If you can, imagine only cases where you know the election is close.)

We’ve seen why the first rule is the gold standard; but the second captures virtually all of its benefit when millions are voting. Either way, strategy is inseparable from sincerity: you must translate beliefs about polling into a utility threshold, and then measure every candidate against it. We’ve also seen by a clear mathematical equivalence why range ballots add no real leverage in large-scale elections, instead only offering false choices that are always wrong.

The entire playbook fits on a sticky note: compute the threshold, vote all-or-nothing, and let the math do the rest.

by Chris Smith at May 28, 2025 08:25 PM

May 25, 2025

Mark Jason Dominus

Mystery of the quincunx's missing quincunx

A quincunx is the X-shaped pattern of pips on the #5 face of a die.

A square with five dots arranged in an X

It's so-called because the Romans had a common copper coin called an as, and it was divided (monetarily, not physically) into twelve uncia. There was a bronze coin worth five uncia called a quīncunx, which is a contraction of quīnque (“five”) + uncia, and the coin had that pattern of dots on it to indicate its value.

Uncia generally meant a twelfth of something. It was not just a twelfth of an as, but also a twelfth of a pound, which is where we get the word “ounce”, and a twelfth of a foot, which is where we get the word “inch”.

The story I always heard about the connection between the coin and the X-shaped pattern of dots was the one that is told by Wikipedia:

Its value was sometimes represented by a pattern of five dots arranged at the corners and the center of a square, like the pips of a die. So, this pattern also came to be called quincunx.

Or the Big Dictionary:

… [from a] coin of this value (occasionally marked with a pattern resembling the five spots on a dice cube),…

But today I did a Google image search for quincunxes. And while most had five dots, I found not even one that had the dots arranged in an X pattern.

Pictures of the two sides of an ancient, corroded, worn, weathered coin.  Each one has a four-armed cross whose arms have crossbars at the ends, and the one on the right also has five dots.  The dots are in a cluster in the space between the cross's lower and left arms, and are arranged in a row of three and, closer to the center, a row of two.

Another cruddy coin. The obverse shows the head of a person, probably Minerva, wearing a plumed helmet. Above the head is a row of five dots.

This coin is covered with green oxide.  The obverse is another helmeted Minerva, surmounted by a horizontal row of five dots.  The reverse has a picture of an owl, and, on the right, a column of five dots.

(I believe the heads here are Minerva, goddess of wisdom. The owl is also associated with Minerva.)

Where's the quincunx that actually has a quincuncial arrangement of dots? Nowhere to be found, it seems. But everyone says it, so it must be true.

Addenda

  • The first common use of “quincunx” as an English word was to refer to trees that were planted in a quincuncial pattern, although not necessarily in groups of exactly five, in which each square of four trees had a fifth at its center.

  • Similarly, the Galton Box has a quincuncial arrangement of little pegs. Galton himself called it a “quincunx”.

  • The OED also offers this fascinating aside:

    Latin quincunx occurs earlier in an English context. Compare the following use apparently with reference to a v-shaped figure:

    1545 Decusis, tenne hole partes or ten Asses...It is also a fourme in any thynge representyng the letter, X, whiche parted in the middel, maketh an other figure called Quincunx, V.

    which shows that for someone, a quincuncial shape was a V and not an X, presumably because V is the Roman numeral for five.

    A decussis was a coin worth not ten uncia but ten asses, and it did indeed have an X on the front. A five-as coin was a quincussis and it had a V. I wonder if the author was confused?

    The source is Bibliotheca Eliotæ. The OED does not provide a page number.

  • It wasn't until after I published this that I realized that today's date was the extremely quincuncial 2025-05-25. I thank the gods of chance and fortune for this little gift.

by Mark Dominus (mjd@plover.com) at May 25, 2025 11:00 PM

May 24, 2025

Mark Jason Dominus

The fivefold symmetry of the quince

The quince is so-named because, like other fruits in the apple family, it has a natural fivefold symmetry:

several greenish-yellow quinces. They are like shiny pears, but less elongated.  In the foreground, one is cut in half, to reveal five wedge-shaped hollows arranged symmetrically to form a circle, each filled with shiny brown seeds.

This is because their fruits develop from five-petaled flowers, and the symmetry persists through development. These are pear blossoms:

A small branch from a pear tree, with green leaves and white pear blossoms.  The blossoms have five petals each, against which a cluster of dark-tipped stamens contrasts.

You can see this in most apples if you cut them into equatorial slices:

Apple slices on a cutting board, each with a hole in the middle from the seed capsule in the center of the core, in the shape of a five-pointed star.

The fivefold symmetry isn't usually apparent from the outside once the structure leaves the flowering stage. But perfect Red Delicious specimens do have five little feet:

A dozen Red Delicious apples, bottoms up to show that each does have five little bumps arranged around the blossom end.

P.S.: I was just kidding about the name of the quince, which actually has nothing to do with any of this. It is a coincidence.

by Mark Dominus (mjd@plover.com) at May 24, 2025 03:29 AM

May 22, 2025

Simon Marlow

Indexing Hackage: Glean vs. hiedb


I thought it might be fun to try to use Glean to index as much of Hackage as I could, and then do some rough comparisons against hiedb and also play around to see what interesting queries we could run against a database of all the code in Hackage.

This project was mostly just for fun: Glean is not going to replace hiedb any time soon, for reasons that will become clear. Neither are we ready (yet) to build an HLS plugin that can use Glean, but hopefully this at least demonstrates that such a thing should be possible, and Glean might offer some advantages over hiedb in performance and flexibility.

A bit of background:

  • Glean is a code-indexing system that we developed at Meta. It’s used internally at Meta for a wide range of use cases, including code browsing, documentation generation and code analysis. You can read about the ways in which Glean is used at Meta in Indexing Code At Scale with Glean.

  • hiedb is a code-indexing system for Haskell. It takes the .hie files that GHC produces when given the option -fwrite-ide-info and writes the information to a SQLite database in various tables. The idea is that putting the information in a DB allows certain operations that an IDE needs to do, such as go-to-definition, to be fast.

You can think of Glean as a general-purpose system that does the same job as hiedb, but for multiple languages and with a more flexible data model. The open-source version of Glean comes with indexers for ten languages or so, and moreover Glean supports SCIP which has indexers for various languages available from SourceGraph.

Since a hiedb is just a SQLite DB with a few tables, if you want you can query it directly using SQL. However, most users will access the data through either the command-line hiedb tool or through the API, which provide the higher-level operations such as go-to-definition and find-references. Glean has a similar setup: you can make raw queries using Glean’s query language (Angle) using the Glean shell or the command-line tool, while the higher-level operations that know about symbols and references are provided by a separate system called Glass which also has a command-line tool and API. In Glean the raw data is language-specific, while the Glass interface provides a language-agnostic view of the data in a way that’s useful for tools that need to navigate or search code.

An ulterior motive

In part all of this was an excuse to rewrite Glean’s Haskell indexer. We built a Haskell indexer a while ago but it’s pretty limited in what information it stores, only capturing enough information to do go-to-definition and find-references and only for a subset of identifiers. Furthermore the old indexer works by first producing a hiedb and consuming that, which is both unnecessary and limits the information we can collect. By processing the .hie files directly we have access to richer information, and we don’t have the intermediate step of creating the hiedb which can be slow.

The rest of this post

The rest of the post is organised as follows; feel free to jump around:

  • Performance: a few results comparing hiedb with Glean on an index of all of Hackage

  • Queries: A couple of examples of queries we can do with a Glean index of Hackage: searching by name, and finding dead code.

  • Apparatus: more details on how I set everything up and how it all works.

  • What’s next: some thoughts on what we still need to add to the indexer.

Performance

All of this was performed on a build of 2900+ packages from Hackage; for more details see Building all of Hackage below.

Indexing performance

I used this hiedb command:

hiedb index -D /tmp/hiedb . --skip-types

I’m using --skip-types because at the time of writing I haven’t implemented type indexing in Glean’s Haskell indexer, so this should hopefully give a more realistic comparison.

This was the Glean command:

glean --service localhost:1234 \
  index haskell-hie --db stackage/0 \
  --hie-indexer $(cabal list-bin hie-indexer) \
  ~/code/stackage/dist-newstyle/build/x86_64-linux/ghc-9.4.7 \
  --src '$PACKAGE'

Time to index:

  • hiedb: 1021s
  • Glean: 470s

I should note that in the case of Glean the only parallelism is between the indexer and the server that is writing to the DB. We didn’t try to index multiple .hie files in parallel, although that would be fairly trivial to do. I suspect hiedb is also single-threaded just going by the CPU load during indexing.

Size of the resulting DB

  • hiedb: 5.2GB
  • Glean: 0.8GB

It’s quite possible that hiedb is simply storing more information, but Glean does have a rather efficient storage system based on RocksDB.

Performance of find-references

Let’s look up all the references of Data.Aeson.encode:

hiedb -D /tmp/hiedb name-refs encode Data.Aeson

This is the query using Glass:

cabal run glass-democlient -- --service localhost:12345 \
  references stackage/hs/aeson/Data/Aeson/var/encode

This is the raw query using Glean:

glean --service localhost:1234 --db stackage/0 \
  '{ Refs.file, Refs.uses[..] } where Refs : hs.NameRefs; Refs.target.occ.name = "encode"; Refs.target.mod.name = "Data.Aeson"'

Times:

  • hiedb: 2.3s
  • Glean (via Glass): 0.39s
  • Glean (raw query): 0.03s

(side note: hiedb found 416 references while Glean found 415. I haven’t yet checked where this discrepancy comes from.)

But these results don’t really tell the whole story.

In the case of hiedb, name-refs does a full table scan so it’s going to take time proportional to the number of refs in the DB. Glean meanwhile has indexed the references by name, so it can serve this query very efficiently. The actual query takes a few milliseconds, the main overhead is encoding and decoding the results.

The reason the Glass query takes longer than the raw Glean query is because Glass also fetches additional information about each reference, so it performs a lot more queries.

We can also do the raw hiedb query using the sqlite shell:

sqlite> select count(*) from refs where occ = "v:encode" AND mod = "Data.Aeson";
417
Run Time: real 2.038 user 1.213905 sys 0.823001

Of course hiedb could index the refs table to make this query much faster, but it’s interesting to note that Glean has already done that and it was still quicker to index and produced a smaller DB.

Performance of find-definition

Let’s find the definition of Data.Aeson.encode, first with hiedb:

$ hiedb -D /tmp/hiedb name-def encode Data.Aeson
Data.Aeson:181:1-181:7

Now with Glass:

$ cabal run glass-democlient -- --service localhost:12345 \
  describe stackage/hs/aeson/Data/Aeson/var/encode
stackage@aeson-2.1.2.1/src/Data/Aeson.hs:181:1-181:47

(worth noting that hiedb is giving the span of the identifier only, while Glass is giving the span of the whole definition. This is just a different choice; the .hie file contains both.)

And the raw query using Glean:

$ glean --service localhost:1234 query --db stackage/0 --recursive \
  '{ Loc.file, Loc.span } where Loc : hs.DeclarationLocation; N : hs.Name; N.occ.name = "encode"; N.mod.name = "Data.Aeson"; Loc.name = N' | jq
{
  "id": 18328391,
  "key": {
    "tuplefield0": {
      "id": 9781189,
      "key": "aeson-2.1.2.1/src/Data/Aeson.hs"
    },
    "tuplefield1": {
      "start": 4136,
      "length": 46
    }
  }
}

Times:

  • hiedb: 0.18s
  • Glean (via Glass): 0.05s
  • Glean (raw query): 0.01s

In fact there’s a bit of overhead when using the Glean CLI, we can get a better picture of the real query time using the shell:

stackage> { Loc.file, Loc.span } where Loc : hs.DeclarationLocation; N : hs.Name; N.occ.name = "encode"; N.mod.name = "Data.Aeson"; Loc.name = N
{
  "id": 18328391,
  "key": {
    "tuplefield0": { "id": 9781189, "key": "aeson-2.1.2.1/src/Data/Aeson.hs" },
    "tuplefield1": { "start": 4136, "length": 46 }
  }
}

1 results, 2 facts, 0.89ms, 696176 bytes, 2435 compiled bytes

The query itself takes less than 1ms.

Again, the issue with hiedb is that its data is not indexed in a way that makes this query efficient: the defs table is indexed by the pair (hieFile,occ) not occ alone. Interestingly, when the module is known it ought to be possible to do a more efficient query with hiedb by first looking up the hieFile and then using that to query defs.

What other queries can we do with Glean?

I’ll look at a couple of examples here, but really the possibilities are endless. We can collect whatever data we like from the .hie file, and design the schema around whatever efficient queries we want to support.

Search by case-insensitive prefix

Let’s search for all identifiers that start with the case-insensitive prefix "withasync":

$ glass-democlient --service localhost:12345 \
  search stackage/withasync -i | wc -l
55

In less than 0.1 seconds we find 55 such identifiers in Hackage. (The output isn’t very readable so I didn’t include it here, but for example this finds results not just in async but in a bunch of packages that wrap async too.)

Case-insensitive prefix search is supported by an index that Glean produces when the DB is created. It works in the same way as efficient find-references, more details on that below.

Why only prefix and not suffix or infix? What about fuzzy search? We could certainly provide a suffix search too; infix gets more tricky and it’s not clear that Glean is the best tool to use for infix or fuzzy text search: there are better data representations for that kind of thing. Still, case-insensitive prefix search is a useful thing to have.

Could we support Hoogle using Glean? Absolutely. That said, Hoogle doesn’t seem too slow. Also we need to index types in Glean before it could be used for type search.

Identify dead code

Dead code is, by definition, code that isn’t used anywhere. We have a handy way to find that: any identifier with no references isn’t used. But it’s not quite that simple: we want to ignore references in imports and exports, and from the type signature.

Admittedly finding unreferenced code within Hackage isn’t all that useful, because the libraries in Hackage are consumed by end-user code that we haven’t indexed so we can’t see all the references. But you could index your own project using Glean and use it to find dead code. In fact, I did that for Glean itself and identified one entire module that was dead, amongst a handful of other dead things.

Here’s a query to find dead code:

N where
  N = hs.Name _;
  N.sort.external?;
  hs.ModuleSource { mod = N.mod, file = F };
  !(
    hs.NameRefs { target = N, file = RefFile, uses = R };
    RefFile != F;
    coderef = (R[..]).kind
  )

Without going into all the details, here’s roughly how it works:

  • N = hs.Name _; declares N to be a fact of hs.Name
  • N.sort.external?; requires N to be external (i.e. exported), as opposed to a local variable
  • hs.ModuleSource { mod = N.mod, file = F }; finds the file F corresponding to this name’s module
  • The last part is checking to see that there are no references to this name that are (a) in a different file and (b) are in code, i.e. not import/export references. Restricting to other files isn’t exactly what we want, but it’s enough to exclude references from the type signature. Ideally we would be able to identify those more precisely (that’s on the TODO list).

You can try this on Hackage and it will find a lot of stuff. It might be useful to focus on particular modules to find things that aren’t used anywhere, for example I was interested in which identifiers in Control.Concurrent.Async aren’t used:

N where
  N = hs.Name _;
  N.mod.name = "Control.Concurrent.Async";
  N.mod.unit = "async-2.2.4-inplace";
  N.sort.external?;
  hs.ModuleSource { mod = N.mod, file = F };
  !(
    hs.NameRefs { target = N, file = RefFile, uses = R };
    RefFile != F;
    coderef = (R[..]).kind
  )

This finds 21 identifiers, which I can use to decide what to deprecate!

Apparatus

Building all of Hackage

The goal was to build as much of Hackage as possible and then to index it using both hiedb and Glean, and see how they differ.

To avoid problems with dependency resolution, I used a Stackage LTS snapshot of package versions. Using LTS-21.21 and GHC 9.4.7, I was able to build 2922 packages. About 50 failed for some reason or other.

I used this cabal.project file:

packages: */*.cabal
import: https://www.stackage.org/lts-21.21/cabal.config

package *
    ghc-options: -fwrite-ide-info

tests: False
benchmarks: False

allow-newer: *

And did a large cabal get to fetch all the packages in LTS-21.21.

Then

cabal build all --keep-going

After a few retries to install any required RPMs to get the dependency resolution phase to pass, and to delete a few packages that weren’t going to configure successfully, I went away for a few hours to let the build complete.

It’s entirely possible there’s a better way to do this that I don’t know about - please let me know!

Building Glean

The Haskell indexer I’m using is in this pull request which at the time of writing isn’t merged yet. (Since I’ve left Meta I’m just a regular open-source contributor and have to wait for my PRs to be merged just like everyone else!)

Admittedly Glean is not the easiest thing in the world to build, mainly because it has a couple of troublesome dependencies: folly (Meta’s library of highly-optimised C++ utilities) and RocksDB. Glean depends on very up-to-date versions of these libraries, so we can’t use any distro-packaged versions.

Full instructions for building Glean are here but roughly it goes like this on Linux:

  • Install a bunch of dependencies with apt or yum
  • Build the C++ dependencies with ./install-deps.sh and set some env vars
  • make

The Makefile is needed because there are some codegen steps that would be awkward to incorporate into the Cabal setup. After the first make you can usually just switch to cabal for rebuilding stuff unless you change something (e.g. a schema) that requires re-running the codegen.

Running Glean

I’ve done everything here with a running Glean server, which was started like this:

cabal run exe:glean-server -- \
  --db-root /tmp/db \
  --port 1234 \
  --schema glean/schema/source

While it’s possible to run Glean queries directly on the DB without a server, running a server is the normal way because it avoids the latency from opening the DB each time, and it keeps an in-memory cache which significantly speeds up repeated queries.

The examples that use Glass were done using a running Glass server, started like this:

cabal run glass-server -- --service localhost:1234 --port 12345

How does it work?

The interesting part of the Haskell indexer is the schema in hs.angle. Every language that Glean indexes needs a schema, which describes the data that the indexer will store in the DB. Unlike an SQL schema, a Glean schema looks more like a set of datatype declarations, and it really does correspond to a set of (code-generated) types that you can work with when programmatically writing data, making queries, or inspecting results. For more about Glean schemas, see the documentation.

Being able to design your own schema means that you can design something that is a close match for the requirements of the language you’re indexing. In our Glean schema for Haskell, we use a Name, OccName, and Module structure that’s similar to the one GHC uses internally and is stored in the .hie files.

The indexer itself just reads the .hie files and produces Glean data using datatypes that are generated from the schema. For example, here’s a fragment of the indexer that produces Module facts, which contain a ModuleName and a UnitName:

mkModule :: Glean.NewFact m => GHC.Module -> m Hs.Module
mkModule mod = do
  modname <- Glean.makeFact @Hs.ModuleName $
    fsToText (GHC.moduleNameFS (GHC.moduleName mod))
  unitname <- Glean.makeFact @Hs.UnitName $
    fsToText (unitFS (GHC.moduleUnit mod))
  Glean.makeFact @Hs.Module $
    Hs.Module_key modname unitname

Also interesting is how we support fast find-references. This is done using a stored derived predicate in the schema:

predicate NameRefs:
  {
    target: Name,
    file: src.File,
    uses: [src.ByteSpan]
  } stored {Name, File, Uses} where
  FileXRefs {file = File, refs = Refs};
  {name = Name, spans = Uses} = Refs[..];

Here NameRefs is a predicate—which you can think of as a datatype, or a table in SQL—defined in terms of another predicate, FileXRefs. The facts of the predicate NameRefs (rows of the table) are derived automatically using this definition when the DB is created. If you’re familiar with SQL, a stored derived predicate in Glean is rather like a materialized view in SQL.

What’s next?

As I mentioned earlier, the indexer doesn’t yet index types, so that would be an obvious next step. There are a handful of weird corner cases that aren’t handled correctly, particularly around record selectors, and it would be good to iron those out.

Longer term ideally the Glean data would be rich enough to produce the Haddock docs. In fact Meta’s internal code browser does produce documentation on the fly from Glean data for some languages - Hack and C++ in particular. Doing it for Haskell is a bit tricky because while I believe the .hie file does contain enough information to do this, it’s not easy to reconstruct the full ASTs for declarations. Doing it by running the compiler—perhaps using the Haddock API—would be an option, but that involves a deeper integration with Cabal so it’s somewhat more awkward to go that route.

Could HLS use Glean? Perhaps it would be useful to have a full Hackage index to be able to go-to-definition from library references? As a plugin this might make sense, but there are a lot of things to fix and polish before it’s really practical.

Longer term should we be thinking about replacing hiedb with Glean? Again, we’re some way off from that. The issue of incremental updates is an interesting one - Glean does support incremental indexing but so far it’s been aimed at speeding up whole-repository indexing rather than supporting IDE features.

May 22, 2025 12:00 AM

May 08, 2025

Mark Jason Dominus

A descriptive theory of seasons in the Mid-Atlantic

[ I started thinking about this about twenty years ago, and then writing it down in 2019, but it seems to be obsolete. I am publishing it anyway. ]

The canonical division of the year into seasons in the northern temperate zone goes something like this:

  • Spring: March 21 – June 21
  • Summer: June 21 – September 21
  • Autumn: September 21 – December 21
  • Winter: December 21 – March 21

Living in the mid-Atlantic region of the northeast U.S., I have never been happy with this. It is just not a good description of the climate.

I begin by observing that the year is not equally partitioned between the four seasons. The summer and winter are longer, and spring and autumn are brief and happy interludes in between.

I have no problem with spring beginning in the middle of March. I think that is just right. March famously comes in like a lion and goes out like a lamb. The beginning of March is crappy, like February, and frequently has snowstorms and freezes. By the end of March, spring is usually skipping along, with singing birds and not just the early flowers (snowdrops, crocuses, daffodil) but many of the later ones also.

By the middle of May the spring flowers are over and the weather is getting warm, often uncomfortably so. Summer continues through the beginning of September, which is still good for swimming and lightweight clothes. In late September it finally gives way to autumn.

Autumn is jacket weather but not overcoat weather. Its last gasp is in the middle of November. By this time all the leaves have changed, and the ones that are going to fall off the trees have done so. The cool autumn mist has become a chilly winter mist. The cold winter rains begin at the end of November.

So my first cut would look something like this:

  • Winter: mid-November – mid-March
  • Spring: mid-March – mid-May
  • Summer: mid-May – mid-September
  • Autumn: mid-September – mid-November

Note that this puts Thanksgiving where it belongs at the boundary between autumn (harvest season) and winter (did we harvest enough to survive?). Also, it puts the winter solstice (December 21) about one quarter of the way through the winter. This is correct. By the solstice the days have gotten short, and after that the cold starts to kick in. (“As the days begin to lengthen, the cold begins to strengthen”.) The conventional division takes the solstice as the beginning of winter, which I just find perplexing. December 1 is not the very coldest part of winter, but it certainly isn't autumn.

There is something to be said for it though. I think I can distinguish several subseasons — ten in fact:

Dominus Seasonal Calendar

  • Winter (mid-November – mid-March): Early winter, Midwinter from around the solstice, Late winter
  • Spring (mid-March – mid-May): Early spring, Late spring
  • Summer (mid-May – mid-September): Early summer, Midsummer, Late summer
  • Autumn (mid-September – mid-November): Early autumn, Late autumn

Midwinter, beginning around the solstice, is when the really crappy weather arrives, day after day of bitter cold. In contrast, early and late winter are typically much milder. By late February the snow is usually starting to melt. (March, of course, is always unpredictable, and usually has one nasty practical joke hiding up its sleeve. Often, March is pleasant and springy in the second week, and then mocks you by turning back into January for the third week. This takes people by surprise almost every year and I wonder why they never seem to catch on.)

Similarly, the really hot weather is mostly confined to midsummer. Early and late summer may be warm but you do not get blazing sun and you have to fry your eggs indoors, not on the pavement.

Why the seasons seem to turn in the middle of each month, and not at the beginning, I can't say. Someone messed up, but who? Probably the Romans. I hear that the Persians and the Baha’i start their year on the vernal equinox. Smart!

Weather in other places is very different, even in the temperate zones. For example, in southern California they don't have any of the traditional seasons. They have a period of cooler damp weather in the winter months, and then instead of summer they have a period of gloomy haze from June through August.

However

I may have waited too long to publish this article, as climate change seems to have rendered it obsolete. In recent years, we have barely had midwinter, and instead of the usual two to three annual snows we have zero. Midsummer has grown from two to four months, and summer now lasts into October.

by Mark Dominus (mjd@plover.com) at May 08, 2025 10:39 PM

May 05, 2025

Matthew Sackman

Payslips and tax: calculating your own

In the UK, it’s very common that your employer pays you once a month. When this happens, they give you a document called a payslip, that has some numbers on it, such as how much your salary is, how much they paid you this month, how much went to HMRC in tax, how much went to your pension, and a few other numbers. But they never show any workings, so you really have no way to check whether any of these numbers are correct. There are plenty of online take-home-pay calculators, but these all focus on the full year; they have no facility to calculate your next payslip.

About half way through April 2024, I stopped working for one company. Everything was wrapped up – I received my final payslip from them, along with my P45. I then had a few months off, and started a new job in July 2024. When you start a new job it always takes a while for money things to get sorted out, for example pension enrolment and sorting out pension contributions, so it’s really worthwhile to keep a close eye on your payslips particularly for these first few months. Mine were arriving and some numbers looked right, but other numbers, such as the amount of tax I was paying, were changing dramatically, month to month. I had no idea why; whether they should be changing like that; whether they were going to keep changing or would eventually settle down. I had no way to check any of these numbers. Was I going to get in trouble with HMRC and get investigated?

I was also a little on edge because this was the first job where my pension contributions were using a thing called Qualifying Earnings. In all my previous jobs, if I chose for 10% of my salary to go into my pension, then that’s what would happen. But now there was this thing called Qualifying Earnings, which is (numbers correct at time of writing) a band from £6,240 to £50,270. If you’re earning, say £30k, then your x% contribution is actually x% of £30,000 - £6,240. If you’re earning above £50,270, then any further increase to your salary will not result in any extra contributions to your pension because you’re above the band. The 2008 Pensions Act, which created the legal requirement for all employees to have workplace pensions and for automatic enrolment (with a minimum 8% combined contribution from the employer and employee), also created this concept of Qualifying Earnings. I consider this a pretty scummy way of reducing employer pension contributions for large firms. It complicates the maths and no doubt adds confusion for people trying to check their own payslips. Given that 74% of the population have pensions that are too small to retire on, this whole concept of Qualifying Earnings seems amoral at best.
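
To make the band arithmetic concrete, here is a minimal Haskell sketch (not part of the library described below, which is written in Go; band figures as quoted above, correct at the time of writing):

-- Only the slice of salary inside the Qualifying Earnings band
-- (£6,240 to £50,270) attracts pension contributions under this scheme.
qualifyingEarnings :: Double -> Double
qualifyingEarnings salary = max 0 (min salary 50270 - 6240)

-- A 10% contribution on a £30,000 salary is 10% of £23,760 = £2,376,
-- not 10% of £30,000 = £3,000.
contribution :: Double -> Double -> Double
contribution rate salary = rate * qualifyingEarnings salary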

These days, a lot of smaller companies outsource their payroll processing. In my case, I was officially working for an international Employer of Record and they were then outsourcing payroll processing to local firms with country-specific expertise. So when I started asking questions, there was no ability to go and sit with someone and work through it. Or have a call. It was all messages passed across multiple different systems, and partial answers at best would come back several days later. Even if your payroll is done in-house, I strongly suspect that a lot of the time, some software package will be being used that does all the calculations and quite likely no one will actually understand or be able to explain the maths that’s going on.

After a while of getting nowhere, and after uncovering some substantial mistakes that had been made that affected me, I decided to spend some weekends actually figuring out how PAYE works, and writing some code that can calculate my next payslip. This library is available for anyone to use. There’s a README that hopefully explains the basic principles of how the calculations are done. This only works if your tax-code ends in an L, and it only works if you’re in National Insurance category A. All the code can do is use some details you provide to predict your next payslips. Also, I’m not a trained accountant or financial adviser, and even for my own payslips, every month, the numbers don’t quite match up (but they’re within £1). So please treat this as a toy, rather than the basis for building a payroll processor!

Getting started

The library is written in Go so you’ll need Go installed. Then, in a terminal do:

$ mkdir payslips
$ cd payslips
$ go mod init mypayslips
$ go get wellquite.org/tax@latest

Now we need to write a tiny amount of code. In your new payslips directory, create a main.go file, and open it in your editor. You want something like this:

package main

import (
   "fmt"
   "wellquite.org/tax"
)

func main() {
   payslips := tax.Payslips{
      {
         Year:                            2024,
         TaxCode:                         "1257L",
         Salary:                          tax.Yearly(50000),
         PensionType:                     tax.Salary,
         EmployeePensionContributionRate: 0.05,
         EmployerPensionContributionRate: 0.03,
      },
      {
         Salary:                          tax.Yearly(50000),
         PensionType:                     tax.Salary,
         EmployeePensionContributionRate: 0.05,
         EmployerPensionContributionRate: 0.03,
      },
      {},
      {
         Salary:                          tax.Yearly(60000),
         PensionType:                     tax.QualifyingEarnings,
         EmployeePensionContributionRate: 0.05,
         EmployerPensionContributionRate: 0.03,
      },
      {
         Salary:                          tax.Yearly(60000),
         PensionType:                     tax.QualifyingEarnings,
         EmployeePensionContributionRate: 0.15,
         EmployerPensionContributionRate: 0.03,
      },
      {
         Salary:                          tax.Yearly(60000),
         PensionType:                     tax.QualifyingEarnings,
         Expenses:                        116.08,
         EmployeePensionContributionRate: 0.15,
         EmployerPensionContributionRate: 0.03,
      },
   }

   payslips.Complete()
   fmt.Println(payslips)
}

We create a list of Payslips. The first payslip must specify a year, and your tax-code. These details are automatically applied to the payslips that follow, if not explicitly provided. Many of the calculations rely on year-to-date totals, and so we must have a complete record of your payslips from the start of the tax year. So that means the first payslip is month 1 (in this example, April 2024), then month 2 (May 2024) and so on. If you have no income for a month then you can just put in an empty payslip ({}). The above example describes being paid in April and May 2024, then nothing in June, and then being paid (with a higher salary) in July, August and September.

Save this main.go file. Then, back in your terminal, in your payslips directory, just do:

go run main.go

You should get some output showing all sorts of calculations, including income tax, and personal allowance. With a little luck, if you change the numbers to match your own salary and other details, the numbers produced should match quite closely your own payslips, provided nothing you’re doing is too exotic.

There is documentation for all the different fields that you can provide in each payslip. In general, the code will try to fill in missing values. It should be able to cope with things like salary-sacrifice, or, if you change job within a month and have several payslips for the same month, this should work too. Everything is run locally on your computer: please feel free to check the source – there are no 3rd party libraries at all, and nothing imports the net package. It’ll work just the same if you yank out your network cable or disable your WiFi.

Note however, this code is lightly tested. Whilst it works for me (and one or two friends), I make no claims that it correctly models the entirety of PAYE, so it may very well not work for you. Feedback, contributions, corrections, and patches are all very welcome!

May 05, 2025 02:30 PM