This is a cross-post of a post I authored for Well-Typed.

Haddock is the documentation generation tool for Haskell. Given a set of Haskell modules, Haddock can generate documentation for those modules in various formats based on specially formatted comments in the source code. Haddock is a very important part of the Haskell ecosystem, as many authors and users of Haskell software rely on Haddock-generated documentation to familiarize themselves with the code.

Recently, Mercury asked us to investigate performance problems in Haddock that were preventing their developers from using it effectively. In particular, when running on Mercury’s code, Haddock would eat up all of the memory on 64GB machines. This made it difficult for their developers to generate documentation to browse locally.

At a high level, the work covered by this post has resulted in Haddock’s memory usage being roughly halved. The full set of Haddock and GHC changes resulting in these improvements will be shipped with GHC 9.8.

All this profiling and debugging work was completed using the eventlog2html and ghc-debug tools, which are excellent for characterising and diagnosing Haskell program performance. However, these are much less widely known than they deserve to be. This post aims to act as a case study demonstrating how we have used the tools to make significant performance improvements to a complex, real-world application.

If you’re interested in finding out more about these tools, read on or check out some of our videos:

Characterising the problem

The first step towards diagnosing Haddock’s memory behavior is generating a memory profile with eventlog2html.

An eventlog is a file that can be generated by the GHC runtime system when a Haskell program is executed, containing lots of information about the execution that is useful for profiling. When the program is run with heap profiling enabled, it will regularly emit heap usage samples to the eventlog. Subsequently, eventlog2html can be used to convert this file into an HTML document with interactive visualisations and statistics of the program’s memory profile. Exploring a profile generated by eventlog2html often leads to discovery of optimization opportunities that are otherwise difficult to find.

The following baseline profile shows live heap data over time used by Haddock to generate HTML documentation on the Agda codebase:¹

You can also see the full HTML page generated by eventlog2html for the baseline, and later profiles are also accompanied by links to the HTML version.

Note: Chrome may fail to render the “Detailed” tab of the profiles (see eventlog2html issue 166). Firefox is recommended.

This shows that the memory usage increases through the run until it reaches a peak with a 1.1GB of live data on the Haskell heap. Since this counts only live data, the actual process memory usage will be much higher (e.g. the copying garbage collector at least doubles the amount of memory needed). The “Heap” tab of the eventlog2html output shows more detail: it includes “Live Bytes” (as in the main profile) and “Heap Size” (the total memory used by the runtime system), which in this example reaches 2.3GB.

Info table profiling

This profile was generated using info table profiling, a new profiling method that avoids the need for recompiling the application in profiling mode and provides extra detail. You can read more about info table profiling in our previous post. This means the eventlog2html “Detailed” tab is also very revealing (full HTML page here):

An info table roughly corresponds to an allocation site (such as a thunk or data constructor application) in the source code. For each info table, we see a sparkline profile of the heap usage over time, plus type information and a source location showing the code responsible for the allocation. This allows us to explore the sources of excessive allocations in more detail.

Any source module provided to Haddock is first type-checked using the GHC API. Thus we need to carefully distinguish what parts of the profile are coming from GHC and not Haddock itself. Digging around in the details revealed that the visible spike in memory usage at the start of the profile is coming from the GHC typechecker. This can be seen by finding the closures (heap objects) that contribute to that spike, the majority of which are list closures allocated in the GHC.Tc.Utils.Zonk module (which is only used for type checking).

Finding retainers using `ghc-debug`

The third row of the table shows that much of the heap is occupied by occurrences of the GRE constructor of the GlobalRdrElt type. These values represent the “global renaming environment”, used to track things that are in-scope during name resolution.² Knowing how Haddock operates, there is no good reason these closures should be sticking around on the heap throughout the entire run. Rather, once a module has been processed by Haddock, scope information for that module should no longer be retained in memory.

To figure out why these closures are building up on the heap, we can use ghc-debug’s retainer profiling capabilities. ghc-debug is a tool to interactively or programmatically inspect the heap of a Haskell process by pausing it and attaching a debugger process. To find out more, you can check out our previous post on using ghc-debug to analyse memory fragmentation.

After instrumenting Haddock with ghc-debug we can connect to the process using ghc-debug-brick, which provides a terminal UI for interactively exploring and running queries over the heap. In particular, it can search for retainers of GRE closures, which results in retainer paths like the following:

The GRE closure we are examining is shown at the very top. The items beneath it labeled as fields are closures that this GRE closure is retaining (keeping references to). The retainer path of the GRE closure is displayed beneath the fields. Each closure shown in the path is directly referenced by the closure beneath it (though these paths are not necessarily unique, so it may be referenced elsewhere as well).

In this case, we see from retainer number 15 that the GRE closure is transitively retained by a ModIface closure, which we have expanded to show its fields. By comparing closure addresses, we see that the reference is in field 12, namely the mi_globals field of the ModIface constructor.

While it requires some knowledge of the workings of the Haskell heap and the expected memory behaviour of the program, ghc-debug makes it possible to track down issues very precisely.

Fixing the problem

As the documentation for the mi_globals field indicates, it is really used only by GHCi. However, it was also enabled for GHC’s “no-code” backend at some point, which is what Haddock uses. It turns out we can set this field to Nothing for GHC’s no-code backend and reduce Haddock’s memory usage by about 20% without any negative side effects. See !10469 for the patch.

With just this change in GHC and no changes to Haddock, the peak live data residency for Haddock on the Agda codebase is now down below 950MB, with the following profile (full eventlog2html output here):

eventlog2html area chart after eliminating GRE closures

There are no more GRE closures dominating the profile in the “Detailed” tab:

eventlog2html detail chart, GRE closures eliminated

Reducing memory usage the easy way

Answering difficult questions about a program’s memory usage will often lead to optimization opportunities. This happened above when we essentially answered the question of “what are these GRE constructors doing here?” In contrast, answering easy questions can sometimes be just as productive. This is especially true for Haddock, since it has been in development for over 15 years and has lots of dusty corners in its source code as a result.

In essence, Haddock is a compiler that translates Haskell source code into an intermediate representation and then into some documentation format (e.g. HTML or LaTeX). The data structure Haddock uses as its intermediate representation is the Interface type. The core functions of Haddock are to generate Interface values from Haskell modules using the GHC API, then to generate documentation from those Interface values.

Going field by field through the Interface type and answering the simple question of “what is this field used for?” led to many optimizations. There were even fields of the Interface type that were being defined (and thus allocated) but were never used. These fields could simply be removed altogether.

A less trivial optimization also resulted from this process. Haddock is able to generate a textual format that can be ingested and indexed by Hoogle, the Haskell API search engine. One of the fields of the Interface type (ifaceRnExportItems) was being used only to generate these Hoogle files, and was being populated even if Hoogle output was not requested. Additionally, the thunks being retained by this field were disproportionately large for the results they were used to compute. Instead we can generate the required values eagerly, only when Hoogle output is requested.

These changes resulted in another drastic decrease in Haddock’s memory usage. With the patched GHC and Haddock, peak live data residency on the Agda benchmark is now just over 550MB, and the profile looks like this (full eventlog2html output here):

eventlog2html area chart after general memory fixes

Hi, Haddock!

Haddock was originally designed to type-check the source modules using the GHC API, then generate its intermediate representation directly from the results of type-checking. This had two significant downsides:

Even if the program has been compiled already, it must be type-checked again to generate documentation (and because of the need to execute Template Haskell splices, re-type-checking frequently implies full recompilation).
As we have seen, the results of type-checking contain far more information than is necessary to generate documentation alone.

The long-running “Hi Haddock” effort addressed this by fundamentally changing the way Haddock works. Instead of generating documentation from the results of type-checking, Haddock now uses the information available in the interface files (.hi files) produced by GHC during compilation to describe the interface of each module.

This fixed both of the issues listed above:

Once the program has been compiled, the information Haddock needs is already present in the .hi files and can be reused later. Moreover the existing recompilation avoidance mechanism can be used to decide whether subsequent compilation is necessary when source files have changed.
The documentation structures stored in .hi files can be the minimal amount required by Haddock, avoiding the need to represent additional data in memory.

The Hi Haddock project was a major undertaking, originally started by Simon Jakobi in 2018 and with subsequent contributions from Alec Theriault, Hécate Moonlight and Zubin Duggal. As part of our work to reduce Haddock’s memory usage, Well-Typed finished off and landed the patch for Haddock (Haddock MR 1597). We’re grateful to everyone who contributed to the Hi Haddock work over the years.

The performance impacts were very good. Adding the Hi Haddock changes on top of the aforementioned patches resulted in a peak live data residency of only 380MB, with the following profile (full eventlog2html output here):

eventlog2html area chart with Hi Haddock changes

Moreover, when recompilation can be avoided because up-to-date compilation results are available already, the memory usage is slightly lower but the time taken drops significantly (full eventlog2html output here):

eventlog2html area chart when avoiding recompilation

Conclusion and future work

Thanks to this work, Haddock will use significantly less memory when generating documentation. Many of these improvements were achieved as a result of profiling Haddock using eventlog2html and ghc-debug. On top of those improvements, we have landed the Hi Haddock changes which resulted in additional performance gains.

The results presented in this blog post were captured while Haddock was running its default HTML-only mode, as this work was primarily focused on reducing the baseline memory usage of Haddock. Haddock has several backends and modes of operation that are likely ripe with more opportunity for both space and time optimizations.

Furthermore, more work is needed for users to be able to take full advantage of the benefits of Hi Haddock. It is currently very difficult to avoid recompilation during documentation generation when Haddock is executed through Cabal, as Cabal passes several flags to Haddock that always invalidate the previous build results. We are currently implementing the necessary changes to Cabal and Haddock to smooth out this interface (see Cabal MR 9177 and Haddock MR 38).

We are extremely grateful to Mercury for making this work on Haddock possible, and to all the Haddock maintainers and contributors. Thanks also to David Christiansen for feedback on a draft of this blog post.

Well-Typed is able to work on GHC, HLS, Cabal, Haddock and other core Haskell infrastructure thanks to funding from various sponsors. If your company might be able to contribute to this work, sponsor maintenance efforts, or fund the implementation of other features, please read about how you can help or get in touch.

The Agda codebase was chosen as a sample for this work because it is a large enough project that the phases of Haddock’s execution are apparent in the memory profile, and it uses a diverse set of Haskell language features that should result in significant coverage of Haddock’s code paths.↩︎
See this GHC wiki page for more information on the renamer.↩︎

Reducing Haddock's Memory Usage