Specialization is an optimization technique used by GHC to eliminate the performance overhead of ad-hoc polymorphism and enable other powerful optimizations. However, specialization is not free, since it requires more work by GHC during compilation and leads to larger executables. In fact, excessive specialization can result in significant increases in compilation cost and executable size with minimal runtime performance benefits. For this reason, GHC pessimistically avoids excessive specialization by default and may leave relatively low-cost performance improvements undiscovered in doing so.
Optimistic Haskell programmers hoping to take advantage of these missed opportunities are thus faced with the difficult task of discovering and enacting an optimal set of specializations for their program while balancing any performance improvements with the increased compilation costs and executable sizes. Until now, this dance was a clunky one involving desperately wading through GHC Core dumps only to come up with a precarious, inefficient, unmotivated set of pragmas and/or GHC flags that seem to improve performance.
In this two-part series of posts, I describe the recent work we have done to improve this situation and make optimal specialization of Haskell programs more of a science and less of a dark art. In this first post, I will
- give a comprehensive introduction to GHC’s specialization optimization,
- explore the various facilities that GHC provides for observing and controlling it, and
- present a simple framework for thinking about the trade-offs of specialization.
In the next post of the series, I will
- present the new tools and techniques we have developed to diagnose performance issues resulting from ad-hoc polymorphism,
- demonstrate how these new tools can be used to systematically identify useful specializations, and
- make sense of their impact in terms of the framework described in this post.
The intended audience of this post includes intermediate Haskell developers who want to know more about specialization and ad-hoc polymorphism in GHC, and advanced Haskell developers who are interested in systematic approaches to specializing their applications in ways that minimize compilation cost and executable sizes while maximizing performance gains.
This work was made possible thanks to Hasura, who have supported many of Well-Typed’s successful initiatives to improve tooling for commercial Haskell users.
I presented a summary of the content in this post on The Haskell Unfolder:
The Haskell Unfolder Episode 23: specialisation
Overloaded functions are common in Haskell, but they come with a cost. Thankfully, the GHC specialiser is extremely good at removing that cost. We can therefore write high-level, polymorphic programs and be confident that GHC will compile them into very efficient, monomorphised code. In this episode, we’ll demystify the seemingly magical things that GHC is doing to achieve this.
Ad-hoc polymorphism
In Haskell, an ad-hoc polymorphic or overloaded function is one whose type contains class constraints. For example, this f is an overloaded function:
f :: (Ord a, Num a) => a -> a -> a
f x y =
  if x < y then
    x + y
  else
    x - y
For some type a such that Ord a and Num a instances are provided, f takes two values of type a and evaluates to another a.
Importantly, unlike type arguments, those class constraints are not erased at runtime! Actually, they will be passed to f just like any other value argument, meaning f at runtime is more like:
f :: Ord a -> Num a -> a -> a -> a
f ord_a num_a x y = ...
How does the definition of f change to represent this? And what do these ord_a and num_a values look like? This is how it works:
- Instances are compiled to records, typically referred to as dictionaries, whose fields are the definitions provided in the instance.
- Class functions (e.g. < in the body of f) become record selectors that are applied to the dictionaries to look up the appropriate definitions.
Thus, f at runtime is more like:
f :: Ord a -> Num a -> a -> a -> a
f ord_a num_a x y =
  if (<) ord_a x y then
    (+) num_a x y
  else
    (-) num_a x y
The previously-infix class operators are now applied in prefix position to select the appropriate definitions out of the dictionaries, which are then applied to the arguments.
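Before looking at GHC’s own output, it may help to see this dictionary-passing translation written out as ordinary Haskell. The following is an illustrative sketch only: the record types OrdDict and NumDict and all the field names are made up for this example, and GHC’s real generated dictionaries look different.

-- Illustrative only: dictionaries modeled as plain records of methods.
data OrdDict a = OrdDict { lt :: a -> a -> Bool }
data NumDict a = NumDict { plus :: a -> a -> a, minus :: a -> a -> a }

-- f with its constraints turned into explicit dictionary arguments,
-- and class methods replaced by record selectors.
fDict :: OrdDict a -> NumDict a -> a -> a -> a
fDict ord num x y =
  if lt ord x y then plus num x y else minus num x y

-- An "instance" is just a concrete dictionary value.
ordIntDict :: OrdDict Int
ordIntDict = OrdDict { lt = (<) }

numIntDict :: NumDict Int
numIntDict = NumDict { plus = (+), minus = (-) }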
We can see this for ourselves by compiling the definition of f in a module F.hs and emitting the intermediate representation (in GHC’s Core language):
ghc F.hs -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-ds
The -O flag enables optimizations, and the -ddump-ds flag tells GHC to dump the Core representation of the program after desugaring, before optimizations. The other flags make the output more readable.
For a comprehensive introduction to GHC Core and the flags GHC accepts for viewing it, check out The Haskell Unfolder Episode 9: GHC Core.
The above command will output the following Core for f:
f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }
The if has been transformed into a case (Core has no if construct). The $dOrd and $dNum arguments are the Ord a and Num a instance dictionaries, respectively. The < operator is applied in prefix position (as are all operators in Core) to the $dOrd dictionary to get the appropriate implementation of <, which is further applied to x and y. The - and + operators in the branches of the case are similar.
The extra allocations required to pass these implicit dictionary arguments and apply selectors to them do result in a measurable overhead, albeit one that is insignificant for most intents and purposes. As we will see, the real cost of ad-hoc polymorphism comes from the optimizations it prevents rather than the overhead it introduces.
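If we wanted to measure that overhead ourselves, a micro-benchmark comparing an overloaded call against a hand-specialized one would be a reasonable starting point. Here is a minimal sketch using the criterion library (an assumed dependency, not used elsewhere in this post); note that GHC may well specialize or inline the overloaded version of such a tiny function anyway, so a faithful measurement would need to keep it opaque, e.g. by defining it in a separate module compiled without exposed unfoldings:

import Criterion.Main (bench, defaultMain, nf)

-- The overloaded function and a manual specialization to Int.
f :: (Ord a, Num a) => a -> a -> a
f x y = if x < y then x + y else x - y
{-# NOINLINE f #-}  -- discourage GHC from optimizing the call away

fInt :: Int -> Int -> Int
fInt x y = if x < y then x + y else x - y

main :: IO ()
main = defaultMain
  [ bench "overloaded"  (nf (f (1 :: Int)) 2)
  , bench "specialized" (nf (fInt 1) 2)
  ]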
Specialization
In this context, specialization refers to the removal of ad-hoc polymorphism. When we specialize an overloaded expression e :: C a => S a, we create a new binding eT :: S T, where T is some concrete type for which a C T instance exists. Here eT is the specialization of e at (or to) type T.
For example, we can manually create a specialization of f at type Int. The source definition stays exactly the same, only the type changes:
fInt :: Int -> Int -> Int
fInt x y =
  if x < y then
    x + y
  else
    x - y
At the Core level, the dictionaries that were passed as value arguments to f are now used directly in the body of fInt. If we add the definition of fInt to our example module and compile it as we did before, we get the following output:
f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

fInt = \ x y ->
  case < $fOrdInt x y of {
    False -> - $fNumInt x y;
    True -> + $fNumInt x y
  }
fInt no longer accepts dictionary arguments, and instead references the global Ord Int and Num Int dictionaries directly. In fact, this definition of fInt is exactly what the GHC specializer would create if it decided to specialize f to Int. We can see this for ourselves by manually instructing GHC to do the specialization using a SPECIALIZE pragma.
Our whole module is now:
module F where

{-# SPECIALIZE f :: Int -> Int -> Int #-}
f :: (Ord a, Num a) => a -> a -> a
f x y =
  if x < y then
    x + y
  else
    x - y

fInt :: Int -> Int -> Int
fInt x y =
  if x < y then
    x + y
  else
    x - y
And the -ddump-ds Core output becomes:
fInt = \ x y ->
  case < $fOrdInt x y of {
    False -> - $fNumInt x y;
    True -> + $fNumInt x y
  }

$sf = \ x y ->
  case < $fOrdInt x y of {
    False -> - $fNumInt x y;
    True -> + $fNumInt x y
  }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }
The GHC generated specialization is named $sf (all specializations that GHC generates are prefixed by $s). Note that our specialization (fInt) and the GHC generated specialization ($sf) are exactly equivalent!
Why is this an optimization?
The above transformation really is all that the GHC specializer does to our programs. It may not be immediately clear why this optimization is a meaningful optimization at all. That is because specialization is an enabling optimization: The real benefit comes from the optimizations that it enables later in the pipeline, such as inlining.
Inlining is the replacement of defined (top-level or let-bound) variables with their definitions. Although f and its specialization $sf look similar, the key difference is that f includes calls to “unknown” functions passed as part of the dictionary arguments, while $sf includes calls to “known” functions contained in the $fOrdInt and $fNumInt dictionaries. Since GHC has access to the definitions of those dictionaries and the contained functions, they can be inlined, exposing yet more opportunities for optimization.
We can see this in action by comparing the fully optimized bindings of our example module to those just after desugaring. To do this, compile using the same command as above but add the -ddump-simpl flag, which tells GHC to dump the Core at the end of the Core optimization pipeline (also add -fforce-recomp to force recompilation, since we haven’t changed the code since our last compilation):
ghc F.hs -fforce-recomp -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-ds -ddump-simpl
The dumped output is:
==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 57, types: 37, coercions: 0, joins: 0/0}

fInt = \ x y ->
  case < $fOrdInt x y of {
    False -> - $fNumInt x y;
    True -> + $fNumInt x y
  }

$sf = \ x y ->
  case < $fOrdInt x y of {
    False -> - $fNumInt x y;
    True -> + $fNumInt x y
  }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }
==================== Tidy Core ====================
Result size of Tidy Core
= {terms: 44, types: 29, coercions: 0, joins: 0/0}
fInt = \ x y ->
  case x of { I# x1 ->
  case y of { I# y1 ->
  case <# x1 y1 of {
    __DEFAULT -> I# (-# x1 y1);
    1# -> I# (+# x1 y1)
  }
  }
  }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }
------ Local rules for imported ids --------
"USPEC f @Int" forall $dNum $dOrd. f $dOrd $dNum = fInt
The output of the desugaring pass is in the “Desugar (after optimization)” section, while the fully optimized output is in the “Tidy Core” section. The name “Desugar (after optimization)” only means it is the desugared Core output after GHC’s simple optimizer has run. The simple optimizer only does very lightweight, pure transformations to the Core program. We will still refer to the Core output of this stage as “unoptimized”.
During the full optimization pipeline, GHC identified the equivalence between fInt and $sf and decided to remove $sf. The fully optimized binding for fInt is unboxing the Ints (pattern matching on the I# constructor) and using efficient primitive operations (<#, -#, +#), while the fully optimized binding for f is the same as the unoptimized binding. The optimizer simply couldn’t do anything with those opaque dictionaries in the way!
At the bottom of the output is the rewrite rule that the SPECIALIZE pragma created, which will cause any calls of f known to be at type Int to be rewritten as applications of fInt. This is what allows the rest of the program to benefit from the specialization. The rule simply discards the dictionary arguments $dNum :: Num Int and $dOrd :: Ord Int, which is safe because of global typeclass coherence: any dictionaries passed explicitly must have originally come from the same global instances.
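As an illustration of the rule in action, consider a call site like the following (g is a hypothetical binding, not part of our example module):

g :: Int -> Int
g x = f x 42
-- The call to f is at the concrete type Int, so the "USPEC f @Int" rule
-- fires during optimization and this compiles as if we had written:
--   g x = fInt x 42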
In summary, by replacing the opaque dictionary arguments to f with references to the concrete Ord Int and Num Int dictionaries in fInt, GHC was able to do a lot more optimization later in the pipeline.
Automatic specialization
In our example module, we manually instructed GHC to generate a specialization of f at Int using a SPECIALIZE pragma. In reality, we often rely on GHC to figure out what specializations are necessary and generate them for us automatically. GHC needs to be careful though, since specialization requires the creation and optimization of more bindings, which increases compilation costs and executable sizes.
GHC uses several heuristics to avoid excessive automatic specialization by default. The heuristics are very pessimistic, which means GHC can easily miss valuable specialization opportunities that programmers may wish to manually address. This is precisely the manual effort that our recent work aims to assist, so before we go any further it’s important that we understand exactly when and why GHC decides specialization should (or should not) happen.
When does automatic specialization happen?
GHC will only potentially attempt automatic specialization in exactly one scenario: An overloaded call at a concrete, statically known type is encountered (we’ll refer to such calls as “specializable” calls from now on). This means that automatic specialization will only ever be triggered at call sites, not definition sites. Even in this scenario, there are other factors to consider which the following example will demonstrate.
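To make the notion of a “specializable” call concrete, here is a hedged sketch (g and h are illustrative bindings, not part of our example module):

-- Specializable: f is called at the concrete, statically known type
-- Double, so GHC may (subject to the other factors discussed below)
-- specialize f to Double at this call site.
g :: Double -> Double
g x = f x 10

-- Not specializable here: the type variable a is still abstract, so the
-- dictionaries are merely passed along to f and no specialization is
-- triggered at this call site.
h :: (Ord a, Num a) => a -> a
h x = f x 10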
Let’s add a binding foo to our example module F.hs from above:

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y
foo makes a specializable call to f at the concrete type Integer, so we might expect automatic specialization to happen. However, the inliner beats the specializer to the punch here, which is evident in the -ddump-simpl output:
$wfoo = \ ww ww1 ->
  case integerLt ww ww1 of {
    False -> integerSub ww ww1;
    True -> integerAdd ww ww1
  }

foo = \ ds -> case ds of { (ww, ww1) -> $wfoo ww ww1 }
Instead of specializing, GHC decided to eliminate the call entirely by inlining f, thus exposing other optimization opportunities (such as worker/wrapper) which GHC took advantage of. This is intended, since f is so small that GHC knows inlining it is very cheap and likely worth the performance outcomes.
Another way we can observe the inlining decision by GHC here is via the -ddump-inlinings flag, which causes GHC to dump the names of any bindings it decides to inline. Compiling our module with

ghc F.hs -O -fforce-recomp -ddump-inlinings

results in output indicating that GHC did decide to inline f:
Inlining done: F.f
To inline or to specialize?
GHC prefers inlining over specialization, when possible, since inlining eliminates calls and doesn’t require creation of new bindings. However, excessive inlining is often even more dangerous than excessive specialization. So, even when a specializable call is deemed too costly to inline, GHC will still attempt to specialize it.
We can artificially create such a scenario in our example by adjusting what GHC calls the “unfolding use threshold”. An “unfolding” is, roughly, the definition of a binding that GHC uses when it decides to inline or specialize calls to that binding. The unfolding use threshold governs the maximum effective size¹ of unfoldings that GHC will inline, and it can be manually adjusted using the -funfolding-use-threshold flag. Let’s set the unfolding use threshold to -1, essentially making GHC think all inlining is very expensive, and check the -ddump-simpl output:
ghc F.hs -O -fforce-recomp -ddump-simpl -funfolding-use-threshold=-1
As we can see, GHC did specialize the call:
...
f_$sf1 = \ x y ->
  case integerLt x y of {
    False -> integerSub x y;
    True -> integerAdd x y
  }

foo = \ ds -> case ds of { (ww, ww1) -> f_$sf1 ww ww1 }
------ Local rules for imported ids --------
"SPEC f @Integer" forall $dOrd $dNum. f $dOrd $dNum = f_$sf1
...
The name of the specialization (f_$sf1) and the rewrite rule indicate that GHC did successfully automatically specialize the overloaded call to f. Interestingly, the Core terms for foo and its specialization f_$sf1 are alpha-equivalent to the terms we arrived at when GHC inlined the call and applied worker/wrapper instead², with the specialization playing the same role as the worker.
Cross-module automatic specialization
We have now discussed two prerequisites for automatic specialization of a call:
- The call must be specializable (i.e. it must be a call to an overloaded binding at a known type).
- Other optimizations, such as inlining, that remove the call or otherwise ruin the specializability of the call must not fire before specialization can occur.
In fact, for specializable calls which occur in the definition module of the overloaded binding (as was the case in our previous example), these are the only prerequisites. When the overloaded binding is imported from another module (as is most often the case), there are additional prerequisites which we’ll discuss now.
Exposed unfoldings and the INLINABLE pragma
GHC performs separate compilation (as opposed to whole program compilation), compiling one Haskell module at a time. When GHC compiles a module, it produces not only compiled code in an object file, but also an interface file (with suffix .hi). The interface file contains information about the module that GHC might need to reference when compiling other modules, such as the names and types of the bindings exported by the module. If certain criteria are met, GHC will include a binding’s unfolding in the module’s interface file so that it can be used later for cross-module inlining or specialization. Such unfoldings are referred to as exposed unfoldings.
Now, you might reasonably wonder: If unfoldings are used to do these powerful optimizations, why does GHC only expose unfoldings which meet some criteria? Why not expose all unfoldings? The reason is that during compilation, GHC holds the interfaces of every module in the program in memory. Thus, to keep GHC’s own default performance and memory usage reasonable, module interfaces need to be as small as possible while still producing well-optimized programs. One way that GHC achieves this is by limiting the size of unfoldings that get included in interface files so that only small unfoldings are exposed by default.
There’s another wrinkle here that impacts cross-module specialization: Even if GHC decides to expose an overloaded binding’s unfolding, and a specializable call to that binding occurs in another module, GHC will still never automatically specialize that call unless it has been given explicit permission to create the specialization. Such explicit permission can only be given in one of the following ways:
- Mark the overloaded binding with either an INLINABLE or INLINE pragma (a sketch of this approach follows this list).
- Enable the -fspecialize-aggressively flag while compiling the calling module.
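For instance, granting permission via the first option might look like the following sketch of our module F (we will do exactly this later in the example):

module F where

f :: (Ord a, Num a) => a -> a -> a
f x y = if x < y then x + y else x - y

-- Permit cross-module specialization of f by exposing its unfolding.
{-# INLINABLE f #-}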
Let’s explore this fact by continuing with our example. Move foo, which makes a specializable call to f, to another module Foo.hs that has -funfolding-use-threshold set to -1 to fool the inliner as before:

{-# OPTIONS_GHC -funfolding-use-threshold=-1 #-}
module Foo where

import F

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y
Also remove everything from F.hs except f, for good measure:

module F where

f :: (Ord a, Num a) => a -> a -> a
f x y =
  if x < y then
    x + y
  else
    x - y
Since f is so small, we might expect GHC to expose its unfolding in the F.hi module interface by default. If we compile with just

ghc F.hs

we get the object file F.o and the interface file F.hi. We can determine whether GHC decided to expose the unfolding of f by viewing the contents of the interface file using GHC’s --show-iface option:
ghc --show-iface F.hi -dsuppress-all
Specific information for each binding in the module is listed towards the bottom of the output. The GHC Core of any exposed unfoldings will be displayed under their respective bindings. In this case, the information for f looks like this:
bcb4b04f3cbb5e6aa2f776d6226a0930
f :: (Ord a, Num a) => a -> a -> a
[]
It only includes the type, no unfolding! This is because at GHC’s default optimization level of -O0, the -fomit-interface-pragmas and -fignore-interface-pragmas flags are enabled, which prevent unfoldings (among other things) from being included in and read from the module interfaces. Recompile with optimizations enabled and check the module interface again:
ghc -O F.hs
ghc --show-iface F.hi -dsuppress-all
This time, GHC did expose the unfolding:
152dd20f273a86bea689edd6a298afe6
  f :: (Ord a, Num a) => a -> a -> a
  [...,
   Unfolding: Core: <vanilla>
     \ @a
       ($dOrd['Many] :: Ord a)
       ($dNum['Many] :: Num a)
       (x['Many] :: a)
       (y['Many] :: a) ->
       case < @a $dOrd x y of wild {
         False -> - @a $dNum x y
         True -> + @a $dNum x y }]
Remember, we still haven’t given GHC explicit permission to specialize calls to f across modules, so we should expect the fully optimized Core of Foo.hs to still include the overloaded call to f. Let’s check:
ghc Foo.hs -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-simpl
The dumped Core includes:
$wfoo = \ ww ww1 -> f $fOrdInteger $fNumInteger ww ww1

foo = \ ds -> case ds of { (ww, ww1) -> $wfoo ww ww1 }
Indeed, GHC applied the worker/wrapper transformation to foo, but was not able to specialize the call to f, despite it meeting our previously discussed prerequisites for automatic specialization.
There is a warning flag in GHC that can notify us of such a case: -Wall-missed-specializations. Compile Foo.hs again, including this flag:
ghc Foo.hs -O -fforce-recomp -Wall-missed-specializations
This will output the following warning:
Foo.hs: warning: [-Wall-missed-specialisations]
Could not specialise imported function ‘f’
Probable fix: add INLINABLE pragma on ‘f’
If we do what the warning says by adding an INLINABLE pragma on f, and dump the Core of Foo.hs, we’ll see that automatic specialization succeeds:
$sf = \ x y ->
  case integerLt x y of {
    False -> integerSub x y;
    True -> integerAdd x y
  }

foo = \ ds -> case ds of { (ww, ww1) -> $sf ww ww1 }
------ Local rules for imported ids --------
"SPEC/Foo f @Integer" forall $dOrd $dNum. f $dOrd $dNum = $sf
Removing the INLINABLE pragma on f and instead enabling -fspecialize-aggressively has the same result.
The automatic specialization decision graph
We have now covered all the major prerequisites for automatic specialization. To summarize them, here is a decision graph illustrating the various ways that an arbitrary function call can trigger automatic specialization:
Now that we fully understand how, why, and when the GHC specializer works, we can move on to discussing the real problems that result from its behavior. Most of this discussion will be left for the next post in this series, but before concluding, I want to introduce something I call “the specialization spectrum”.
The specialization spectrum
Specialization is a very valuable compiler optimization, but I’ve mentioned many times throughout this post that excessive specialization can be a bad thing. This prompts a question: How do we know if we are appropriately benefitting from specialization? The meaning of “appropriately” here depends on application-specific requirements that dictate the desired size of our executables, how much we care about compilation costs, and how much we care about performance.
For example, if we want to maximize performance at all costs, we should make sure that we are generating and using the set of specializations that maximize the performance metrics we’re interested in, disregarding the increase in compilation costs and executable sizes.
Essentially, our goal is to find our ideal spot in the specialization spectrum.
This is our search space, with performance on one axis and code size and compilation cost on the other. The plotted points represent important application-agnostic points in the spectrum. Those points are:
- Baseline: Lowest performance and lowest cost. This point represents GHC’s default behavior where its heuristics will result in smaller code size and lower compilation cost but potentially miss specializations that would result in big performance wins.
- Ideal: As the application authors, we get to choose the location of this point based on our priorities. Typically, we want this as “high and to the left” as possible.
- Max performance: This point represents the optimal set of specializations, which will result in better runtime performance than any other set of specializations.
- Max specialization: This point is the result of generating every³ possible specialization by enabling -fexpose-all-unfoldings and -fspecialize-aggressively. Importantly, this is not always equivalent to max performance! If we generate useless specializations that result in little to no performance improvements but do grow the code size, we can end up losing performance due to more code swapping in and out of CPU caches.
The dotted line illustrates an approximate “optimal path” representing the results we might see as we generate all specializations in order of decreasing performance improvement.
This framework makes it clear that this really is just an optimization problem, with all the normal issues of traditional optimization problems at play. Unfortunately, in the absence of good tools for exploring this spectrum, it is particularly easy for programmers to get lost and go down treacherous, suboptimal paths like this:
Such cases are deceptive, making the programmer think they have landed in a good spot when they are actually in a poor-performing local optimum. Fortunately, the tools and techniques we’ll discuss in the next post of this series will greatly simplify optimal search of the specialization spectrum.
Summary
This concludes our introductory exploration of specialization. Here’s what we have learned:
- Calls to overloaded functions are compiled by passing dictionary values with a record of functions for each type class constraint.
- Specialization removes type class dictionary arguments from an overloaded function and replaces references to them with references to a concrete dictionary instead.
- Almost all of the benefit of specialization comes from the optimizations that it enables by replacing the opaque dictionary arguments with concrete dictionaries whose contents can be inlined.
- GHC will only automatically specialize calls if a specific set of conditions holds. See the automatic specialization decision graph.
- The specialization spectrum is a convenient framework for conceptualizing the impact of specialization on a program’s compilation cost and runtime performance.
In the next post of this series, we will apply all of what we have learned so far on some example applications, and demonstrate how the new tools we have developed can help us achieve optimal specialization and performance.
Footnotes
1. The effective size of an unfolding can be thought of as the number of terms in the Core representation of the unfolding, plus or minus some discounts that are applied depending on where GHC is considering inlining the unfolding.↩︎
2. This hints at a weak confluence of GHC Core and the reductions (i.e. optimizations) that the GHC optimizer applies to it.↩︎
3. Even with something like this in a cabal.project file:

   package *
     ghc-options: -fexpose-all-unfoldings -fspecialize-aggressively

   Some overloaded calls may still not get specialized! This can occur if a chain of calls to overloaded functions includes a call to an overloaded function in a GHC boot library that cannot be reinstalled by Cabal, e.g. base, which does not have its unfolding exposed. The only way to specialize such calls is to build the boot libraries from source with -fexpose-all-unfoldings and -fspecialize-aggressively, and include the snippet above in a cabal.project file.

   Additionally, some specific scenarios can cause overloaded calls to appear late in the optimization pipeline. To specialize those calls, -flate-specialise (British spelling required) is necessary, which runs another specialization pass at the end of GHC’s Core optimization pipeline.

   Further, even after the above, some overloaded calls may still survive without -fpolymorphic-specialisation (British spelling required), which is known to be unsound at the time of writing. Unfortunately, in complex applications, total elimination of overloaded calls is still quite a difficult goal to achieve.↩︎