Function Inlining in Golang

What is Inlining?

Inlining is a form of compiler optimisation where the compiler replaces a function call with the function's body to reduce the overhead of the function call. The following diagram crudely shows what this process looks like.

Function inlining illustration

Let's take an example Go program.

// main.go
package main

func main() {
	var a int = 10
	var b int = 20
	
	c := sum(a, b)
	print(c)
	
	d := difference(b, a)
	print(d)
}

func sum(a int, b int) int {
	return a + b
}

func difference(a int, b int) int {
	return a - b
}

Build the above program using the following command,

GOOS=darwin GOARCH=amd64 go build -o expinline_darwin_amd64

To understand inlining in practice, we will have to look at the assembly code generated by the above command and for that, we will have to disassemble the binary expinline_darwin_amd64.

Before we do that, let's look at the symbols (I am looking for the function names here),

$ objdump --syms expinline_darwin_amd64 | grep -E '_main.main|sum|difference'        
0000000001057600 l     F __TEXT,__text _main.main
0000000001014ba0 l     F __TEXT,__text _runtime.(*limiterEvent).consume
0000000001025000 l     F __TEXT,__text _runtime.(*pallocBits).summarize
000000000102c9c0 l     F __TEXT,__text _runtime.resumeG

You will see no symbols for the functions sum and difference. To understand why this is the case, let's dig deeper into the assembly code,

$ objdump expinline_darwin_amd64 --disassemble-symbols=_main.main

Surprisingly, I see no function calls (callq) to sum or difference. It looks as if the compiler was smart enough to reduce the whole program to calling the print function on two absolute constants 30 (a+b) and 10 (b-a) denoted by movl $30, %eax and movl $10, %eax respectively, before the call to print.

Let's turn off this compiler optimisation to preserve calculation logic and reevaluate the results.

$ GOOS=darwin GOARCH=amd64 go build -gcflags='-N' -o expinline_darwin_amd64_no_op

And examine the _main.main symbol again,

$ objdump expinline_darwin_amd64_no_op --disassemble-symbols=_main.main

Now you can observe that there is addition logic between the offsets 105762a and 105763c and subtraction logic between the offsets 1057665 and 105767c. Again, there are no function calls (callq) to sum or difference, indicating that they have been inlined into the caller main().

Along with turning off the optimisation, inlining can also be turned off.

$ GOOS=darwin GOARCH=amd64 go build -gcflags='-N -l' -o expinline_darwin_amd64_no_op_no_inline

If you examine the symbols now, you can see the missing functions as well:

$ objdump --syms expinline_darwin_amd64_no_op_no_inline | grep -E '_main.main|sum|difference'
00000000010576e0 l     F __TEXT,__text _main.difference
0000000001057600 l     F __TEXT,__text _main.main
00000000010576a0 l     F __TEXT,__text _main.sum
0000000001014ba0 l     F __TEXT,__text _runtime.(*limiterEvent).consume
0000000001025000 l     F __TEXT,__text _runtime.(*pallocBits).summarize
000000000102c9c0 l     F __TEXT,__text _runtime.resumeG

And here's the disassembled code,

Why inline?

Improves performance by minimising the function call overheads

These overheads include stack manipulation, CPU register loading/unloading, branching, handling return values, etc.

Let's perform a quick benchmark test to gauge the gain in performance. The following bench function calls the sum and difference functions,

// main_bench_test.go
package main

import "testing"

var Result int

func BenchmarkSpecialBinomialProduct(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = sum(i, i+7) * difference(i+7, i)
	}
	Result = r
}

Run it with all the default optimisations turned on,

$ go test -run='^$' -bench=. -benchmem -benchtime=10s -count=10 > bench_with_inlining.txt

Now add the compiler directive //go:noinline just above the function definitions of sum and difference and rerun the bench,

$ go test -run='^$' -bench=. -benchmem -benchtime=10s -count=10 > bench_with_no_inlining.txt

Finally, compare the results from the benchmark above,

$ benchstat {bench_with_no_inlining,bench_with_inlining}.txt

Albeit a contrived example, you can see a performance improvement of ~78%.

Two things to note about inlining.

  • It is a trade-off between space (increased binary size) and time (performance). This is because the function's body is copied to everywhere it is called.
  • It yields diminishing returns as the functions get larger or more complex because the call overhead to these functions becomes relatively negligible.

Augments escape analysis

Escape analysis is a compiler optimisation technique that determines if a value can be on the stack frame for the function constructing it or if the value must “escape” to the heap.

Let's consider the following example,

// escape/main.go
package main

type Point struct {
	x, y int
}

func NewPoint(x, y int) *Point {
	return &Point{x, y} // 8:9
}

func main() {
	p1 := NewPoint(10, 20) // 12:16
	p2 := NewPoint(30, 40) // 13:16
	
	print(p1.x, p2.y)
}

Build the above program with the -m compiler flag to print optimisation decisions,

$ go build -gcflags='-m=1' escape/main.go         
# command-line-arguments
escape/main.go:7:6: can inline NewPoint
escape/main.go:11:6: can inline main
escape/main.go:12:16: inlining call to NewPoint
escape/main.go:13:16: inlining call to NewPoint
escape/main.go:8:9: &Point{...} escapes to heap
escape/main.go:12:16: &Point{...} does not escape
escape/main.go:13:16: &Point{...} does not escape

Since NewPoint is expanded inside the main function, the lifetimes of Point{10, 20} and Point{30, 40} are bound within the main's scope.

Now build with inlining turned off,

$ go build -gcflags='-m=1 -l' escape/main.go
# command-line-arguments
escape/main.go:8:9: &Point{...} escapes to heap

Each call to NewPoint(..) will result in a Point object allocated to the heap memory.

How does the compiler do that?

Go's compiler uses a combination of budget and cost to determine whether to inline a function. Every function has a cost compared against a budget (threshold).

A function's cost is calculated using the complexity of the statements within the function body (one unit per node in the AST). And the budget changes from one release (40 in Go 1.6 | src/cmd/compile/internal/gc/inl.go) to another (80 | src/cmd/compile/internal/inline/inl.go).

Let's take the following example,

Figures a, b and c represent the same program, each with slight modifications (green highlight) to demonstrate cost allocation and mid-stack inlining.

Let's build the program in Figure a (output condensed for relevance),

$ GOOS=darwin GOARCH=amd64 go build -gcflags='-m=2' midstack/main.go                      
# command-line-arguments
midstack/main.go:30:6: can inline Sum with cost 4 as: func(int, int) int { return a + b }
midstack/main.go:23:6: can inline MidPoint with cost 28 as: func(Point, Point) Point { return Point{...} }
midstack/main.go:25:9: inlining call to Sum
midstack/main.go:26:9: inlining call to Sum
midstack/main.go:18:6: cannot inline PrintPoint: function too complex: cost 89 exceeds budget 80
midstack/main.go:20:13: inlining call to fmt.Println
midstack/main.go:10:6: cannot inline main: function too complex: cost 114 exceeds budget 80
midstack/main.go:13:26: inlining call to MidPoint
midstack/main.go:13:26: inlining call to Sum
midstack/main.go:13:26: inlining call to Sum

You can see that each function has a cost assigned to it, which is compared against the budget (80) to decide whether the function can be inlined. The functions MidPoint (cost: 28) and Sum (cost: 4) are inline-able, whereas PrintPoint (cost: 89) and main (cost: 114) are not.

I will also use this example to showcase the two forms of inlining,

  1. Leaf inlining
    • Earlier, only functions that did not call other functions (bottom of the call stack) were eligible for inlining. Once a leaf function is inlined, the resulting code can be inlined into its caller, a recursive process.
    • For example, Sum (a leaf function) gets inlined into MidPoint. Next, MidPoint (a leaf function now) gets inlined into main.
  2. Mid-stack inlining
    • Now, non-leaf functions are also eligible for inlining.
    • For example, in Figure c, the function Sum is marked with the directive //go:noinline, making it non-eligible for inlining. Even though MidPoint calls Sum, it is allocated a cost of 76, making it eligible for inlining.

References

  1. Inlining optimisations in Go - Dave Cheney
  2. Mid-stack inlining in Go - Dave Cheney
  3. Inlining - bright victories and hidden defeats
  4. Five things that make Go fast - Dave Cheney
  5. Mid-stack inlining in the Go compiler - David Lazar