A few weeks ago, I read an article called “Good Code vs Bad Code in Golang”, where the author guides us step-by-step through the refactoring of an actual business use case. The article focuses on turning “bad code” into “good code”: more idiomatic, more legible, leveraging the specifics of the Go language. But it also insists on performance being an important aspect of the project. This triggered my curiosity: let’s dig in!
The program basically reads an input file, and parses each line to populate an object in memory.
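In outline, such a program has roughly this shape. The names below (Message, parseLine, parseFile) are illustrative stand-ins, not the repo’s actual identifiers, and the concurrency discussed later is left out:

import (
	"bufio"
	"os"
)

type Message struct{ raw string }                 // stand-in for the real parsed type
func parseLine(s string) Message { return Message{raw: s} } // stand-in for the real parser

func parseFile(path string) ([]Message, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var msgs []Message
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line of the input becomes one object in memory.
		msgs = append(msgs, parseLine(scanner.Text()))
	}
	return msgs, scanner.Err()
}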
The author not only published the source on Github, he also wrote an idiomatic benchmark. This was a really great idea, like an invitation to tweak the code and reproduce the measurements with the command:
$ go test -bench=.
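For context, an idiomatic Go benchmark is just a function in a _test.go file whose name starts with Benchmark, importing "testing" and running the measured code b.N times. A minimal sketch, reusing the hypothetical parseFile from above (the actual benchmark in the repo may be named and structured differently, and testdata/input.txt is a made-up fixture path):

func BenchmarkParseFile(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// The framework picks b.N so the loop runs long enough to measure reliably.
		if _, err := parseFile("testdata/input.txt"); err != nil {
			b.Fatal(err)
		}
	}
}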
So, on my machine the “good code” is 16% faster. Can we gain more?
In my experience there is an interesting correlation between code quality and performance. When you successfully refactor your code to make it clearer and more decoupled, you often end up making it faster, as it’s less cluttered with irrelevant
instructions that were previously executed in vain, and also because some possible optimizations become obvious and easy to implement.
On the other hand, if you push further in your quest for performance, you will have to give up simplicity and resort to hacks. You will indeed shave milliseconds, but code quality will suffer, insofar as it will become more difficult to read and
to reason about, more brittle, and less flexible.
It’s a trade-off: how far are you willing to go?
In order to properly prioritize your performance effort, the most valuable strategy is to identify your bottlenecks and focus on them. To achieve this, use profiling tools! Pprof and Trace are your friends:
$ go test -bench=. -cpuprofile cpu.prof
$ go tool pprof -svg cpu.prof > cpu.svg
$ go test -bench=. -trace trace.out
$ go tool trace trace.out
The trace proves that all CPU cores are used (bottom lines 0, 1, etc.), which looks like a good thing at first. But it shows thousands of small colored computation slices, and also some blank slots where some of the cores are idle. Let’s zoom
in:
Each core actually spends a lot of its time idle, and keeps switching between micro-tasks. It looks like the granularity of the tasks is not optimal, leading to a lot of context switches and to contention due to synchronization.
Let’s check with the race detector if the synchronization is correct (if it’s not, then we have bigger issues than performance):
$ go test -race
PASS
Yes!! It seems correct: no data race was detected.
The concurrency strategy in the “good” version consists of processing each line of input in its own goroutine, to leverage multiple cores. This is a legitimate intuition, as goroutines have a reputation for being lightweight and cheap. How much are we gaining thanks to concurrency? Let’s compare with the same code in a single sequential goroutine (just remove the go keyword preceding the line parsing function call).
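Schematically, the two variants differ only by that keyword. This is a sketch with illustrative names (lines, parseLine, resultChan), assuming resultChan is drained concurrently by the collecting goroutine:

// Concurrent variant: each line is parsed in its own goroutine.
for _, line := range lines {
	go func(l string) {
		resultChan <- parseLine(l)
	}(line)
}

// Sequential variant: the same call, without the go keyword (and without the closure).
for _, line := range lines {
	resultChan <- parseLine(line)
}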
Oops, it’s actually faster without any parallelism. This means that the (non-zero) overhead of launching a goroutine exceeds the time saved by using several cores at the same time.
The natural next step, as we are now processing lines sequentially instead of concurrently, is to avoid the (non-zero) overhead of using a channel of results: let’s replace it with a bare slice.
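In sketch form (again with illustrative names), the collection step goes from a channel send to a plain append:

// Before: each result is sent over a channel and collected elsewhere.
resultChan <- parseLine(line)

// After: results are appended to a slice, with no synchronization at all.
results = append(results, parseLine(line))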
We’ve now gained a ~40% speedup from the “good” version, only by simplifying the code and removing the concurrency (diff).
Now let’s have a look at the hot function calls in the Pprof graph:
The benchmark of our current version (sequential, with slice) spends 86% of its time actually parsing messages, which is fine. We quickly notice that 43% of the total time is spent matching a regular expression with (*Regexp).FindAll.
While regexps are a convenient and flexible way to extract data from raw text, they have drawbacks, including a cost in memory and runtime. They are powerful, but probably overkill for many use cases.
In our program, the pattern patternSubfield = "-.[^-]*" is mostly intended to recognize “commands” starting with a dash “-”, and a line may have several commands. This, with a little tuning, could be done with bytes.Split.
Let’s adapt the code (commit, commit) to replace the regexp with a Split:
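The idea, as a small standalone sketch (function and variable names are mine, not the repo’s): split the line on the dash byte and treat every chunk after the first as one command. Note that, unlike the regexp matches, the chunks do not include the leading dash, which is part of the “little tuning” mentioned above:

package main

import (
	"bytes"
	"fmt"
)

// splitCommands stands in for the regexp "-.[^-]*": it returns the dash-prefixed
// "commands" of a line, without their leading dash.
func splitCommands(line []byte) [][]byte {
	chunks := bytes.Split(line, []byte("-"))
	// chunks[0] is whatever precedes the first dash; the commands follow it.
	return chunks[1:]
}

func main() {
	line := []byte("-FOO hello -BAR world")
	for _, cmd := range splitCommands(line) {
		fmt.Printf("%q\n", bytes.TrimSpace(cmd)) // "FOO hello", "BAR world"
	}
}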
Wow, this is an extra 40% perf gain!
The CPU graph now looks like this:
No more huge regexp cost. A fair share of the time (40%) is spent allocating memory, across 5 different functions. It’s interesting that 21% of the total time is now accounted for by bytes.Trim.
bytes.Trim expects a “cutset string” as an argument (the set of characters to trim), but we use it only with a single space byte. This is an example where you may gain performance by introducing a bit of complexity: implement your own custom “trim” func in lieu of the standard library one. The custom “trim” deals with only a single separator byte.
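A minimal sketch of such a single-byte trim (my naming; the actual implementation in the repo may differ):

package main

import "fmt"

// trimSpaceByte removes leading and trailing space bytes only, avoiding the
// generality (and cost) of bytes.Trim, which accepts an arbitrary cutset.
func trimSpaceByte(s []byte) []byte {
	start := 0
	for start < len(s) && s[start] == ' ' {
		start++
	}
	end := len(s)
	for end > start && s[end-1] == ' ' {
		end--
	}
	return s[start:end]
}

func main() {
	fmt.Printf("%q\n", trimSpaceByte([]byte("  hello world  "))) // "hello world"
}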