

Go code refactoring: the 23x performance hunt

A few weeks ago, I read an article called “Good Code vs Bad Code in Golang”, where the author guides us step by step through the refactoring of an actual business use case.

The article focuses on turning “bad code” into “good code”: more idiomatic, more legible, leveraging the specifics of the Go language. But it also insists that performance is an important aspect of the project. This triggered my curiosity: let’s dig in!


The program basically reads an input file, and parses each line to populate an object in memory.
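
The actual code lives in the author’s repository; as a rough, purely illustrative sketch of the shape of the task (the type, file name, and parsing logic below are invented here, not the author’s):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// message is a hypothetical stand-in for the object populated from each line.
type message struct {
	fields []string
}

func main() {
	f, err := os.Open("input.txt") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var messages []message
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Placeholder parsing: the real program extracts structured fields.
		messages = append(messages, message{fields: strings.Fields(scanner.Text())})
	}
	fmt.Println(len(messages), "messages parsed")
}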

The author not only published the source on GitHub; he also wrote an idiomatic benchmark. This was a really great idea, like an invitation to tweak the code and reproduce the measurements with the command:

$ go test -bench=.
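
If you haven’t written one before, an idiomatic Go benchmark looks roughly like this (a minimal sketch with hypothetical fixture and function names, not the author’s actual benchmark):

package parser

import (
	"os"
	"testing"
)

// parseMessage is a placeholder for the function under test.
func parseMessage(data []byte) int { return len(data) }

func BenchmarkParseMessage(b *testing.B) {
	data, err := os.ReadFile("testdata/sample.txt") // hypothetical fixture
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer() // exclude the file read from the measurement
	for i := 0; i < b.N; i++ {
		parseMessage(data)
	}
}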

So, on my machine the “good code” is 16% faster. Can we gain more?

In my experience there is an interesting correlation between code quality and performance. When you successfully refactor your code to make it clearer and more decoupled, you often end up making it faster, as it’s less cluttered with irrelevant instructions that were previously executed in vain, and also because some possible optimizations become obvious and easy to implement.

On the other hand, if you push further in your quest for performance, you will have to give up simplicity and resort to hacks. You will indeed shave milliseconds, but code quality will suffer, insofar as it will become more difficult to read and to reason about, more brittle, and less flexible.

It’s a trade-off: how far are you willing to go?

In order to properly prioritize your performance effort, the most valuable strategy is to identify your bottlenecks and focus on them. To achieve this, use profiling tools! Pprof and Trace are your friends:

$ go test -bench=. -cpuprofile cpu.prof
$ go tool pprof -svg cpu.prof > cpu.svg
$ go test -bench=. -trace trace.out
$ go tool trace trace.out

The trace proves that all CPU cores are used (bottom lines 0, 1, etc.), which looks like a good thing at first. But it shows thousands of small colored computation slices, and also some blank slots where some of the cores are idle. Let’s zoom in:

Each core actually spends a lot of its time idle, and keeps switching between micro-tasks. It looks like the granularity of the tasks is not optimal, leading to a lot of context switches and to contention due to synchronization.

Let’s use the race detector to check that the synchronization is correct (if it’s not, then we have bigger issues than performance):

$ go test -race
PASS

Yes! It seems correct: no data race was detected.

The concurrency strategy in the “good” version consists of processing each line of input in its own goroutine, to leverage multiple cores. This is a legitimate intuition, as goroutines have a reputation for being lightweight and cheap. How much are we gaining thanks to concurrency? Let’s compare with the same code in a single sequential goroutine: just remove the go keyword preceding the line-parsing function call.
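
In outline, the two variants differ by one keyword (a minimal sketch with hypothetical names; the real code collects structured results):

package main

import "fmt"

// parseLine is a placeholder for the actual line parser.
func parseLine(line string) string { return line }

func main() {
	lines := []string{"line 1", "line 2", "line 3"}

	// Concurrent version: one goroutine per line, results sent on a channel.
	results := make(chan string, len(lines))
	for _, line := range lines {
		go func(l string) { results <- parseLine(l) }(line)
	}
	for range lines {
		fmt.Println(<-results)
	}

	// Sequential version: the exact same call, without the go keyword.
	for _, line := range lines {
		fmt.Println(parseLine(line))
	}
}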

Oops, it’s actually faster without any parallelism. This means that the (non-zero) overhead of launching a goroutine exceeds the time saved by using several cores at the same time.

The natural next step, as we are now processing lines sequentially instead of concurrently, is to avoid the (non-zero) overhead of using a channel of results: let’s replace it with a bare slice.
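
The change, again in a minimal hypothetical sketch:

package main

import "fmt"

func parseLine(line string) string { return line } // placeholder parser

func main() {
	lines := []string{"line 1", "line 2", "line 3"}

	// Before: results flowed through a channel, even in the sequential version.
	// After: a plain slice, with no channel send/receive overhead.
	messages := make([]string, 0, len(lines))
	for _, line := range lines {
		messages = append(messages, parseLine(line))
	}
	fmt.Println(len(messages), "messages")
}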

We’ve now gained a ~40% speedup over the “good” version, just by simplifying the code and removing the concurrency (diff).

Now let’s have a look at the hot function calls in the Pprof graph:

The benchmark of our current version (sequential, with slice) spends 86% of its time actually parsing messages, which is fine. We quickly notice that 43% of the total time is spent matching a regular expression with (*Regexp).FindAll .

While regexps are a convenient and flexible way to extract data from raw text, they have drawbacks, including a cost in memory and runtime. They are powerful, but probably overkill for many use cases.

In our program, the pattern

patternSubfield = "-.[^-]*"

is mostly intended to recognize “commands” starting with a dash “-”, and a line may have several commands. With a little tuning, this could be done with bytes.Split instead. Let’s adapt the code (commit, commit) to replace the regexp with a split:
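
The idea looks roughly like this (the sample line and edge-case handling are invented here for illustration; the real change is in the linked commits):

package main

import (
	"bytes"
	"fmt"
	"regexp"
)

func main() {
	line := []byte("-TITLE IFPL -ADEP LFBO -ADES EGLL") // hypothetical sample line

	// Regexp version: each match is one "-COMMAND args" subfield.
	re := regexp.MustCompile("-.[^-]*")
	fmt.Printf("%q\n", re.FindAll(line, -1))

	// bytes.Split version: cheaper, but we must skip the empty chunk
	// before the leading '-' and remember that the '-' itself is gone.
	for _, sub := range bytes.Split(line, []byte("-")) {
		if len(sub) == 0 {
			continue
		}
		fmt.Printf("%q\n", sub)
	}
}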

Wow, this is an extra 40% perf gain!

The CPU graph now looks like this:

The huge regexp cost is gone. A fair share of the time (40%) is spent allocating memory, across five different functions. Interestingly, 21% of the total time is now accounted for by bytes.Trim.

bytes.Trim expects a “cutset string” as an argument (for the separator), but we use it only with a single space byte as the separator. This is an example where you can gain performance by introducing a bit of complexity: implement your own custom “trim” func in lieu of the standard-library one, handling only a single separator byte.
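
A minimal sketch of such a specialized trim (an assumed shape; the author’s actual implementation may differ):

package main

import "fmt"

// trimByte strips a single known byte from both ends of s. Unlike
// bytes.Trim, it does not interpret a cutset string, which makes it
// cheaper for the common case of exactly one separator byte.
func trimByte(s []byte, b byte) []byte {
	for len(s) > 0 && s[0] == b {
		s = s[1:]
	}
	for len(s) > 0 && s[len(s)-1] == b {
		s = s[:len(s)-1]
	}
	return s
}

func main() {
	fmt.Printf("%q\n", trimByte([]byte("  hello world  "), ' ')) // "hello world"
}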






