Hey Valentin, I've heard you're into Go and performance.
Would you take a look at my HOTAS image processing? Only if you have a bit of spare time.
Hi Ankur
What's HOTAS?
Hands On Throttle And Stick: they're controllers used in flight simulator games. My app generates HOTAS
reference images on the fly, server-side.
Oooh very cool. Sure, I'd be happy to help.
The repo is here: github.com/ankurkotwal/MetaRefCard
I'm worried about a memory leak as the memory seems to grow but not shrink.
Also the image generation is kinda slow, which incurs a mildly annoying user-facing latency.
Ooh I see you've already written idiomatic tests and benchmarks! That's awesome,
it will make performance exploration much more straightforward 📈
I've run tests with pprof and generated some call graphs... but it's still not crystal clear to me.
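For context, I collect the profiles from a benchmark, roughly like this; the function under test is a stand-in for the real one:

```go
package metarefcard_test

import "testing"

// Stand-in for the real image generation entry point.
func generateImage() {}

// Run with:
//   go test -bench=GenerateImage -cpuprofile=cpu.prof -memprofile=mem.prof
// then inspect the call graph with:
//   go tool pprof -web cpu.prof
func BenchmarkGenerateImage(b *testing.B) {
	for i := 0; i < b.N; i++ {
		generateImage()
	}
}
```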
Before I deploy to Cloud Run, I need to check whether the image generation is likely to OOM given the memory available to a regular instance.
I could pay more to get several GB, but a memory leak would still crash the instance eventually 🙄
OK, let's tackle this from 2 sides.
I'll read the code and try to reproduce the memory leak on my workstation.
Meanwhile could you deploy to Cloud Run (low memory), launch a few dozen requests, and see if you actually
experience any failures?
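A quick-and-dirty burst client is enough; the URL and request count here are placeholders:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// burst fires n concurrent GET requests at url and logs failures.
// A crude load generator, not a real benchmark harness.
func burst(url string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				log.Println("request failed:", err)
				return
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			if resp.StatusCode != http.StatusOK {
				log.Println("unexpected status:", resp.Status)
			}
		}()
	}
	wg.Wait()
}

func main() {
	burst("https://example.com/card", 50) // placeholder URL
}
```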
👌 Thank you!
So, I ran the tests with the -race flag, and it turns out there are some data races in the lazy init logic. I opened a PR.
It's a cultural thing in Go: if a program has data races, we pedantically regard it as broken 😂
We want to fix the races before improving anything else. In this case, sync.Once and sync.Map come to the rescue.
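The gist of the fix looks roughly like this; the config type, loader, and font cache are hypothetical stand-ins for the repo's real ones:

```go
package main

import (
	"sync"

	"golang.org/x/image/font"
	"golang.org/x/image/font/basicfont"
)

// Hypothetical config type and loader.
type Config struct{ BackgroundColor string }

func loadConfig() *Config { return &Config{BackgroundColor: "navy"} }

// Race-free lazy init: the first caller runs loadConfig exactly once;
// concurrent callers block until it's done, then all see the same value.
var (
	configOnce sync.Once
	config     *Config
)

func getConfig() *Config {
	configOnce.Do(func() { config = loadConfig() })
	return config
}

// Concurrency-safe cache of font faces, replacing an unsynchronized map.
var faceCache sync.Map // name -> font.Face

func loadFace(name string) font.Face { return basicfont.Face7x13 } // stub

func getFace(name string) font.Face {
	if f, ok := faceCache.Load(name); ok {
		return f.(font.Face)
	}
	// Racing goroutines may both call loadFace, but only one result
	// is kept; acceptable for an idempotent loader.
	f, _ := faceCache.LoadOrStore(name, loadFace(name))
	return f.(font.Face)
}

func main() {
	_ = getConfig()
	_ = getFace("default")
}
```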
Ah! Thanks, I'll merge it
When requests are processed sequentially, there doesn't seem to be any big memory usage.
E.g. this is using the Trace viewer: it never gets above 20MB
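For reference, the trace comes from the standard runtime/trace package, roughly like this; the instrumented code itself is elided:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

// Record an execution trace of whatever runs in main,
// then open it with `go tool trace trace.out`.
func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... process the requests under observation ...
}
```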
Correct. The problem occurs when requests are processed concurrently,
which may happen often in production, as each request takes several seconds to complete.
Indeed when I launch a burst of 50 concurrent requests, I can see the memory usage
climbing to 750MB
You know, sometimes the Go runtime allocates large chunks of memory that you actually need and
doesn't give them back quickly to the OS, and that's mostly OK? This doesn't
always mean your code has a memory leak.
Even if we explicitly trigger the garbage collection after each request, the program often doesn't seem to release any memory...
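By "explicitly trigger" I mean something like this diagnostic snippet, using runtime.GC plus debug.FreeOSMemory:

```go
package main

import (
	"log"
	"runtime"
	"runtime/debug"
)

// Diagnostic only: force a collection, ask the runtime to return
// freed pages to the OS, and log what the heap looks like.
func dumpHeapAfterGC() {
	runtime.GC()
	debug.FreeOSMemory()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	log.Printf("HeapAlloc=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB",
		m.HeapAlloc>>20, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	dumpHeapAfterGC()
}
```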
As long as you don't OOM, it's fine if your program is occupying more memory than
the amount your code's objects are actually using. It has some leeway. Maybe trust the GC? 🤷
I found this very suspicious at first, but okay, I'll let the OS and the Go runtime
haggle over the memory heap and hope for the best!
Also I haven't noticed any OOM in Cloud Run so far, so this is looking good 🤞
I had a look at the latency problem
You're drawing multiple text labels on the image in rectangles that don't
overlap, so I thought of doing that concurrently
Nah, I thought the same; unfortunately the Font Face objects are not thread-safe.
And we don't even need to try harder with more clunky synchronization patterns,
because the text drawing is actually not the bottleneck.
You're 100% right, we must focus on the actual bottlenecks, and "writing the
characters' pixels in the labels" is not one of them.
So, this is our CPU profile
18% of the total CPU time is spent encoding to JPEG.
Did you know about the drop-in encoder replacement github.com/pixiv/go-libjpeg/jpeg? It can encode 5 times as fast, shaving 15% off the total time. That's a quick win 😉
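The swap is mostly mechanical; note the package is cgo-backed, so it needs libjpeg (or libjpeg-turbo) installed. The quality setting below is my assumption; match whatever you're using today:

```go
package main

import (
	"bytes"
	"image"
	"io"
	"log"

	// Drop-in replacement for the stdlib image/jpeg encoder,
	// backed by libjpeg/libjpeg-turbo via cgo.
	"github.com/pixiv/go-libjpeg/jpeg"
)

// encodeJPEG writes img to w. Quality 90 is an assumption; use
// whatever the stdlib encoder was configured with before.
func encodeJPEG(w io.Writer, img image.Image) error {
	return jpeg.Encode(w, img, &jpeg.EncoderOptions{Quality: 90})
}

func main() {
	img := image.NewRGBA(image.Rect(0, 0, 64, 64))
	var buf bytes.Buffer
	if err := encodeJPEG(&buf, img); err != nil {
		log.Fatal(err)
	}
	log.Printf("encoded %d bytes", buf.Len())
}
```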
Updated CPU profile
Now trying to figure out why so much time (15s) is sunk in generateImages, while
(decode image + draw labels + encode image) only adds up to (3s + 1.1s + 0.6s) = 4.7s...
Alright, here's what's going on.
You have a configurable background color, so the process is:
1. create a blank RGBA image
2. fill it with the custom background color
3. draw the card model PNG on top (it has transparency)
4. draw the card text labels
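In code, the pipeline looks roughly like this; the context setup matches fogleman/gg, but the model image, font face, and label are placeholders:

```go
package main

import (
	"image"
	"image/color"

	"github.com/fogleman/gg"
	"golang.org/x/image/font/basicfont"
)

// Rough sketch of the generation pipeline; the real app loads an
// actual card model PNG and a real font face.
func renderCard(cardModel image.Image) *gg.Context {
	dc := gg.NewContext(1024, 768)         // 1. blank RGBA image
	dc.SetColor(color.RGBA{0, 0, 64, 255}) // 2. fill with the custom background color
	dc.Clear()
	dc.DrawImage(cardModel, 0, 0) // 3. draw the card model PNG on top (the slow step)
	dc.SetFontFace(basicfont.Face7x13)
	dc.DrawString("Fire missile", 100, 200) // 4. draw the text labels
	return dc
}

func main() {
	model := image.NewNRGBA(image.Rect(0, 0, 1024, 768)) // placeholder for the PNG
	_ = renderCard(model)
}
```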
For the 3rd step, the library fogleman/gg uses the BiLinear transformer, which is officially "slow" and incurs a lot of NRGBA/RGBA computations.
The choice of the BiLinear kernel in gg is not configurable.
I thought that mixing "non-alpha-premultiplied" with "alpha-premultiplied" might
incur some extra arithmetic burden. Then I realized BiLinear was the problem.
Even if there's no option for that, the library is open source, so we could hack it to use e.g. ApproxBiLinear instead and see if the quality is still high enough.
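If we bypassed gg for that step, the call would look something like this with golang.org/x/image/draw; everything here is a sketch, not the actual repo code:

```go
package main

import (
	"image"

	xdraw "golang.org/x/image/draw"
	"golang.org/x/image/math/f64"
)

// Draw the (transparent) model onto dst with the cheaper
// ApproxBiLinear interpolator instead of gg's hardwired BiLinear.
func drawModel(dst *image.RGBA, model image.Image, offsetX, offsetY float64) {
	// Translation-only affine transform: no scaling, no rotation.
	s2d := f64.Aff3{1, 0, offsetX, 0, 1, offsetY}
	xdraw.ApproxBiLinear.Transform(dst, s2d, model, model.Bounds(), xdraw.Over, nil)
}

func main() {
	dst := image.NewRGBA(image.Rect(0, 0, 1024, 768))
	model := image.NewNRGBA(image.Rect(0, 0, 1024, 768))
	drawModel(dst, model, 0, 0)
}
```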
That's an interesting rabbit hole
You know what? Choosing the background color is not a core feature of
my website. I can totally live with a default background color "baked" into the
original model card PNG, instead of doing that dynamically each time.
Or maybe 3 or 4 variants of the PNG with different pre-baked background colors.
Hey, I deployed to Cloud Run with these optimizations: baked-in background color, alternative JPEG encoder.
It's way faster now! 🎉
Thank you for your help!
🥳