Hey Valentin, I've heard you're into Go and performance.
Would you take a look at my HOTAS image processing? Only if you have a bit of spare time.
Hi Ankur
What's HOTAS?
Hands On Throttle And Stick: they're controllers used in flight simulator games. My app generates HOTAS
reference images on the fly, server-side.
Oooh very cool. Sure, I'd be happy to help.
The repo is here: github.com/ankurkotwal/MetaRefCard
I'm worried about a memory leak as the memory seems to grow but not shrink.
Also the image generation is kinda slow, which incurs a mildly annoying user-facing latency.
Ooh I see you've already written idiomatic tests and benchmarks! That's awesome,
it will make performance exploration much more straightforward 📈
I've run tests with pprof and generated some call graphs... but it's still not crystal clear to me.
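For context, I collect the profiles from a benchmark, roughly like this; the function under test is a stand-in for the real one:

```go
package metarefcard_test

import "testing"

// Stand-in for the real image generation entry point.
func generateImage() {}

// Run with:
//   go test -bench=GenerateImage -cpuprofile=cpu.prof -memprofile=mem.prof
// then inspect the call graph with:
//   go tool pprof -web cpu.prof
func BenchmarkGenerateImage(b *testing.B) {
	for i := 0; i < b.N; i++ {
		generateImage()
	}
}
```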
Before I deploy to Cloud Run, I need to check whether the image generation is likely to OOM given the memory available to a regular instance.
I could pay more to get several GB, but a memory leak would still crash the instance eventually 🙄
OK, let's tackle this from 2 sides.
I'll read the code and try to reproduce the memory leak on my workstation.
Meanwhile could you deploy to Cloud Run (low memory), launch a few dozen requests, and see if you actually
experience any failures?
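A quick-and-dirty burst client is enough; the URL and request count here are placeholders:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// burst fires n concurrent GET requests at url and logs failures.
// A crude load generator, not a real benchmark harness.
func burst(url string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				log.Println("request failed:", err)
				return
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			if resp.StatusCode != http.StatusOK {
				log.Println("unexpected status:", resp.Status)
			}
		}()
	}
	wg.Wait()
}

func main() {
	burst("https://example.com/card", 50) // placeholder URL
}
```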
👌 Thank you!
So, I ran the tests with the -race flag, and it turns out there are some data races in the lazy init logic. I opened a PR.
It's a cultural thing in Go: if a program has data races, we pedantically regard it as broken 😂
We want to fix the races before improving anything else. In this case, sync.Once and sync.Map come to the rescue.
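The gist of the fix looks roughly like this; the config type, loader, and font cache are hypothetical stand-ins for the repo's real ones:

```go
package main

import (
	"sync"

	"golang.org/x/image/font"
	"golang.org/x/image/font/basicfont"
)

// Hypothetical config type and loader.
type Config struct{ BackgroundColor string }

func loadConfig() *Config { return &Config{BackgroundColor: "navy"} }

// Race-free lazy init: the first caller runs loadConfig exactly once;
// concurrent callers block until it's done, then all see the same value.
var (
	configOnce sync.Once
	config     *Config
)

func getConfig() *Config {
	configOnce.Do(func() { config = loadConfig() })
	return config
}

// Concurrency-safe cache of font faces, replacing an unsynchronized map.
var faceCache sync.Map // name -> font.Face

func loadFace(name string) font.Face { return basicfont.Face7x13 } // stub

func getFace(name string) font.Face {
	if f, ok := faceCache.Load(name); ok {
		return f.(font.Face)
	}
	// Racing goroutines may both call loadFace, but only one result
	// is kept; acceptable for an idempotent loader.
	f, _ := faceCache.LoadOrStore(name, loadFace(name))
	return f.(font.Face)
}

func main() {
	_ = getConfig()
	_ = getFace("default")
}
```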
Ah! Thanks, I'll merge it
When requests are processed sequentially, there doesn't seem to be any big memory usage.
E.g. this is using the Trace viewer: it never gets above 20MB
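For reference, the trace comes from the standard runtime/trace package, roughly like this; the instrumented code itself is elided:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

// Record an execution trace of whatever runs in main,
// then open it with `go tool trace trace.out`.
func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... process the requests under observation ...
}
```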
Correct. The problem occurs when requests are processed concurrently,
which may happen often in production, as each request takes several seconds to complete.
Indeed when I launch a burst of 50 concurrent requests, I can see the memory usage
climbing to 750MB
You know, sometimes the Go runtime allocates large chunks of memory that you actually need and
doesn't give them back quickly to the OS, and that's mostly OK? This doesn't
always mean your code has a memory leak.
Even if we explicitly trigger the garbage collection after each request, the program often doesn't seem to release any memory...
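By "explicitly trigger" I mean something like this diagnostic snippet, using runtime.GC plus debug.FreeOSMemory:

```go
package main

import (
	"log"
	"runtime"
	"runtime/debug"
)

// Diagnostic only: force a collection, ask the runtime to return
// freed pages to the OS, and log what the heap looks like.
func dumpHeapAfterGC() {
	runtime.GC()
	debug.FreeOSMemory()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	log.Printf("HeapAlloc=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB",
		m.HeapAlloc>>20, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	dumpHeapAfterGC()
}
```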
As long as you don't OOM, it's fine if your program is occupying more memory than
the amount your code's objects are actually using. It has some leeway. Maybe trust the GC? 🤷
I found this very suspicious at first, but okay, I'll let the OS and the Go runtime
haggle over the memory heap and hope for the best!
Also I haven't noticed any OOM in Cloud Run so far, so this is looking good 🤞
I had a look at the latency problem
You're drawing multiple text labels on the image in rectangles that don't
overlap, so I thought of doing that concurrently
Nah, I thought the same; unfortunately the Font Face objects are not thread-safe.
And we don't even need to try harder with more clunky synchronization patterns,
because the text drawing is actually not the bottleneck.
You're 100% right, we must focus on the actual bottlenecks, and "writing the
characters' pixels in the labels" is not one of them.
So, this is our CPU profile
18% of the total CPU time is spent encoding to JPEG.
Did you know about the drop-in encoder replacement github.com/pixiv/go-libjpeg/jpeg? It can encode 5 times as fast, shaving 15% off the total time. That's a quick win 😉
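The swap is mostly mechanical; note the package is cgo-backed, so it needs libjpeg (or libjpeg-turbo) installed. The quality setting below is my assumption; match whatever you're using today:

```go
package main

import (
	"bytes"
	"image"
	"io"
	"log"

	// Drop-in replacement for the stdlib image/jpeg encoder,
	// backed by libjpeg/libjpeg-turbo via cgo.
	"github.com/pixiv/go-libjpeg/jpeg"
)

// encodeJPEG writes img to w. Quality 90 is an assumption; use
// whatever the stdlib encoder was configured with before.
func encodeJPEG(w io.Writer, img image.Image) error {
	return jpeg.Encode(w, img, &jpeg.EncoderOptions{Quality: 90})
}

func main() {
	img := image.NewRGBA(image.Rect(0, 0, 64, 64))
	var buf bytes.Buffer
	if err := encodeJPEG(&buf, img); err != nil {
		log.Fatal(err)
	}
	log.Printf("encoded %d bytes", buf.Len())
}
```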
Updated CPU profile
Now trying to figure out why so much time (15s) is sunk in generateImages, while
(decode image + draw labels + encode image) only adds up to (3s + 1.1s + 0.6s) = 4.7s...
Alright, here's what's going on.
You have a configurable background color, so the process is:
1. create a blank RGBA image
2. fill it with the custom background color
3. draw the card model PNG on top (it has transparency)
4. draw the card text labels
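In code, the pipeline looks roughly like this; the context setup matches fogleman/gg, but the model image, font face, and label are placeholders:

```go
package main

import (
	"image"
	"image/color"

	"github.com/fogleman/gg"
	"golang.org/x/image/font/basicfont"
)

// Rough sketch of the generation pipeline; the real app loads an
// actual card model PNG and a real font face.
func renderCard(cardModel image.Image) *gg.Context {
	dc := gg.NewContext(1024, 768)         // 1. blank RGBA image
	dc.SetColor(color.RGBA{0, 0, 64, 255}) // 2. fill with the custom background color
	dc.Clear()
	dc.DrawImage(cardModel, 0, 0) // 3. draw the card model PNG on top (the slow step)
	dc.SetFontFace(basicfont.Face7x13)
	dc.DrawString("Fire missile", 100, 200) // 4. draw the text labels
	return dc
}

func main() {
	model := image.NewNRGBA(image.Rect(0, 0, 1024, 768)) // placeholder for the PNG
	_ = renderCard(model)
}
```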
For the 3rd step, the library fogleman/gg uses the BiLinear transformer, which is officially "slow" and incurs a lot of NRGBA/RGBA computations.
The choice of the BiLinear kernel in gg is not configurable.
I thought that mixing "non-alpha-premultiplied" with "alpha-premultiplied" might
incur some extra arithmetic burden. Then I realized BiLinear was the problem.
Even if there's no option for that, the library is open source, so we could hack it to use e.g. ApproxBiLinear instead and see if the quality is still high enough.
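If we bypassed gg for that step, the call would look something like this with golang.org/x/image/draw; everything here is a sketch, not the actual repo code:

```go
package main

import (
	"image"

	xdraw "golang.org/x/image/draw"
	"golang.org/x/image/math/f64"
)

// Draw the (transparent) model onto dst with the cheaper
// ApproxBiLinear interpolator instead of gg's hardwired BiLinear.
func drawModel(dst *image.RGBA, model image.Image, offsetX, offsetY float64) {
	// Translation-only affine transform: no scaling, no rotation.
	s2d := f64.Aff3{1, 0, offsetX, 0, 1, offsetY}
	xdraw.ApproxBiLinear.Transform(dst, s2d, model, model.Bounds(), xdraw.Over, nil)
}

func main() {
	dst := image.NewRGBA(image.Rect(0, 0, 1024, 768))
	model := image.NewNRGBA(image.Rect(0, 0, 1024, 768))
	drawModel(dst, model, 0, 0)
}
```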
That's an interesting rabbit hole
You know what? Choosing the background color is not a core feature of
my website. I can totally live with a default background color "baked" into the
original model card PNG, instead of doing that dynamically each time.
Or maybe 3 or 4 variants of the PNG with different pre-baked background colors.
Hey, I deployed to Cloud Run with these optimizations: baked-in background color, alternative JPEG encoder.
It's way faster now! 🎉
Thank you for your help!
🥳