We all agree that premature optimization is bad, right? Why do some of us
still do it?
I have some pixel art that I want to magnify, and I like the way Maxim
Stepin’s hq2x family of filters looks. A quick search led me to the
hqx project, which implements the filter and gives you a command line
tool to run it offline. Downloaded it, compiled it, and gave it a try:
Ok, yeah, nearly 2 seconds to convert a 16x24 image isn’t working for me. This
is supposed to be a realtime algorithm; there’s no reason it should be this slow.
I took a brief look at the code and spotted what I thought was the culprit, but
just to be sure I profiled the program.
First, I rebuilt the program with profiling enabled (-pg for gcc):
Now whenever I run ~/bin/hqx it’ll write out a file named gmon.out in my
current working directory with the information the profiler gprof needs. Let’s
give that a try:
Yup, hqxInit is what I thought the culprit was. Let’s take a look at that
function:
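I won’t paste the whole thing here, but the idea is simple enough to sketch. Assuming hypothetical names and standard BT.601-style integer coefficients (the real hqx source differs in details like rounding and packing order), it does roughly this:

```c
#include <stdint.h>
#include <stdlib.h>

/* One 32-bit entry per 24-bit RGB value: 2^24 * 4 bytes = 64 MB. */
#define RGB_COUNT (1 << 24)

static uint32_t *rgb_to_yuv_table;

/* Integer BT.601-style RGB -> YUV conversion (illustrative coefficients). */
static uint32_t compute_yuv(uint32_t rgb)
{
    int r = (rgb >> 16) & 0xFF;
    int g = (rgb >> 8) & 0xFF;
    int b = rgb & 0xFF;

    int y = (299 * r + 587 * g + 114 * b) / 1000;
    int u = (-169 * r - 331 * g + 500 * b) / 1000 + 128;
    int v = (500 * r - 419 * g - 81 * b) / 1000 + 128;

    return (uint32_t)((y << 16) | (u << 8) | v);
}

/* Precompute the YUV equivalent of every possible 24-bit RGB value. */
void hqxInit(void)
{
    rgb_to_yuv_table = malloc(RGB_COUNT * sizeof *rgb_to_yuv_table);
    for (uint32_t c = 0; c < RGB_COUNT; c++)
        rgb_to_yuv_table[c] = compute_yuv(c);
}
```

Sixteen million conversions and a 64 MB allocation, all paid before a single pixel is filtered.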
hqxInit spends a second and a half at startup building a lookup table that converts
any 8-bit-per-channel RGB value to its YUV equivalent. The rest of the
program finishes practically instantaneously. The table also eats 64 MB of memory.
this. I can understand why the author (who is not Maxim Stepin, as far as I can
tell) would do this: hqx is made to be usable as a library for realtime
filtering in a larger program. For the command line tool though, this is
ridiculous.
For the command line tool usage, there’s a super easy way to speed this up:
just drop the lookup table and do the math to convert the RGB value to YUV every
time. Let’s give that a try and replace the existing rgb_to_yuv function with
one that doesn’t use the lookup table:
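A direct replacement might look like the following. I’m using the standard BT.601 integer approximation and packing the result as 0x00YYUUVV; the exact coefficients and packing in hqx may differ slightly:

```c
#include <stdint.h>

/* Convert a packed 0x00RRGGBB value to packed 0x00YYUUVV directly,
 * no lookup table required. */
uint32_t rgb_to_yuv(uint32_t rgb)
{
    int r = (rgb >> 16) & 0xFF;
    int g = (rgb >> 8) & 0xFF;
    int b = rgb & 0xFF;

    int y = (299 * r + 587 * g + 114 * b) / 1000;
    int u = (-169 * r - 331 * g + 500 * b) / 1000 + 128;
    int v = (500 * r - 419 * g - 81 * b) / 1000 + 128;

    return (uint32_t)((y << 16) | (u << 8) | v);
}
```

A handful of multiplies and shifts per pixel, which any modern CPU will chew through far faster than it can fault in a cold 64 MB table.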
Then remove the call to hqxInit from the beginning of the program. With those
changes, the program runs the way I expected it to:
So for my use, it’s fixed. If the author really wants to keep the lookup table
around, there are some other ways to approach it that would avoid the startup
penalty. The first one that comes to mind is memoization: the first
time you look up a particular RGB value, calculate its YUV equivalent and store
it. Subsequent lookups for that value can skip the calculation and just retrieve
the stored value. It’d look something like this:
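Sketched in C with hypothetical names: a sentinel value marks entries that haven’t been computed yet (a valid packed 0x00YYUUVV value never has its top byte set, so 0xFFFFFFFF is safe), and the table is allocated lazily on first use:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RGB_COUNT (1 << 24)
#define UNSET 0xFFFFFFFFu

static uint32_t *yuv_cache;

/* Same direct BT.601-style integer conversion as before. */
static uint32_t compute_yuv(uint32_t rgb)
{
    int r = (rgb >> 16) & 0xFF;
    int g = (rgb >> 8) & 0xFF;
    int b = rgb & 0xFF;

    int y = (299 * r + 587 * g + 114 * b) / 1000;
    int u = (-169 * r - 331 * g + 500 * b) / 1000 + 128;
    int v = (500 * r - 419 * g - 81 * b) / 1000 + 128;

    return (uint32_t)((y << 16) | (u << 8) | v);
}

uint32_t rgb_to_yuv(uint32_t rgb)
{
    rgb &= 0xFFFFFF;
    if (yuv_cache == NULL) {
        /* Still 64 MB, but filled lazily instead of up front. */
        yuv_cache = malloc(RGB_COUNT * sizeof *yuv_cache);
        memset(yuv_cache, 0xFF, RGB_COUNT * sizeof *yuv_cache);
    }
    if (yuv_cache[rgb] == UNSET)   /* the extra branch on every lookup */
        yuv_cache[rgb] = compute_yuv(rgb);
    return yuv_cache[rgb];
}
```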
That way, you only generate lookup entries for the RGB values you actually use.
You’re still burning 64 MB of memory for that table though, and your access pattern
into it is probably going to be all over the place, wreaking havoc on the CPU’s
caches. Plus, this way you’re introducing a branch that wasn’t there before. You’d
have to test and profile a bunch to determine the most performant way to do this
for a particular application, but my gut feeling is that skipping the lookup table
and just doing the math every time is the fastest way to go.