Profile!
07 Aug 2012We all agree that premature optimization is bad, right? Why do some of us still do it?
I have some pixel art that I want to magnify, and I like the way that Maxim Stepin’s hq2x family of filters look. A quick search led me to the hqx project that implements the filter and gives you a command line tool to run it offline. Downloaded it, compiled it, and gave it a try:
jfischer@solace:~/hqx_test$ file test.png
test.png: PNG image, 16 x 24, 8-bit/color RGBA, non-interlaced
jfischer@solace:~/hqx_test$ time ~/bin/hqx test.png out.png
real 0m1.777s
user 0m1.680s
sys 0m0.096s
Ok, yeah, nearly 2 seconds to convert a 16x24 image isn’t working for me. This is supposed to be a realtime algorithm; there’s no reason it should be this slow. I took a brief look at the code and spotted what I thought was the culprit, but just to be sure I profiled the program.
First, I rebuilt the program with profiling enabled (-pg
for gcc):
jfischer@solace:~/hqx_test$ cd ~/hqx-read-only/src/
jfischer@solace:~/hqx-read-only/src$ gcc -o ~/bin/hqx -I. -lIL -pg *.c
Now whenever I run ~/bin/hqx
it’ll write out a file named gmon.out
in my
current working directory with the information the profiler gprof
needs. Let’s
give that a try:
jfischer@solace:~/hqx_test$ time ~/bin/hqx test.png out.png
real 0m1.719s
user 0m1.604s
sys 0m0.112s
jfischer@solace:~/hqx_test$ gprof ~/bin/hqx gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 1.51 1.51 1 1.51 1.51 hqxInit
0.00 1.51 0.00 3002 0.00 0.00 rgb_to_yuv
0.00 1.51 0.00 2978 0.00 0.00 Interpolate_2
0.00 1.51 0.00 2184 0.00 0.00 yuv_diff
0.00 1.51 0.00 1957 0.00 0.00 Interpolate_3
0.00 1.51 0.00 1162 0.00 0.00 Interp8
0.00 1.51 0.00 1121 0.00 0.00 Interp3
0.00 1.51 0.00 960 0.00 0.00 Interp6
0.00 1.51 0.00 530 0.00 0.00 Interp2
0.00 1.51 0.00 467 0.00 0.00 Interp7
0.00 1.51 0.00 450 0.00 0.00 Interp1
0.00 1.51 0.00 434 0.00 0.00 Diff
0.00 1.51 0.00 245 0.00 0.00 Interp5
0.00 1.51 0.00 1 0.00 0.00 hq4x_32
0.00 1.51 0.00 1 0.00 0.00 hq4x_32_rb
Yup, hqxInit
is what I thought the culprit was. Let’s take a look at that
function:
uint32_t RGBtoYUV[16777216];
HQX_API void HQX_CALLCONV hqxInit(void)
{
/* Initalize RGB to YUV lookup table */
uint32_t c, r, g, b, y, u, v;
for (c = 0; c < 16777215; c++) {
r = (c & 0xFF0000) >> 16;
g = (c & 0x00FF00) >> 8;
b = c & 0x0000FF;
y = (uint32_t)(0.299*r + 0.587*g + 0.114*b);
u = (uint32_t)(-0.169*r - 0.331*g + 0.5*b) + 128;
v = (uint32_t)(0.5*r - 0.419*g - 0.081*b) + 128;
RGBtoYUV[c] = (y << 16) + (u << 8) + v;
}
}
hqxInit
spends a second and a half at startup time to create a lookup table to convert
any 8-bit per channel RGB value to the equivalent YUV value. The rest of the
program finishes up pretty much instantaneously. It also uses 64 MB of memory to do
this. I can understand why the author (who is not Maxim Stepin, as far as I can
tell) would do this: hqx is made to be usable as a library for realtime
filtering in a larger program. For the command line tool though, this is
ridiculous.
For the command line tool usage, there’s a super easy way to speed this up:
just drop the lookup table and do the math to convert the RGB value to YUV every
time. Let’s give that a try and replace the existing rgb_to_yuv
function with
one that doesn’t use the lookup table:
/* Lookup table version
static inline uint32_t rgb_to_yuv(uint32_t c)
{
// Mask against MASK_RGB to discard the alpha channel
return RGBtoYUV[MASK_RGB & c];
}
*/
static inline uint32_t rgb_to_yuv(uint32_t c)
{
/* Mask the alpha channel out of the color */
c = MASK_RGB & c;
uint32_t r, g, b, y, u, v;
r = (c & 0xFF0000) >> 16;
g = (c & 0x00FF00) >> 8;
b = c & 0x0000FF;
y = (uint32_t)(0.299*r + 0.587*g + 0.114*b);
u = (uint32_t)(-0.169*r - 0.331*g + 0.5*b) + 128;
v = (uint32_t)(0.5*r - 0.419*g - 0.081*b) + 128;
return (y << 16) + (u << 8) + v;
}
Then remove the call to hqxInit
from the beginning of the program. With those
changes, the program runs the way I expected it to:
jfischer@solace:~/hqx_test$ time ~/bin/hqx -s 2 test.png out.png
real 0m0.016s
user 0m0.008s
sys 0m0.004s
jfischer@solace:~/hqx_test$ gprof ~/bin/hqx gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 3002 0.00 0.00 rgb_to_yuv
0.00 0.00 0.00 2184 0.00 0.00 yuv_diff
0.00 0.00 0.00 747 0.00 0.00 Interpolate_3
0.00 0.00 0.00 602 0.00 0.00 Interp2
0.00 0.00 0.00 549 0.00 0.00 Interp1
...
So for my use, it’s fixed. If the author really wants to keep the lookup table around, there are some other ways to approach it that would avoid the startup penalty. The first one that comes to mind is memoization: the first time you look up a particular RGB value, calculate its YUV equivalent and store it. Subsequent lookups for that value can skip the calculation and just retrieve the stored value. It’d look something like this:
extern uint32_t RGBtoYUV[16777216];
static inline uint32_t rgb_to_yuv(uint32_t c)
{
/* Mask the alpha channel out of the color */
c = MASK_RGB & c;
if (RGBtoYUV[c] == 0) {
uint32_t r, g, b, y, u, v;
r = (c & 0xFF0000) >> 16;
g = (c & 0x00FF00) >> 8;
b = c & 0x0000FF;
y = (uint32_t)(0.299*r + 0.587*g + 0.114*b);
u = (uint32_t)(-0.169*r - 0.331*g + 0.5*b) + 128;
v = (uint32_t)(0.5*r - 0.419*g - 0.081*b) + 128;
RGBtoYUV[c] = (y << 16) + (u << 8) + v;
}
return RGBtoYUV[c];
}
That way, you only generate lookup entries for the RGB values you actually use. You’re still burning 64 MB of memory for that table though, and your access pattern into it is probably going to be all over the place, wreaking havoc on the CPUs caches. Plus, this way you’re introducing a branch that wasn’t there before. You’d have to test and profile a bunch to determine the most performant way to do this for a particular application, but my gut feeling is that skipping the lookup table and just doing the math every time is the fastest way to go.