Silly benchmarks
It started with me being curious: if I use an Integer in Kotlin, am I going to be paying a penalty for using a boxed primitive? So I wrote some silly benchmarks to confirm. First, in Java:
On my system (a mid-2011 MacBook Pro, 2.4 GHz i5) the primitive version takes 1.77 seconds, and the boxed version takes 20.8 seconds. Then in Kotlin:
The Kotlin one takes 1.81 seconds. Tiny bit slower than the Java primitive one, but that’s probably just due to needing a little more time for Kotlin’s runtime to load. Kotlin does unboxed primitives properly, yay!
Now I’m curious though: how do the other languages I use on the regular perform? Let’s try Clojure first, both a straightforward implementation and one tailored to match the Java one better:
sum-test-straightforward
took 5.1 seconds, and sum-test-gofast
1.69 seconds. The gofast
one is comparable to the Java one, probably a little slower: I ran these at a REPL, so there’s
no startup time involved.
Ok, how about Common Lisp? I can think of 3 approaches to take off the top of my head.
Using ClozureCL, all 3 of these perform abysmally:
- sum-test-iterative: 62.98 seconds, 2.82 of which were GC time. 20 GiB allocated.
- sum-test-recursive: 74.11 seconds, 3.52 of which were GC. 20 GiB allocated.
- sum-test-loop: 50.7 seconds, 2.58 of which were GC. 20 GiB allocated.
SBCL does much better off the bat, but still not great:
- sum-test-iterative: 7.7 seconds, no allocation
- sum-test-recursive: 17.1 seconds, no allocation
- sum-test-loop: 7.63 seconds, no allocation
Adding some type annotations and optimize flags helped SBCL, but ClozureCL’s times stayed the same:
SBCL’s sum-test-iterative
drops down to 3.13 seconds, still no allocation. No change on
Clozure. I’m probably doing something wrong here, but it’s not clear to me what. The
disassembly of sum-test-iterative
on SBCL shows that there’s still an allocation going
on there: maybe the problem is just that 64-bit integers don’t work unboxed due to SBCL’s
pointer tagging?
* (disassemble 'sum-test-iterative)
; disassembly for SUM-TEST-ITERATIVE
; Size: 69 bytes. Origin: #x1002AF97D7
; 7D7: 31C9 XOR ECX, ECX ; no-arg-parsing entry point
; 7D9: 31C0 XOR EAX, EAX
; 7DB: EB2D JMP L3
; 7DD: 0F1F00 NOP
; 7E0: L0: 488BD1 MOV RDX, RCX
; 7E3: 48D1F9 SAR RCX, 1
; 7E6: 7304 JNB L1
; 7E8: 488B4AF9 MOV RCX, [RDX-7]
; 7EC: L1: 488BD0 MOV RDX, RAX
; 7EF: 48D1FA SAR RDX, 1
; 7F2: 4801D1 ADD RCX, RDX
; 7F5: 48D1E1 SHL RCX, 1
; 7F8: 710C JNO L2
; 7FA: 48D1D9 RCR RCX, 1
; 7FD: 41BB00070020 MOV R11D, #x20000700 ; ALLOC-SIGNED-BIGNUM-IN-RCX
; 803: 41FFD3 CALL R11
; 806: L2: 4883C002 ADD RAX, 2
; 80A: L3: 483B057FFFFFFF CMP RAX, [RIP-129] ; [#x1002AF9790] = FFFFFFFE
; 811: 7CCD JL L0
; 813: 488BD1 MOV RDX, RCX
; 816: 488BE5 MOV RSP, RBP
; 819: F8 CLC
; 81A: 5D POP RBP
; 81B: C3 RET
NIL
Next up, Swift:
Without optimizations, 16 minutes 50 seconds. Holy shit.
With optimizations, 1.11 seconds.
Ok, last one, C:
Why take in the count parameter from the command line? Because Clang cheats. If I use the constant in there, it’s smart enough to just precalculate the whole thing and just return the final result.
Without optimizations:
solace:sum-tests jfischer$ clang -o sum-test SumTest.c
solace:sum-tests jfischer$ time ./sum-test 2147483647
2305843008139952128
real 0m8.247s
user 0m8.190s
sys 0m0.035s
With optimizations:
solace:sum-tests jfischer$ clang -Os -o sum-test SumTest.c
solace:sum-tests jfischer$ time ./sum-test 2147483647
2305843008139952128
real 0m0.006s
user 0m0.002s
sys 0m0.002s
It turns out, Clang still cheats even if the loop counter comes from outside. I’m pretty sure it recognized what I’m doing and just turned that loop into Gauss’ trick for computing an arithmetic series. It doesn’t matter what loop count I give it, it always takes the same amount of time with optimizations.
I can’t read/write assembly, but playing around on godbolt.org makes it look like that’s the case: https://godbolt.org/g/FmL66q. (There’s no loop in the disassembly.) And I can’t figure out how to trick it into not doing that, so I’ll call it quits for now.