My JSConf 2012 talk

Last week I travelled to US for the first time in my life to give a talk about V8 at JSConf 2012. The conference absolutely fantastic and allowed me to meet a lot of different people. If you know any other conference that would allow me to chat with Richard Hudson about transactional memory applications for concurrent GCs, listen to Dan Ingalls talk, discuss compiler design with people from Mozilla and Microsoft and make s’mores with Rickard Falkvinge - give me a call and I’ll buy some tickets :-)

But there is another reason why JSConf was an invaluable experience for me: I like talking and sharing knowledge but both preparing and speaking is hard for me. Partially because I am not exactly very good in speaking English - my thoughts and jokes-cortex are accustomed to immense flexibility of Russian grammar. It’s also always hard to fit both useful information and some entertainment into 30 minutes talk.

Yes, I do think that any talk should be both useful and entertaining. Some talks are entertaining but not useful. Some are highly useful but not entertaining. The balance is crucial.

When I was preparing my talk for JSConf 2012 I decided that I don’t want to speak about V8 basics again: hidden classes, number representations, optimizing vs. non-optimizing compiler, flags to trace optimizations, profiler, generic performance advices. There are numerous presentations on this: from V8 developers mine, Daniel Clifford’s, from Chrome DevRel team Lilly Thomson’s and from Mozilla SpiderMonkey team David Mandelin’s. There are also numerous posts (some on this blog, some in Andy Wingo’s) plus +Florian Loitsch started turning notes he took while working on Dart to JavaScript compiler into blog posts. I highly recommend reading them. 

Thus I considered foundations well covered and decided to choose a different goal for my talk: I wanted to encourage people to look under V8’s hood. Following this decision I called my talk “Can V8 do that?!” and tried to demonstrate that V8 assembly and HIR can be quite readable with some practice and that they hide a lot of wonders. I don’t yet know if I succeeded or failed, only future will tell, but I will monitor v8-dev & v8-users mailing lists for questions about V8 internals and incoming patches :-)

I also have a suspicion that there are still might be some misunderstandings (e.g. if you think that V8 can always hoist expression like a.length from the loop please raise your hand in comments!). Nevertheless… I will continue to work on my ability to convey knowledge in an entertaining way and I am looking forward to my next talk: 1h with V8 internals in the beautiful Oslo at Web Rebels

MY JSCONF 2012 SLIDES

Comments
“I want to optimize my JS application on V8” checklist

Samurais had something called bushido, way of the warrior, code of conduct they had to follow. In similar manner you have to follow certain opt-dō if you want to optimize your application. I have tried to sketch such a path in my nodecamp.eu talk “Understanding V8” and +Daniel Clifford tried to do the same in “V8 Performance Tuning Tricks” talk on GDD11 in Berlin. But not everybody has seen those talks and the question keeps coming back again. So I decided to write down a quick check list for developers who want to optimize their apps. tl;dr version of my checklist is: “Understand before you act”. 

Do you understand what the application is trying to do and how?

The more you understand about your app the better you can optimize it. Sometimes a tricky algorithm or a cache placed in right place will yield more improvements than any local tweaking. Understanding your application in large is a very difficult problem which requires special tooling and discipline. I highly recommend to read @coda’s Metrics Metrics Everywhere talk if you want to get a glimpse of that world. Sometimes it is possible to split big application into pieces and optimize them separately but there is no guarantee that overall gain will not be lost when those pieces are connected back together.

Did you profile your application with built in statistical profiler?

Profiling helps to discover obvious hot spots. Don’t waste time rewriting places that occupy 0.0001% of running time. Concentrate your efforts on those that are high on the profile. If you are using V8’s tick processors keep in mind that LazyCompile: prefix does not mean that this time was spent in compiler, it just means that the function itself was compiled lazily. Statistical profiler is not the most accurate tool in the world and might miss overheads that are finely spread across execution (as sampling interval is 2ms). Tools like dtrace, perf, Instruments, VTune might provide a more fine grained picture but they do not necessarily have support for JITed code (see below).

JavaScript function is high on the profile

Ensure that this function is optimized and Crankshaft friendly. V8’s tick processing scripts mark optimized functions with * (asterisk) and non-optimized with ~ (tilda). You can also use --trace-opt --trace-deopt --trace-bailout flags to see what Crankshaft does with your program. Deoptimizations happen when assumptions made by the compiler does not match program’s runtime behavior, bailouts happens when compiler can’t compile the function with optimizations for some reason. If you want to understand ideas behind V8 optimization pipeline I recommend to start by reading Andy Wingo’s A Tale of Two Compilers post.

In general it’s a good idea to know more about modern JavaScript VMs, especially their strengths and weaknesses. I recommend going through David Mandelin’s talk Know Your Engines. This talk stresses a very important aspect of modern JS performance: fastest application is the one that is essentially statically typed in it’s nature.

Modern JavaScript VMs try to grasp “static” structure hidden inside dynamic JS code by utilizing hidden classes and inline caches. Take a look at my slide deck to get a basic understanding of how those hidden classes are built and used.  

For V8 it is also important to check how you store floating point numbers (and integers that exceed 31-bit range in case of ia32 version of v8) and use WebGL typed arrays if appropriate. These days V8 tries to adapt generic arrays’ storage to the data you store in them, but understanding whether those optimizations kicked in or not might be difficult; thus I just recommend using typed arrays.

GC is high on the profile

Try to understand what your are allocating and (more important) what survives several GCs. The worst kind of object is the one that survives a couple of partial (aka scavenge) collections and then gets thrown away. This kind of workload is the most stressful for GC because it has to copy young objects around constantly. Objects that live long are less stressful (but you have to keep in mind that GC cost is proportional to the number of live objects). The best kind of object is the one that dies shortly after it’s allocation. You can use --trace-gc to see GC pauses and you can use built in heap snapshots to figure out what takes space in your heap. [it might be hard or impossible to capture “middle-aged” garbage with heap snapshots because V8 does full garbage collection before taking snapshot thus effectively killing all such garbage].

JS natives are high on the profile

When I say _natives_ I mean built in methods of String/Number/Boolean/RegExp/JSON and global functions like parseInt etc. Here you can’t optimize anything directly but you can try to figure out two things:

  • Try calling them less by changing your algorithms and/or fusing them into it. Some of those methods are very generic (e.g. forEach). Some can be fused with your functions (e.g. you have to parse integer contained in some stream: you can either build a temporary string character by character and pass it to parseInt or you can fuse parsing and reading from a stream; later is better)
  • Is there some obvious performance problem with them? V8’s implementation of the native method can be suboptimal. If you see a bug (or you suspect that it can be improved) please file a bug or write a question to v8-users mailing list.

Some strange V8 internals are high on the profile

In this case you can either read V8’s source or send a question to v8-users list.

A lot of time is spent in your C++ code

Sorry this is out of scope. Consult C++ optimization guides :-)

Do you feel that V8’s statistical profiler misses hotspot?

Your best bet then is either hardware counters based tool like Linux perf for which V8 has support (see v8/tools/ll_prof.py --help for more details) or trying to spot anomalies by some sort of software counters based profiling. V8 has it’s own simple software counters subsystem (try passing --native-code-counters --dump-counters to d8 shell).

Do you want to go deeper?

If you feel that generated code is slow and you can improve it you should definitely check it out using flags --print-code --code-comments. You can also dump IR used by optimizing compiler with --trace-hydrogen. IR will be written into hydrogen.cfg file that can be viewed by C1 Visualizer.

Are you still lost?

Drop a line to me or better to v8-users mailing list. Try your best to provide as much context as possible (a standalone JS benchmark is the best way). It’s nearly impossible to diagnose performance problems based on vague descriptions of what you are trying to achieve and how slow it runs.

There is something to be learned from a rainstorm. When meeting with a sudden shower, you try not to get wet and run quickly along the road. But doing such things as passing under the eaves of houses, you still get wet. When you are resolved from the beginning, you will not be perplexed, though you still get the same soaking
Hagakure by Yamamoto Tsunetomo.

Similarly you have to be resolved from the beginning when you want to optimize your app. Randomly tweaking things in panic here and there does not help.

Understand before you act.

Comments
The trap of the performance sweet spot

Disclaimer: This is my personal blog. The views expressed on this page are mine alone and not those of my employer.

This post is about JavaScript performance but I would like to start it by telling a story that might seem unrelated to JS. Please bear with me if you don’t like C.

A story of a C programmer writing JavaScript

Mr. C. is a C programmer as you can probably guess from his name. Today he was asked by his boss to write a very simple function: given an array of numbered 2d points calculate vector sum of all even numbered points... He opens his favorite text editor and quickly types something like (I’ll be intentionally skipping some #include boilerplate):

typedef struct {
  int32_t n;
  double x;
  double y;
} Point;

Point arrayofpoints_sum(Point* points, size_t n) {
  Point sum = { 0, 0, 0 };

  for (size_t i = 0; i < n; i++) {
    if ((points[i].n & 1) == 0) {
      sum.x += points[i].x;
      sum.y += points[i].y;
    }
  }

  return sum;
}

Oh, that was pretty easy… There are still some time left, so mr. C. throws in some array creation and benchmarking code:

Point* arrayofpoints_create(size_t n) {
  Point* points = (Point*) malloc(n * sizeof(Point));
  for (size_t i = 0; i < n; i++) {
    points[i].n = i;
    points[i].x = i * 0.1 + 0.1;
    points[i].y = i * 0.9 - 0.1;
  }
  return points;
}

void arrayofpoints_free(Point* array) {
  free(array);
}

static uint64_t now() {
  struct timeval t;
  gettimeofday(&t, NULL);
  return ((uint64_t) t.tv_sec) * 1000000 + (uint64_t) t.tv_usec;
}

void test_arrayofpoints() {
  static const size_t kArraySize = 10000;
  static const size_t kIterations = 10000;

  uint64_t create_total = 0;
  uint64_t sum_total = 0;

  for (size_t i = 0; i < kIterations; i++) {
    uint64_t t1 = now();
    Point* array = arrayofpoints_create(kArraySize);
    uint64_t t2 = now();
    Point sum = arrayofpoints_sum(array, kArraySize);
    uint64_t t3 = now();
    assert((int)sum.x == 2500000);
    assert((int)sum.y == 22495000);
    assert(sum.n == 0);
    arrayofpoints_free(array);

    create_total += (t2 - t1);
    sum_total += (t3 - t2);
  }

  printf("create: %lld [%.2f per iteration] usec, sum: %lld [%.2f per iteration] usec\n",
         create_total, (double)create_total / kIterations,
         sum_total, (double)sum_total / kIterations);
}

int main(int argc, char* argv[]) {
  test_arrayofpoints();
  return 0;
}

His code seems to compile and run just fine:

% gcc --std=c99 -O3 -o point point.c
% ./point
create: 508075 [50.81 per iteration] usec, sum: 214779 [21.48 per iteration] usec

Suddenly mr. C gets a call from his boss. “We’ve decided to rewrite our apps in JavaScript. It is pretty fast nowdays, ya know.” says the boss and hangs up. Mr. C. does not like JavaScript as much as he likes C but he is one of those programmers who can write programs in any language:

function Point(n, x, y) {
    this.n = n;
    this.x = x;
    this.y = y;
}

function sumArrayOfPoints(points) {
    var sum = new Point(0, 0, 0);

    for (var i = 0; i < points.length; i++) {
        if ((points[i].n & 1) === 0) {
            sum.x += points[i].x;
            sum.y += points[i].y;
        }
    }

    return sum;
}

function createArrayOfPoints(n) {
    var points = new Array(n);
    for (var i = 0; i < n; i++) {
        points[i] = new Point(/* n */ i, /* x */ i * 0.1 + 0.1, /* y */ i * 0.9 - 0.1);
    }
    return points;
}

function now() {
    return Date.now() * 1000;
}

function assertStrictlyEqual(expected, value) {
  if (expected !== value) {
    throw new Error("Assertion failed: expected " + expected + " got " + value);
  }
}

function testArrayOfPoints() {
    var kArraySize = 10000;
    var kIterations = 10000;

    var createTotal = 0;
    var sumTotal = 0;

    for (var i = 0; i < kIterations; i++) {
        var t1 = now();
        var array = createArrayOfPoints(kArraySize);
        var t2 = now();
        var sum = sumArrayOfPoints(array);
        var t3 = now();
        assertStrictlyEqual(2500000, sum.x | 0);
        assertStrictlyEqual(22495000, sum.y | 0);
        assertStrictlyEqual(0, sum.n);

        createTotal += (t2 - t1);
        sumTotal += (t3 - t2);
    }

    console.log("create: " + createTotal + " [" + (createTotal / kIterations) + " per iteration] usec," +
                " sum: " + sumTotal + " [" + (sumTotal / kIterations) + " per iteration] usec\n");
}

testArrayOfPoints();

That was very easy, but the numbers don’t look as impressive as mr. C. or his boss expected:

% ~/src/v8/d8 point1.js
create: 4629000 [462.9 per iteration] usec, sum: 2056000 [205.6 per iteration] usec

JavaScript program is approximately 10 times slower than original C program. Hmm thinks mr. C. and starts digging.

If C is a stone then JavaScript is a fog

C is a low-level programming language. People call it a portable assembly sometimes. You just take a piece of memory and say: hey, this smallish sequence of 24 bytes is actually a Point which has one uint32_t field followed by two double fields.

Small chunk of unused memory between n and x appears to make x and y fields nicely aligned. CPUs ♥ aligned doubles. On some architectures you can’t even read an unaligned one directly from memory into a floating point register (you’ll get punished if you try).

Arrays in C are also nice and tight: you just take a bigger region of memory and assume that it contains a sequence of Points one after another.

But in JavaScript things are destined to get hairy. You can’t take a flat piece of memory and call it Point with this and that as fields. Instead you have an extremely flexible thing called Object. Object can have properties which can come and go as they wish (almost). You can write constructor:

function Point(n, x, y) { /* ... */ }

but nothing actually prevents you from modifying Point instances after they were created. VM that runs JavaScript has to be always ready for virtually anything that can happen with your object (new properties added, old properties deleted, property descriptor modified). Because of this flexibility it can’t pack Point objects into 24 bytes and has to go with something like this (example layout used by V8):

V8 actually tries to make the object more struct like. It attempts to figure out how much space should be preallocated directly inside the object for properties by looking at the constructor and/or doing allocation profiling. Some VMs just use a fixed constant or even store all properties outside of an object in a separate array. Another thing to notice is that V8 always boxes numbers that are not 31-bit integers (32-bit on x64). Some VMs (e.g. SpiderMonkey and JSC) don’t box doubles but instead use NaN-tagging instead. 

Nevertheless Point object will be 3.3 times as big on x64 version of V8 than it was in C program and it will not be continuously allocated in memory (doubles are accessed via indirection). There are three additional fields: pointer to hidden class (known as Map in V8, Shape in SpiderMonkey, Structure in JavaScriptCore) which captures and describes layout of the object, pointer to out of object properties backing store and pointer to elements backing store. All these fields are required because object can be mutated in an unpredictable ways after it’s creation. 

Array of points object looks basically the same:

Array is a JavaScript object and thus it also has hidden class, pointer to the properties backing store and pointer to the elements backing store. Backing stores can have different representations. For example VMs usually try to represent sparse arrays as dictionaries and non-sparse arrays should have flat backing stores. In our program array of points is always non-sparse and it gets a flat backing store. 

The mathematics of performance is simple: each point of flexibility and each indirection (arrow on the picture) adds additional runtime cost. Evaluating simply looking expression points[i].x in JavaScript requires answering to an expensive questions: “What is points? How do I get property called i from it? How then I get x?”. Fully statical analysis of the program is too expensive to be used to answer these questions. Instead modern JITs reduce these costs by adapting generated code to the program while it runs via techniques like inline caching and adaptive compilation pipelines. Simply speaking: because in JavaScript an application can’t tell compiler points is an array of Point objects” via types declarations JIT has to eavesdrop to notice that. 

If you want to get good performance you need to speak loud enough for JIT compiler to hear you and adapt to your application. This means: your objects should have stable layouts, property access and call sites better be monomorphic, no funny language features (e.g. with or eval) in sight, using typed arrays where appropriate instead of pure JS arrays etc. In a sense you’ll be writing statically typed code in a language that does not support static typing. 

“Wait a second!”  somebody might say here “Mr. C’s code above is pretty damn type stable and yada-yada. But it’s still 10x times slower! Can he make it faster?”

Yes, he can but he will have to

Write JavaScript as if it’s assembly generated by C compiler

To get as close as possible to raw memory from JavaScript we have to use WebGL typed arrays. It’s easy for a JIT to befriend those guys and optimize the hell out of reads and writes, because they have a nice semantics for their backing stores: no nasty holes leading to prototype lookups, all elements have known primitive type and no boxing is required.

var blocks = [];

function Block(size) {
    this.size = size;
    this.buf = new ArrayBuffer(this.size);
    this.i32 = new Int32Array(this.buf);
    this.f64 = new Float64Array(this.buf);
}

function malloc(N) {
    if (blocks[N] && blocks[N].length) return blocks[N].pop();
    return new Block(N);
}

function free(addr) {
    (blocks[addr.size] || (blocks[addr.size] = [])).push(addr);
}

This is a very naïve implementation (I would even say parody) of the famous malloc&free duo. It does not try to optimize memory usage at all but it is perfect for our demonstration.

var $stack = malloc(8 * 1024);
var $sp = 0;

Original C program returns sum on the stack so will have our toy stack to mimic that.

function createArrayOfPoints(n) {
    var points = malloc(24 * n);
    var points_i32 = points.i32;
    var points_f64 = points.f64;
    for (var i1 = 0, i2 = 1, i = 0; i < n; i++, i1 += 6, i2 += 3) {
        points_i32[i1]     = /* n */ i; 
        points_f64[i2]     = /* x */ i * 0.1 + 0.1; 
        points_f64[i2 + 1] = /* y */ i * 0.9 - 0.1;
    }
    return points;
}

Creating array of points is simple: you malloc an array and fill it. It might be hard to read this code and see how it relates to the first JavaScript version or even to the original C function but that’s because it actually tries to mimic one of the optimizations every good C compiler tries to do: strength reduction. Calculating points[i] as (byte*)points + sizeof(Point) * i might be wasteful because multiplication is expensive compared to addition so good C compiler will try to replace points[i] with a new artificial induction variable p' that starts as p' = points[i] and is advanced as p += sizeof(Point).

function sumArrayOfPoints(points, n) {
    var x = 0;
    var y = 0;

    var points_i32 = points.i32;
    var points_f64 = points.f64;
    for (var i1 = 0, i2 = 1, k = 6 * n; i1 < k; i1 = i1 + 6, i2 = i2 + 3) {
        if ((points_i32[i1] & 1) === 0) {
            x += points_f64[i2]; 
            y += points_f64[i2 + 1];
        }
    }

    // Caller should have reserved space for the return value.
    var retval_addr = $sp - 24;
    $stack.i32[retval_addr >> 2] = 0;
    $stack.f64[(retval_addr + 8) >> 3] = x;
    $stack.f64[(retval_addr + 16) >> 3] = y;
    return retval_addr;
}

Summing loop was translated from C in the same way as initialization loop in createArrayOfPoints but there is another important classical optimization demonstrated here: scalar replacement. Smart compiler sees (via escape analysis usually) that Point sum does not have to exist as a continuos entity on the stack. Instead it can be exploded and its fields becomes separate variables (register allocator later puts them into registers). Result of the sumArrayOfPoints is returned on the “stack” just like in the original C version.

The rest of code is unchanged but minor changes are required in the testArrayOfPoints function:

        var t1 = now();
        var array = createArrayOfPoints(kArraySize);
        var t2 = now();
        $sp += 24;  // Reserve space for return value on the "stack".
        sumArrayOfPoints(array, kArraySize);
        $sp -= 24;
        var sum_n = $stack.i32[$sp >> 2];
        var sum_x = $stack.f64[($sp + 8) >> 3];
        var sum_y = $stack.f64[($sp + 16) >> 3];
        var t3 = now();
        assertStrictlyEqual(2500000, sum_x | 0);
        assertStrictlyEqual(22495000, sum_y | 0);
        assertStrictlyEqual(0, sum_n);
        free(array);

We have to take care of managing stack space for stack allocated sum object and we should remember to free allocated array of points. We are using JavaScript as assembly after all…

% ~/src/v8/d8 point.c.js
create: 532000 [53.2 per iteration] usec, sum: 988000 [98.8 per iteration] usec

As you can see in exchange for completely unreadable JavaScript code we got a nice 9x speedup on creation and 2x speedup on sum calculation. 

If you are curious enough to look at the code V8 generates for the sum loop and compare it with code generated from C by GCC you will notice that while it is pretty tight V8 does not understand that i32 and f64 are actually the same array and treats them separately. There are also bounds checks that are not there for C code.

Don’t be charmed by the sweet spot

Roughly a week ago a friend of mine sent me a link to Broadway.js accompanied by numerous expletives to better convey his excitement with this project and emscripten in general: “Look how mighty fast JavaScript became, old chap!”. 

My friend has fallen into the logical trap without even noticing it. Good performance demonstrated by some inhumanly written piece of code (e.g. translated from C via optimizing compiler) does not prove that “JavaScript is already fast enough” unless you are planning on writing your webapps in C for the rest of your life or you are capable of writing such inhuman code yourself just like mr. C. Instead it proves that “Certain features of JavaScript when used in certain way lead to good enough performance.” Efficiency on a tight computational kernel does not necessarily translate into efficiency on different workloads. I call this the trap of the sweet spot.

It’s not JavaScript that becomes faster and faster. It’s JavaScript’s sweet spot. 

You can hit the sweet spot, no doubt about that. But you have to follow rules to hit it. And those rules are strict (application runs best when it’s actually pretty static in it’s behavior) and sometimes unknown (different VMs rely on different heuristics). And hitting the sweetest part requires some strange engineering decisions as demonstrated by the mr. C.’s example above.

Obviously JavaScript VMs are nowhere near the end of the performance improvements road and the language itself is going to get features like binary data that would make hitting sweet-sweet spot much easier. But I doubt that the language as whole is going to become performance sweet.

Real world performance is not as shallow as some micro-benchmark score. It has at least three very important facets: speed, memory, VM’s complexity.

If you look at existing JavaScript VMs you will notice that they have to pay a lot in terms of complexity to overcome ineffective execution or memory utilization. And implementation complexity is the worst kind of complexity in the world: it leads to subtle bugs that occur on 0.000001% of programs, it leads to unpredictable and hard to analyze performance, it leads to volumes of rulebooks that specify how to appease the compiler.

JavaScript is a complex and expensive language in it’s very core. Keep that in mind and don’t be charmed by the sweet spot. 

But I must admit: working on a JavaScript VM and making sweet spot larger and easier to hit is an interesting challenge that make one burst with excitement and stresses the brain. :-)

Comments
Dangers of cross language benchmark games

Disclaimer: This is my personal blog. The views expressed on this page are mine alone and not those of my employer.

Every time I stumble upon yet another cross language benchmark comparison I remember Fight Club. No, seriously! What can be more enjoyable than watching your favorite language pummel not-so-favorite languages? Developers love cross language benchmark games. There is no doubt about that. But they tend to forget the first rule of benchmark club:

The first rule of benchmark club is you do not talk about benchmark club jump to conclusions.

A couple of weeks ago one of my friends sent me a link to Programming Language Benchmarks by Attractive Chaos, a younger brother of The Computer Language Benchmarks Game maintained by Isaac Gouy, and asked for some insight into V8’s performance on matmul_v1.js benchmark. 

So I cloned official PLB repo, slightly patched matmul_v1.js to make it work with V8’s shell instead of d8 (which I do not use) and started digging. 

% svn info $V8 | grep Revision
Revision: 7810
% time $V8/shell matmul_v1.js
-95.58358333329998
$V8/shell matmul_v1.js  11.16s user 0.15s system 100% cpu 11.264 total

(measurements for this post were done on V8 r7810 so results are slightly different from the ones for r7677 shown in PLB table)

Yikes. 11s are not very exciting especially when PyPy and LuaJIT2 show 8.5s and 2.7s respectively. 

Foes of number representation

The only number type JavaScript and Lua provide is double precision floating point and requires 64 bits of memory. Thus VM implementers have a choice either to make every slot (local variables, objects properties, array elements) wide enough to contain a 64-bit double or to store 64-doubles as boxed values.

V8 goes with the latter approach and boxes every number that does not fit into 31-bit integer range. Here is for example a backing store of an array [3.14, 1, 2.71, 2] on 32-bit architecture:

LuaJIT2 (as well as JSC and SpiderMonkey) on the other side uses a technique called NaN-tagging, which allows to store both pointers (or other 32-bit values) and doubles in 64-bit wide slots. Here is how the same array will look like in these VMs:

V8’s approach allows to save heap space when numbers mostly fit into 31-bit integer range but it also incurs overhead of double indirection and extra allocation when application starts working with dense arrays of floating point numbers. Fortunately there is a solution: WebGL typed arrays

- It should be noted here that V8’s optimizing backend is able to keep local double values and temporaries on the stack and in xmm registers. Boxing only occurs when the value escapes optimized code (e.g. is stored to a property, passed as an arguments or returned from a function). Non-optimized code always works with boxed numbers.

Float64Array

Typed array constructed with new Float64Array(N) behaves just like a normal JS object almost in every aspect:

% $V8/shell
V8 version 3.3.5 (candidate)
> arr = new Float64Array(10)
[object Object]
> arr.prop = "this is prop";
this is prop
> Object.keys(arr)
0,1,2,3,4,5,6,7,8,9,length,BYTES_PER_ELEMENT,prop
> arr.prop
this is prop

The only observable difference is in semantics of indexed properties. This specialized semantics allows V8 to use unboxed backing stores for such typed arrays.

Will it be cheating to use such specialized data type? I do not think so. We are trying to get some real world data and educate programmers about strength and weaknesses of a particular VM. Nobody uses std::list<std::list<double> > to represent dense matrices in C++. The same applies to all participating languages. For example PyPy’s version of matmul benchmark matmul_v2.py uses module array, which provides Python’s equivalent of typed arrays. Mike Pall contributed matmul_v2.lua that relies on a low-level FFI type which is not even safe unlike it’s Python and JavaScript counterparts:

% ./luajit-2.0/src/luajit
LuaJIT 2.0.0-beta6 -- Copyright (C) 2005-2011 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc fuse
> ffi = require 'ffi'
> arr = ffi.new 'double[10]'
> for i = 0, 10000 do arr[i] = 0 end
zsh: bus error  ./luajit-2.0/src/luajit

So here is matmul_v2.js a slightly altered version of matmul_v1.js that uses Float64Arrays to store matrix rows. Some would probably notice that I did not even bother to manually hoist row’s loads like somebody did for matmul_v1.js, because V8’s optimizing backend is able to perform loop invariant code motion in this case.

% time $V8/shell matmul_v2.js
-95.58358333329998
$V8/shell matmul_v2.js  2.61s user 0.07s system 101% cpu 2.650 total

Using the right representation, as you can see, gives a nice 4.2x speedup which puts V8 pretty close to LuaJIT2 and low-level statically typed languages like C.

But LuaJIT2 is still faster! Why?

Indeed. LuaJIT2 is still faster.

% time $LUAJIT/luajit matmul_v1.lua 1000
-95.5835833333
$LUAJIT/luajit matmul_v1.lua 1000  2.26s user 0.05s system 99% cpu 2.325 total

There are two main reasons:

  • V8 is missing array bounds check elimination. LuaJIT2 is able to hoist bounds checks out of the hot loop, V8 does not try to do that.
  • V8 has termination and debugging API while LuaJIT2 does not. For example any script on the webpage can be interrupted from Chrome DevTools by the pause button. To support these APIs V8 has to insert an interruption check on the backedge of every loop.

When did you have to multiply two matrices last time?

Most of programmers don’t have to deal with matrix multiplication, signal processing and DNA sequencing every day, but it’s easy to forget about that when interpreting results of cross language benchmark games, especially if your favorite language implementation is da winner.

But it’s really dangerous to base your language choice on such comparisons as they are pretty much one sided (tight numeric loops) and do not cover all important facets of a language implementation.  

Out of curiosity I’ve rewritten DeltaBlue, constraint solver written in a classical object-oriented style, included into V8 Benchmark Suite from JavaScript into Lua.

When porting benchmark’s code from JavaScript to Lua I’ve used metatables to implement object hierarchy. This approach is widespread in Lua community and is described in the book written by language authors. DeltaBlue does not use any weird JavaScript features so rewrite was very smooth and mostly done by Emacs’s “Replace Regexp” :-) I’ve used Lua for 5 years so I am also pretty sure that result is as close to idiomatic Lua as possible. Original benchmark uses global namespace, which is considered bad taste in Lua, so I created two Lua versions: one directly converted from JavaScript (global variables became global variables) and one with all global variables turned into local variables.

Surprisingly LuaJIT2 does not perform very well on this benchmark:

% time $LUAJIT/luajit deltablue-10000iterations.lua globals
$LUAJIT/luajit deltablue-10000iterations.lua globals  55.03s user 0.09s system 99% cpu 55.630 total
% time $LUAJIT/luajit deltablue-10000iterations.lua locals
$LUAJIT/luajit deltablue-10000iterations.lua locals  26.72s user 0.04s system 99% cpu 26.824 total
% time $V8/shell deltablue-10000iterations.js
$V8/shell deltablue-10000iterations.js  4.64s user 0.04s system 89% cpu 5.213 total

As you can see V8 is roughly 5-12x times faster. Does it mean that LuaJIT2 is a worse compiler? Definitely not. It’s probably just not tuned for this kind of workload. (There is also a slight possibility that I screwed somewhere when doing search&replace operations).

Conclusion

Everything has it’s own strength and weaknesses. VMs, compilers and even benchmark suites. There is no simple answer when the question is “which language is better?”. Even more: to compare two languages you need to be adept in both. 

In the beginning of this post I’ve mentioned the first rule of benchmark club and I think it’s appropriate to conclude by reminding you about the second rule.

The second rule of benchmark club is you DO NOT jump to conclusions.

Comments
Improved V8 external arrays support and nodejs Buffer type

Recently V8’s optimizing pipeline (Crankshaft) got full support for external arrays (aka WebGL typed arrays). One of core NodeJS types - Buffer - is exposed to JavaScript code as an external unsigned byte array so I decided to do a small unscientific benchmark to see the improvement with my own eyes:

var LONG_STRING = "qwertyuiop";

for (var i = 0; i < 13; i++) LONG_STRING += LONG_STRING;

// force cons-string flattening outside of timed loop.
LONG_STRING.charCodeAt();

console.log("LONG_STRING is %s chars long", LONG_STRING.length);

function Str2BufferJS(s) {
  var length = s.length;
  var b = new Buffer(length);
  for (var i = 0; i < length; i++) b[i] = s.charCodeAt(i);
  return b;
}

function Str2BufferNative(s) {
  var length = s.length;
  var b = new Buffer(length);
  b.asciiWrite(s, 0);
  return b;
}

var N = 1000;

function LoopAndTime(name, f) {
 var start = Date.now();
 for (var i = 0; i < N; i++) f();
 var end = Date.now();
 console.log("%s took %s ms per %s calls", name, end - start, N);
}

LoopAndTime("Str2BufferNative", function() { Str2BufferNative(LONG_STRING); });
LoopAndTime("Str2BufferJS", function() { Str2BufferJS(LONG_STRING); });

Results were fascinating:

# node 0.4.4 with V8 3.1.8.5
LONG_STRING is 81920 chars long
Str2BufferNative took 224 ms per 1000 calls
Str2BufferJS took 1180 ms per 1000 calls
# node 0.4.4. with V8 3.2.7 (current bleeding_edge)
LONG_STRING is 81920 chars long
Str2BufferNative took 254 ms per 1000 calls
Str2BufferJS took 231 ms per 1000 calls

As you can see JS function optimized by Crankshaft is now on par with native implementation of Buffer.writeAscii.

Comments