Bulletproof JavaScript benchmarks

Published 23rd December 2010 · tagged with JavaScript, performance

Note: The following article, written by John-David Dalton and yours truly, was published as part of the Performance Calendar series in 2010.

Writing JavaScript benchmarks isn’t as simple as it seems. Even without touching the subject of potential cross-browser issues, there are a lot of pitfalls — boobytraps, even — to look out for.

This is part of the reason why I created jsPerf, a simple web interface that allows you to very easily create and share test cases comparing the performance of different code snippets. There’s no need to worry about anything; just enter the code you would like to benchmark and have jsPerf create a test case for you which can be run across different browsers and devices.

Behind the scenes, jsPerf was initially using a JSLitmus-based benchmarking library which I named Benchmark.js. More and more features were added, and recently, John-David Dalton rewrote the whole thing from scratch. Benchmark.js has been getting better ever since.

This article will shed some light on the various gotchas in writing and running JavaScript benchmarks.

Benchmarking patterns

There are a lot of ways to run benchmarks on JavaScript snippets to test their performance. The most common pattern is the following:

Pattern A

var totalTime;
var start = new Date;
var iterations = 6;
while (iterations--) {
	// Code snippet goes here.
}
// `totalTime` is the number of milliseconds it took to execute the code snippet 6 times.
totalTime = new Date - start;

This places the code to be tested inside a loop and executes it a predefined number of times (in this case, 6). After that, the start date is subtracted from the end date to get the time taken to perform the operations.

Pattern A is used in the popular SlickSpeed, Taskspeed, SunSpider, and Kraken benchmark suites.

The problem(s)

As browsers and devices get faster, benchmarks that use fixed iteration counts have a greater chance of producing 0 ms results, which are unusable.

Pattern B

Another approach is to calculate how many operations are performed in a period of time. This has the advantage of not requiring you to pinpoint a number of iterations, as in the previous example.

var hz;
var period;
var totalTime;
var startTime = new Date;
var runs = 0;
do {
	// Code snippet goes here.
	runs++;
	totalTime = new Date - startTime;
} while (totalTime < 1000);

// Convert milliseconds to seconds.
totalTime /= 1000;

// period → how long each operation takes
period = totalTime / runs;

// hz → the number of operations per second.
hz = 1 / period;

// Without the conversion to seconds above, this can be shortened to:
// hz = (runs * 1000) / totalTime;

This snippet executes the test code for about a second, i.e. until totalTime is greater than or equal to 1000 ms.

Pattern B is used in Dromaeo and the V8 Benchmark Suite.

The problem(s)

When benchmarking, results vary from one run to the next due to garbage collection, engine optimizations, and other background processes. Because of this variance, a benchmark should be run several times to get an average result. The V8 Benchmark Suite only runs each benchmark once. Dromaeo runs each benchmark five times, but could do more to reduce its margin of error. One way would be to lower the minimum time a benchmark runs from 1000 ms to 50 ms (assuming a non-buggy timer), which frees up time for more repeated runs.
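As a rough sketch of that idea, pattern B can be wrapped in a function and sampled several times. The names `measureHz` and `sampleHz`, the 50 ms minimum time, and the sample count of 5 are assumptions made for illustration; they are not taken from Dromaeo or the V8 Benchmark Suite.

```javascript
// Pattern B as a reusable function (hypothetical helper).
// Runs `fn` repeatedly for at least `minTime` milliseconds and
// returns the measured number of operations per second.
function measureHz(fn, minTime) {
	var runs = 0;
	var startTime = new Date;
	var totalTime;
	do {
		fn();
		runs++;
		totalTime = new Date - startTime;
	} while (totalTime < minTime);
	return (runs * 1000) / totalTime;
}

// Hypothetical helper: take `sampleCount` measurements and average them.
function sampleHz(fn, minTime, sampleCount) {
	var samples = [];
	var sum = 0;
	var i;
	for (i = 0; i < sampleCount; i++) {
		samples.push(measureHz(fn, minTime));
		sum += samples[i];
	}
	return { samples: samples, mean: sum / sampleCount };
}

var result = sampleHz(function() {
	[3, 1, 2].sort();
}, 50, 5);
// `result.mean` is the average ops/sec over 5 runs of about 50 ms each.
```

Note that a 50 ms minimum only makes sense when the timer is accurate enough; on a coarse timer the individual samples would be too noisy to average usefully.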

Pattern C

JSLitmus is built around a combination of both these patterns. It uses pattern A to loop a test n times, but uses adaptive test cycles to dynamically increase n until a minimum test time is reached, as in pattern B.
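The combination can be sketched roughly as follows. This is not JSLitmus's actual code: the helper names `timeLoop` and `adaptiveBenchmark` are made up, and doubling n on every cycle is an assumption chosen to keep the example short.

```javascript
// Pattern A: time `count` iterations of `fn` (hypothetical helper).
function timeLoop(fn, count) {
	var i = count;
	var start = new Date;
	while (i--) {
		fn();
	}
	return new Date - start; // elapsed milliseconds
}

// Pattern B's stop condition: keep growing `count` until a single
// timed loop lasts at least `minTime` milliseconds.
function adaptiveBenchmark(fn, minTime) {
	var count = 1;
	var elapsed = timeLoop(fn, count);
	while (elapsed < minTime) {
		count *= 2; // growth factor is an assumption of this sketch
		elapsed = timeLoop(fn, count);
	}
	return {
		count: count,
		elapsed: elapsed,
		hz: (count * 1000) / elapsed
	};
}

var result = adaptiveBenchmark(function() {
	Math.sqrt(12345);
}, 50);
// `result.count` is the n that was finally needed to fill 50 ms.
```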

The problem(s)

JSLitmus avoids the issues of pattern A but shares the problems of pattern B. In an effort to increase result accuracy, JSLitmus calibrates results by taking the fastest of 3 empty test runs and subtracting the result from each benchmark result. Unfortunately, this technique — while intended to remove overhead cost — actually muddies the end result because “best of 3” is not a statistically valid method. Even if JSLitmus ran benchmarks multiple times and subtracted the calibration average from the benchmark result average, the end result’s increased margin of error would swallow any hope of increased accuracy.

Pattern D

The drawbacks of patterns A, B, and C can be avoided by using function compilation and loop unrolling.

function test() {
	x == y;
}

while (iterations--) {
	test();
}

// This would compile to:

var hz;
var startTime = new Date;

x == y;
x == y;
x == y;
x == y;
x == y;
// …

// `runs` is the number of times the snippet was repeated above.
hz = (runs * 1000) / (new Date - startTime);

This pattern compiles each test into a function with the loop unrolled, which removes the loop overhead and with it the need for calibration.

The problem(s)

However, it also has its downsides. Compiling functions like this can drastically increase memory usage and put a heavy load on the CPU. When you’re repeating a test a few million times, you’re basically creating a very large string and compiling a massive function.

Another caveat when using loop unrolling is that tests can exit early via a return statement. There’s no point in compiling a million-line function that will return at line 3 anyway. It’s necessary to detect early exits and fall back to the while loop (pattern A) with loop calibration when needed.
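To make this concrete, here is a minimal sketch of compiling an unrolled test with `new Function`, including a fallback for snippets that contain a `return`. The helpers `compileUnrolled`, `hasEarlyExit`, and `compileTest` are hypothetical names. The early-exit check is deliberately naive (it also matches `return` inside strings, comments, and nested functions), and the fallback does no loop calibration.

```javascript
// Build a function whose body is `body` repeated `runs` times,
// wrapped in timing code. `params` is a comma-separated parameter list.
function compileUnrolled(params, body, runs) {
	var unrolled = new Array(runs + 1).join(body + '\n');
	return new Function(params,
		'var startTime = new Date;\n' +
		unrolled +
		'return new Date - startTime;'
	);
}

// Naive check for an early exit in the snippet.
function hasEarlyExit(body) {
	return /\breturn\b/.test(body);
}

// Unroll when possible; otherwise fall back to a while loop (pattern A).
function compileTest(params, body, runs) {
	if (!hasEarlyExit(body)) {
		return compileUnrolled(params, body, runs);
	}
	// Wrapping the snippet in a function keeps its `return` from
	// ending the timing code itself.
	return new Function(params,
		'var iterations = ' + runs + ';\n' +
		'var startTime = new Date;\n' +
		'while (iterations--) {\n' +
		'(function() {\n' + body + '\n}());\n' +
		'}\n' +
		'return new Date - startTime;'
	);
}

var run = compileTest('x, y', 'x == y;', 1000);
var elapsed = run(1, '1');
// `elapsed` is in milliseconds and may well be 0 for so few repetitions.
```

The generated source grows linearly with `runs`, which is exactly the memory concern described above.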

Function body extraction

In Benchmark.js, a slightly different technique is used. You could say it uses the best parts of patterns A, B, C, and D. Because of memory concerns, we’re not unrolling loops. In order to reduce factors that might make results less accurate, and to allow tests to access local methods and variables, we extract the function body for each test. For example, when code like this is tested:

var x = 1;
var y = '1';

function test() {
	x == y;
}

while (iterations--) {
	test();
}

// This would compile to:

var x = 1;
var y = '1';
while (iterations--) {
	x == y;
}

After that, Benchmark.js uses a similar technique to JSLitmus: we run the extracted code in a while loop (pattern A), repeat it until a minimum time is reached (pattern B), and repeat the whole thing multiple times to produce statistically significant results.
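The extraction step itself can be sketched like this. It is a simplified illustration and not Benchmark.js's implementation: `extractBody` and `compileLoop` are made-up names, and the brace search assumes a plain function expression or declaration whose parameter list contains no braces.

```javascript
// Return the source text between the outermost braces of `fn`.
// Relies on String(fn) yielding the function's source code.
function extractBody(fn) {
	var source = String(fn);
	var start = source.indexOf('{') + 1;
	var end = source.lastIndexOf('}');
	return source.slice(start, end).replace(/^\s+|\s+$/g, '');
}

// Compile `setup` plus a timed while loop around the extracted body.
// The generated function takes the iteration count and returns the
// elapsed time in milliseconds.
function compileLoop(setup, fn) {
	return new Function('iterations',
		setup + '\n' +
		'var startTime = new Date;\n' +
		'while (iterations--) {\n' +
		extractBody(fn) + '\n' +
		'}\n' +
		'return new Date - startTime;'
	);
}

// `x` and `y` are only resolved inside the compiled function,
// where the setup code declares them.
var run = compileLoop("var x = 1;\nvar y = '1';", function() {
	x == y;
});
var elapsed = run(100000);
```

Because the setup code, the test body, and the timing variables share one scope here, a test that declares its own `startTime` or `iterations` would break this sketch; a real library has to guard against such name clashes.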

Some things to consider

Inaccurate millisecond timers

In some browser/OS combinations, the timers may be inaccurate because of various issues.

For example:

When Windows XP boots, the typical default clock interrupt period is 10 milliseconds, although a period of 15 milliseconds is used on some systems. That means that every 10 milliseconds, the operating system receives an interrupt from the system timer hardware.

Some older browsers (e.g. IE, Firefox 2) rely on the internal OS timers, meaning that every call to new Date().getTime() fetches the time directly from the operating system. Obviously, if the internal timer only gets updated every 10 or 15 milliseconds, the uncertainty in the measurement increases and the accuracy of test results decreases significantly. We need to work around this.

Luckily, it’s possible to use JavaScript to get the smallest unit of measure. After that, we can use a little math to reduce the percentage uncertainty of our test results to 1%. To do this, we need to divide the smallest unit of measure by 2 to get the uncertainty. Let’s say we’re using IE6 on Windows XP, and the smallest unit of measure is 15 ms. In this case, the uncertainty equals 15 ms / 2 = 7.5 ms. We want this number to signify only 1%, so we just divide it by 0.01, which gives us the minimum test time required: 7.5 / 0.01 = 750 ms.
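Here is one way this could look in code. The helper names `getTimerResolution` and `getMinTestTime` and the sample count of 30 are assumptions for this sketch. Keeping the smallest observed jump is a simplification; averaging the observed jumps would be another reasonable choice.

```javascript
// Estimate the smallest unit of measure (in ms) by spinning until
// the reported time changes, and keeping the smallest jump seen.
function getTimerResolution(samples) {
	var smallest = Infinity;
	var begin, now, delta;
	while (samples--) {
		begin = new Date().getTime();
		// Busy-wait until the clock ticks.
		do {
			now = new Date().getTime();
		} while (now === begin);
		delta = now - begin;
		if (delta > 0 && delta < smallest) {
			smallest = delta;
		}
	}
	return smallest;
}

// Minimum test time needed so that the timer's uncertainty
// (half the resolution) is only `relativeUncertainty` of the total.
function getMinTestTime(resolution, relativeUncertainty) {
	var uncertainty = resolution / 2;
	return uncertainty / relativeUncertainty;
}

var resolution = getTimerResolution(30);
var minTime = getMinTestTime(resolution, 0.01);

// The IE6 on Windows XP example from above: 15 ms resolution, 1% target.
var ie6MinTime = getMinTestTime(15, 0.01); // 750
```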

Alternative timers

When run with the --enable-benchmarking flag, Chrome and Chromium expose a chrome.Interval method, which can be used as a high-resolution microsecond timer.

While working on Benchmark.js, John-David Dalton stumbled across Java’s nanosecond timer and promptly exposed it to JavaScript using a tiny Java applet. It would be interesting to see if there are more possibilities here using other browser plugins.

Using a higher resolution timer allows for shorter test times, which allows for larger sample sizes, which produces a smaller margin of error for the results.

Firebug disables Firefox’s JIT

Enabling the Firebug add-on effectively disables all of Firefox’s high-performance just-in-time (JIT) native code compilation, meaning you’ll be running the tests in the interpreter. In other words, your tests will run much slower than they would otherwise. You should always remember to disable Firebug before running benchmarks in Firefox.

Although the impact appears to be much smaller there, the same goes for other browser inspector tools, like WebKit’s Web Inspector or Opera’s Dragonfly. Avoid having these open when running benchmarks, as they might influence the results.

Browser bugs and features

Benchmarks that have some kind of looping mechanism are susceptible to various browser quirks, as IE9’s dead-code removal recently demonstrated. Bugs in Mozilla’s TraceMonkey engine or Opera 11’s caching of querySelectorAll results can also throw a wrench into benchmark results. It’s important to keep that in mind when creating test cases.

Statistical significance

Most benchmarks/benchmarking scripts produce results that aren’t statistically significant. John Resig wrote about this before in his article on JavaScript benchmark quality. In short, it’s necessary to consider the margin of error of each result, and reduce it as much as possible. A larger sample size, composed of completed test runs, helps to reduce the margin of error.
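As an illustration, the margin of error for a set of samples can be computed along these lines. The function name `summarize` and the sample values are made up. The critical value has to match the sample size and confidence level: 2.776 is the two-tailed 95% Student's t value for 4 degrees of freedom, and the 1.96 default is only a normal approximation that suits larger samples.

```javascript
// Compute the mean, standard error, and margin of error of `samples`.
function summarize(samples, critical) {
	var n = samples.length;
	var mean = 0;
	var variance = 0;
	var i, sem, moe;

	for (i = 0; i < n; i++) {
		mean += samples[i];
	}
	mean /= n;

	// Sample variance (divides by n - 1).
	for (i = 0; i < n; i++) {
		variance += Math.pow(samples[i] - mean, 2);
	}
	variance /= (n - 1);

	sem = Math.sqrt(variance / n); // standard error of the mean
	moe = sem * (critical || 1.96); // margin of error

	return {
		mean: mean,
		sem: sem,
		moe: moe,
		rme: (moe / mean) * 100 // relative margin of error, in percent
	};
}

// Five made-up ops/sec samples.
var stats = summarize([10, 12, 14, 16, 18], 2.776);
// stats.mean is 14, and the result would be reported as mean ± stats.moe.
```

Since the standard error shrinks with the square root of the sample size, collecting more completed runs narrows the margin of error, which is the point made above.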

Cross-browser testing

If you want to run benchmarks in different browsers and get reliable results, be sure to test in the real browsers. Do not rely on Internet Explorer’s compatibility modes — these differ from the actual browser versions they’re emulating.

Also, be aware that rather than limiting a script by running time like all other browsers do, IE (up to version 8) limits a script to 5 million instructions. With modern hardware, a CPU-intensive script can hit this limit in less than half a second. If you have a reasonably fast system, you may run into “Script Warning” dialogs in IE, in which case the best solution is to modify your Windows Registry to increase the number of instructions allowed. Luckily, Microsoft provides an easy way of doing this; all you need to do is run a simple “Fix It” wizard. What’s even better is that this silly limitation is removed in IE9.

Conclusion

Whether you’re just running some benchmarks, writing your own test suite, or even coding your own benchmarking library — there’s more to JavaScript benchmarking than meets the eye. Benchmark.js and jsPerf are updated weekly with small bug fixes, new features, and clever tricks that improve the accuracy of the test results. If only the popular browser benchmarks would do the same…
