What's the deal with browser benchmarks?

Frequently Asked Questions about the Pale Moon browser and their answers.

Unread post by Moonchild » 2012-04-09, 10:07

What's the deal with browser benchmarks?

Short answer: They aren't always conclusive. They can be biased or incomplete, and should at most be seen as an indication, not as hard fact.

(Very) long answer:

1) Definitions

I'll be using a few terms in this answer that may not be immediately clear to people:
  • JS: Short for "JavaScript", the universal scripting language that is used in webpages
  • JIT, and jitted: JIT is a "Just-In-Time" compiler, which converts text-based scripting into a machine code equivalent when it is encountered for (much) faster execution. If a function is "jitted", this means that code has been converted this way and isn't "interpreted"
  • Pure JS: This stands for "Pure JavaScript", and with this I mean the functions of JS that are most likely to be jitted like math operations, bitwise operations, etc.
  • DOM: the Document Object Model, an object-oriented structure that is the meat and potatoes of dynamic webpages
2) The Benchmarks

To know what exactly is tested and how it's tested, let's first take a closer look at the different popular benchmarks:
  • Sunspider, Kraken and V8: These test almost exclusively Pure JS
  • Dromaeo: A benchmark that tests a mix of Pure JS, interpreted JS and DOM/CSS
  • Peacekeeper (Futuremark): A benchmark that tests a mix of JS, DOM and graphical elements
3) What is tested and what is not tested

All of the benchmarks rely very heavily on JS and its execution. Although JS is very important for modern webpages, it is far from the be-all and end-all. Just as important are the speeds at which DOM and CSS are handled, how HTML is parsed by the browser, and how the compositor works. In addition, graphics rendering speed, network speed and buffering are important, as well as how efficiently the browser handles its memory. As a result, none of the current browser benchmarks out there give you a full picture of how a browser performs overall. Sunspider, Kraken and V8 can be considered the least interesting benchmarks because they only really focus on the JIT part of JS and don't look at any of the other parts of the browser. Dromaeo is a bit better, but is still a very heavily JS-focused test (the title even says so). Peacekeeper extends the range a bit further by actually adding some rendering tests, although this benchmark has its own set of issues, like a lack of statistical confidence and relying very heavily on hardware (which makes it more of a hardware test than a browser test), and it still doesn't include HTML parsing, networking tests or memory handling.

This means that all of these benchmarks only give a partial picture. With the current JIT compilers, you can even consider the differences between browsers negligible unless one scores more than several times higher than another.
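As a rough illustration of reading scores this way, here is a minimal Python sketch. The function name `compare_scores`, the made-up scores and the cut-off of 3.0 (standing in for "several times") are all assumptions for the example and don't come from any actual benchmark.

```python
from statistics import mean, stdev

def compare_scores(scores_a, scores_b, min_ratio=3.0):
    """Compare repeated benchmark scores (higher is better) of two browsers.

    Scores are assumed to be positive, with at least two runs per browser.
    min_ratio is a hypothetical cut-off for "several times"; pick your own.
    """
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    # How many times better the faster browser scored than the slower one.
    ratio = max(mean_a, mean_b) / min(mean_a, mean_b)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "spread_a": stdev(scores_a),  # run-to-run variation for browser A
        "spread_b": stdev(scores_b),  # run-to-run variation for browser B
        "ratio": ratio,
        "meaningful": ratio >= min_ratio,
    }

# Made-up scores, purely for illustration.
result = compare_scores([100, 104, 96], [118, 122, 120])
print(result["ratio"], result["meaningful"])
```

With these numbers the second browser scores about 1.2 times higher, which under the assumed cut-off would not count as a meaningful difference.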

4) 32-bit versus 64-bit and tight loops

Benchmarks invariably perform their tests in "tight loops", which means a small bit of code that is looped through rapidly many times. If properly calibrated (meaning the loop is also run without performing anything, to get a "reference" value), this kind of tight loop can quite efficiently measure how well a specific, single instruction is executed. This is, however, not the kind of behavior you'd encounter when browsing webpages, where you would normally have a large number of different instructions one after the other. Testing in tight loops may therefore not give you any sort of conclusive result; it can only be seen as an indication of what overall browser speed could be if not influenced by other factors.
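A calibrated tight loop of the kind described above could look roughly like this. It is a sketch in Python rather than JS, and the helper names (`time_loop`, `calibrated_time`, `square_root`) and the iteration count are made up for the example; real benchmarks are considerably more careful.

```python
import math
import time

def time_loop(body, iterations):
    """Return the seconds taken to call body() `iterations` times in a tight loop."""
    start = time.perf_counter()
    for _ in range(iterations):
        body()
    return time.perf_counter() - start

def calibrated_time(operation, iterations=200_000):
    """Time `operation` in a tight loop, minus an empty "reference" loop."""
    def nothing():
        pass
    reference = time_loop(nothing, iterations)   # loop and call overhead only
    measured = time_loop(operation, iterations)  # overhead plus the operation
    # The difference can come out near zero or even negative when the
    # operation is very cheap, because of timer noise.
    return measured - reference, reference, measured

def square_root():
    # The single operation under test (an arbitrary choice for this sketch).
    return math.sqrt(12345.678)

net, reference, measured = calibrated_time(square_root)
print(f"reference: {reference:.4f}s  measured: {measured:.4f}s  net: {net:.4f}s")
```

Note that this only tells you something about one operation repeated in isolation, which is exactly the limitation described above.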

This also brings me to testing 32-bit and 64-bit browsers against each other. Because these tight loops, and functions in a browser in general, usually don't use the 64-bit address space or big variables, the benchmarks won't specifically test what a 64-bit browser is strong at. Tight loop testing also brings in another factor: because a 64-bit processor uses registers that are twice as large, you are inherently pushing up to twice as much data through your hardware. Tight loops are especially sensitive to the amount of data that is passed to and from the processor, and you should always take this into account when comparing test results between a 32-bit and a 64-bit browser. You can expect 64-bit browsers to score (quite a bit) lower than their 32-bit counterparts; this doesn't mean that the browser is slower, though, just that the test isn't measuring the full capacity of the browser.
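The "twice as much data" point can be put into numbers with a small sketch. It assumes the data in question consists purely of pointer-sized values (4 bytes in a 32-bit build, 8 bytes in a 64-bit build), which is a worst case; the helper name `pointer_table_bytes` and the count are hypothetical.

```python
import struct

def pointer_table_bytes(count, pointer_format):
    """Bytes needed to store `count` values of the given struct format."""
    return count * struct.calcsize(pointer_format)

COUNT = 1_000_000  # hypothetical number of pointer-sized values

bytes_32 = pointer_table_bytes(COUNT, "<I")  # 4-byte values, as in a 32-bit build
bytes_64 = pointer_table_bytes(COUNT, "<Q")  # 8-byte values, as in a 64-bit build

print(bytes_32, bytes_64, bytes_64 / bytes_32)  # 4000000 8000000 2.0
```

Data that isn't pointer-sized doesn't grow this way, so the factor of two is an upper bound rather than what every test will see.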

5) Bias, bugs, and cheating

Benchmarks can be (horribly) biased to favor one particular browser. This is especially the case if the benchmark is created and operated by the people who also put out the browser to be tested (or who are affiliated with them). It's not scientifically sound to write a test to prove your own theory, and writing a test to prove your browser faster than any other out there isn't too hard if you bias the test specifically to what you know your browser's strengths are. This is the reason why, even if I could, I have not written a benchmark myself; being the Pale Moon developer, I would (subconsciously) likely be influenced in writing my benchmark by my in-depth knowledge of the code and optimizations, or be accused of such easily enough even if I wrote it fully objectively.

Benchmarks or browsers can also have bugs that cause the measured results to be wrong. You are inherently using the browser's internal scripting to measure the performance of the browser. You can't rely on using the very thing you test to provide unbiased, objective results.

When building a browser, you can also cheat benchmarks, especially when using profile-guided optimization (PGO). PGO is a technique that builds the browser, lets you run it while recording "normal use" of the browser, and then fine-tunes the final build by biasing it towards the functions that were actually seen in use. By making sure the specific functions used in benchmarking are exercised in that run, they get compiled to execute as fast as possible, at the expense of all other functions of the browser. For example, if a build were made where the "profiling run" did nothing but mathematical calculations, then the final build would be exceedingly good at them, but at the expense of, e.g., rendering graphics or networking.
Mozilla Firefox actually does this: look at the profiling script in the source tree and you can see that it runs through a subset of Sunspider. This will force the compiler to focus heavily on those JavaScript functions and compile them with very heavy bias, making them faster in the final result at the expense of other functions, because you can't have your cake and eat it too.

6) Conclusion

The conclusion is that benchmarks can't be used to draw hard (or often even any) conclusions. Plain and simple: they are an indication, nothing more. They serve well if you compare closely related siblings (e.g. Firefox and Iceweasel) or different builds of the exact same browser, to get a relative performance difference between the two on the limited subset of what is actually tested, but that's about as far as it goes.
