Short answer: they aren't always conclusive. They can be biased and incomplete, and should at most be seen as an indication, not as hard fact.
(Very) long answer:
I'll be using a few terms in this answer that may not be immediately clear to people:
- JIT, and jitted: JIT stands for "Just-In-Time" compilation; a JIT compiler converts text-based script into equivalent machine code as it is encountered, for (much) faster execution. If a function is "jitted", its code has been converted this way rather than being interpreted
- DOM: the Document Object Model, an object-oriented structure that is the meat and potatoes of dynamic webpages
To know what exactly is tested and how it's tested, let's first take a closer look at the different popular benchmarks:
- Sunspider, Kraken and V8: These test almost exclusively pure (jitted) JS
- Dromaeo: A benchmark that tests a mix of pure JS, interpreted JS and DOM/CSS
- Peacekeeper (Futuremark): A benchmark that tests a mix of JS, DOM and graphical elements
All of these benchmarks rely very heavily on JS and its execution. Although JS is very important for modern webpages, it is certainly far from the be-all and end-all: just as important are the speeds at which DOM and CSS are handled, how HTML is parsed by the browser, and how the compositor works. In addition, graphics rendering speed, network speed and buffering matter, as well as how efficiently the browser handles its memory. As a result, none of the current browser benchmarks give you a full picture of how a browser performs overall.

Sunspider, Kraken and V8 can be considered the least interesting benchmarks because they only really focus on the JIT part of JS and don't look at any other part of the browser. Dromaeo is a bit better, but still a very heavily JS-focused test (the title says so, even). Peacekeeper extends the range a bit further by actually adding some rendering tests, although it has its own set of issues, like a lack of statistical confidence and a very heavy reliance on hardware (making it more of a hardware test than a browser test), and it still doesn't cover HTML parsing, networking or memory handling.
This means that all of these benchmarks only paint a partial picture. With current JIT compilers being as fast as they are, you can even consider the measured differences negligible unless one browser outperforms another by several times.
4) 32-bit versus 64-bit and tight loops
Benchmarks invariably perform their tests in "tight loops": a small bit of code that is looped through rapidly, many times over. If properly calibrated (meaning the empty loop is also timed, without performing anything, to get a "reference" value), this kind of tight loop can quite efficiently measure how well a specific, single instruction is executed. This is, however, not the kind of behavior you'd encounter when browsing webpages, where you would normally have a large number of different instructions one after the other. Testing in tight loops may therefore not give you any sort of conclusive result; it can only be seen as an indication of what overall browser speed could be if no other factors came into play.
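To make this concrete, here is a minimal sketch of the tight-loop-with-calibration pattern described above. The names (`runTightLoop`, `ITERATIONS`, the choice of `Math.sqrt` as the operation under test) are illustrative and not taken from any real benchmark suite:

```javascript
// Hypothetical sketch of how a benchmark times a single operation in a
// tight loop, with a calibration pass to subtract the loop's own overhead.

const ITERATIONS = 1_000_000;

function runTightLoop(fn) {
  const start = Date.now();
  for (let i = 0; i < ITERATIONS; i++) {
    fn(i);
  }
  return Date.now() - start;
}

// Calibration: an empty body measures the cost of the loop machinery itself.
const overhead = runTightLoop(() => {});

// Measured run: the same loop wrapped around the one instruction under test.
const raw = runTightLoop((i) => Math.sqrt(i));

// The reported "score" is the measured time minus the loop overhead.
const score = Math.max(0, raw - overhead);
console.log(`overhead: ${overhead} ms, raw: ${raw} ms, score: ${score} ms`);
```

Note what this measures: one operation, repeated in isolation. It says nothing about how the engine behaves when a page runs thousands of different operations interleaved, which is exactly the limitation described above.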
This also brings me to testing 32-bit and 64-bit browsers against each other. Because browser functions usually don't use the 64-bit address space or large variables, and because of the tight loops, these benchmarks won't specifically exercise what a 64-bit browser is strong at. Tight-loop testing also brings in another factor: because a 64-bit processor uses registers twice the size, you are inherently pushing twice as much data through your hardware. Tight loops are specifically sensitive to the amount of data passed to and from the processor, and you should always take this into account when comparing test results between a 32-bit and a 64-bit browser. You can expect 64-bit browsers to score (quite a bit) lower than their 32-bit counterparts; this doesn't mean the browser is slower, though, just that the test isn't measuring the browser's full capacity.
5) Bias, bugs, and cheating
Benchmarks can be (horribly) biased to favor one particular browser. This is especially the case if the benchmark is created and operated by the same people who put out the browser being tested (or people affiliated with them): it's not scientifically sound to write a test to prove your own theory, and writing a test that "proves" your browser is faster than any other isn't too hard if you bias it towards what you know your browser's strengths are. This is the reason why, even if I could, I have not written a benchmark myself; being the Pale Moon developer, my in-depth knowledge of the code and its optimizations would likely (subconsciously) influence how I wrote it, or I could easily enough be accused of such even if I wrote it fully objectively.
Benchmarks or browsers can also have bugs that cause the measured results to be wrong. You are inherently using the browser's internal scripting to measure the performance of that same browser; you can't rely on the very thing you are testing to provide unbiased, objective results.
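One simple illustration of this self-measurement problem (a sketch, not taken from any real benchmark): the script engine's own clock has limited resolution, so timing a single fast operation with it frequently reports zero elapsed time, because the operation finishes well within one tick of the very clock being used to measure it.

```javascript
// Illustrative only: timing one fast operation with the engine's own
// millisecond-resolution clock. The operation typically completes far
// inside a single clock tick, so the measurement often reads 0 ms and
// tells you nothing about how fast the operation actually is.

function timeOnce(fn) {
  const start = Date.now();
  fn();
  return Date.now() - start; // ms granularity: too coarse for one call
}

const elapsed = timeOnce(() => Math.sqrt(123456));
console.log(`elapsed: ${elapsed} ms`); // usually 0 on any modern machine
```

Real benchmarks work around this with tight loops as described earlier, but the underlying point stands: the measuring instrument and the thing being measured are the same piece of software.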
When building a browser, you can also cheat benchmarks. This is especially true with profile-guided optimization (PGO), a technique where the browser is built, then run while "normal use" is recorded, and finally rebuilt with the compiler biasing its optimizations towards the functions actually seen in use. By making sure the profiling run exercises exactly the functions the benchmarks use, those functions are compiled to execute as fast as possible, at the expense of everything else the browser does. For example, if the profiling run did nothing but mathematical calculations, the final build would be exceedingly good at them, but at the expense of, e.g., rendering graphics or networking.
The conclusion is that benchmarks can't be used to draw hard (or often even any) conclusions. Plain and simple: they are an indication, nothing more. They serve well if you compare closely related siblings (e.g. Firefox and Iceweasel) or different builds of the exact same browser, to get a relative performance difference on the limited subset of what is actually tested, but that's about as far as it goes.