Page 1 of 1

Research into NativeRegExpMacroAssembler.cpp and Mozilla's assembly language.

Posted: 2020-01-22, 06:11
by athenian200
I'm not sure if this is the right place to post this, but I felt like I should put my research up somewhere on the forums in case anyone has a need to deal with Mozilla's assembly language or the RegExp macro assembler in the future.

Well, I've been looking into this because of that lookbehind issue. The C++ part of the implementation seems perfectly fine, but the problems occur when things get handed off to the macro assembler here in RegExpEngine.cpp:

Code: Select all

 // If we advance backward, we may end up at the start.
        read_backward() ? -Length() : Length(), compiler);
The syntax here is a mildly cryptic shorthand, but for clarity's sake, it basically means roughly the same as this:

Code: Select all

 // If we advance backward, we may end up at the start.
    if (read_backward()) {
        successor_trace.AdvanceCurrentPositionInTrace(-Length(), compiler);
    else {
        successor_trace.AdvanceCurrentPositionInTrace(Length(), compiler);
Which also crashes, but is at least easier to read.

Essentially, when reading backward, it feeds a negative length into the macro assembler, and that is apparently what causes it to crash. Our macro assembler doesn't like the negative value here, and the reason why can be gleaned from examining Google's code changes. At first, it may look like the two implementations don't have much to do with each other: ... 5c2055d1c1 ... embler.cpp

But if you compare NativeRegExpMacroAssembler.cpp with very closely, it becomes apparent what happened. Mozilla translated all of the assembler instructions here in the macro assembler into some weird Mozilla-specific JIT assembler language that bears little resemblance to standard x86 assembly. I can kind of understand what's going on in the Google code because I have a passing familiarity with x86 assembler, but I've never seen anything like what we have before.

After playing around with the parts of our code that correspond to the parts Google changed in I notice that I can change it up enough to where the crash doesn't happen at all but the macro assembler has bugs, or change it so the crash happens on startup with the same error code regardless of what web page is loaded. So I don't actually know how to fix it, but I am becoming convinced that there's a link here.

All I've been able to determine about this mysterious assembler language is this:

Code: Select all

esi = input_end_pointer
edx = current_character
edi = current_position
ebp = StackPointer
eax = temp0 (usually)
ebx = temp1 (usually)
And also, arguments are usually reversed. So something like:

Code: Select all

__ mov(edx, register_location(start_reg));  // Index of start of capture
__ mov(ebx, register_location(start_reg + 1));  // Index of end of capture
Would be turned into:

Code: Select all

masm.loadPtr(register_location(start_reg), current_character);  // Index of start of capture
masm.loadPtr(register_location(start_reg + 1), temp1);  // Index of end of capture
The main problem is that not all the instructions have such a straightforward translation. cmp instructions are translated into masm.branchPtr instructions in a way that's fairly confusing, lea instructions usually correspond to masm.computeEffectiveAddress in some way... but it's not a 1:1 mapping process. What makes porting the changes to the macro assembler over particularly tough is that functions needed to make lookbehind work properly with the macro assembler are indeed these tougher ones, loaded with painful-to-translate functionality (at least for anyone who isn't extremely familiar with both x86 assembler and this "masm" language).

I've found a brief overview of this language talked about in this blog: ... te-values/

Re: Research into NativeRegExpMacroAssembler.cpp and Mozilla's assembly language.

Posted: 2021-07-21, 22:29
by Moonchild
Digging this one back up i looked into the code. We don't even need to dig into it too deeply because i think I found the problem:

Code: Select all

Trace::AdvanceCurrentPositionInTrace(int by, RegExpCompiler* compiler)
    MOZ_ASSERT(by > 0);
    // We don't have an instruction for shifting the current character register
    // down or for using a shifted value for anything so lets just forget that
    // we preloaded any characters into it.
    characters_preloaded_ = 0;
    // Adjust the offsets of the quick check performed information.  This
    // information is used to find out what we already determined about the
    // characters by means of mask and compare.
    quick_check_performed_.Advance(by, compiler->ascii());
    cp_offset_ += by;
    if (cp_offset_ > RegExpMacroAssembler::kMaxCPOffset) {
        cp_offset_ = 0;
    bound_checked_up_to_ = Max(0, bound_checked_up_to_ - by);
The issue is twofold:
resetting the preload to 0 then quick-check advancing it by a negative value will of course not work, AND
the very last line; while the offset is fine because a += with a negative value will properly adjust the character offset, the "- by" in the bound check will end up doing a double negative = addition, and may overflow, which a Max() will not catch.

So with that I think it's possible to fix that, easy if we can grab the full length to calculate the offset properly, otherwise needing a bit of juggling.

Of course the Assert at the top should go away too if we make it so it can handle negative values properly as well.

We'd also need to adapt the following to like negative numbers so the quick check advance works:

Code: Select all

QuickCheckDetails::Advance(int by, bool ascii)
    MOZ_ASSERT(by >= 0);
    if (by >= characters_) {
    for (int i = 0; i < characters_ - by; i++) {
        positions_[i] = positions_[by + i];
    for (int i = characters_ - by; i < characters_; i++) {
        positions_[i].mask = 0;
        positions_[i].value = 0;
        positions_[i].determines_perfectly = false;
    characters_ -= by;
    // We could change mask_ and value_ here but we would never advance unless
    // they had already been used in a check and they won't be used again because
    // it would gain us nothing.  So there's no point.