Well, I've been looking into this because of that lookbehind issue. The C++ part of the implementation seems perfectly fine, but the problems occur when things get handed off to the macro assembler here in RegExpEngine.cpp:
Code:
// If we advance backward, we may end up at the start.
successor_trace.AdvanceCurrentPositionInTrace(
    read_backward() ? -Length() : Length(), compiler);
Code:
// If we advance backward, we may end up at the start.
if (read_backward()) {
    successor_trace.AdvanceCurrentPositionInTrace(-Length(), compiler);
}
else {
    successor_trace.AdvanceCurrentPositionInTrace(Length(), compiler);
}
Essentially, when reading backward, it feeds a negative length into the macro assembler, and that is apparently what causes the crash. Our macro assembler doesn't like the negative value here, and the reason can be gleaned from examining Google's code changes. At first, it may look like the two implementations don't have much to do with each other:
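As a rough illustration of why a negative length could matter, here is a toy sketch. It is not code from either engine, the function names are hypothetical, and I can't confirm this is the actual failure. It only shows how a bounds check written with forward reads in mind can wave through an offset that lands before the start of the input:

```cpp
#include <cassert>
#include <cstddef>

// Toy model only: neither function exists in V8 or in our tree.
// pos    = index of the current position in the input
// offset = relative position to read from (negative when reading backward)
// length = input length in characters

// Check written for forward reads only: it guards the end of the input
// but never looks at the start.
bool forwardOnlyInBounds(std::ptrdiff_t pos, std::ptrdiff_t offset,
                         std::ptrdiff_t length) {
    return pos + offset < length;
}

// Check that also copes with negative offsets by guarding the start.
bool bidirectionalInBounds(std::ptrdiff_t pos, std::ptrdiff_t offset,
                           std::ptrdiff_t length) {
    const std::ptrdiff_t target = pos + offset;
    return target >= 0 && target < length;
}
```

With pos = 1 and offset = -3, the forward-only check reports the read as in bounds even though it would land two characters before the start, while the second check rejects it.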
https://github.com/v8/v8/commit/906903a ... 5c2055d1c1
https://github.com/MoonchildProductions ... embler.cpp
But if you compare NativeRegExpMacroAssembler.cpp with regexp-macro-assembler-ia32.cc very closely, it becomes apparent what happened. Mozilla translated all of the assembler instructions here in the macro assembler into some weird Mozilla-specific JIT assembler language that bears little resemblance to standard x86 assembly. I can kind of understand what's going on in the Google code because I have a passing familiarity with x86 assembler, but I've never seen anything like what we have before.
After playing around with the parts of our code that correspond to what Google changed in regexp-macro-assembler-ia32.cc, I noticed that I can change it enough that the crash doesn't happen at all but the macro assembler has bugs, or change it so that the crash happens on startup with the same error code regardless of what web page is loaded. So I don't actually know how to fix it, but I am becoming convinced that there's a link here.
All I've been able to determine about this mysterious assembler language is this:
Code:
esi = input_end_pointer
edx = current_character
edi = current_position
ebp = StackPointer
eax = temp0 (usually)
ebx = temp1 (usually)
Code:
__ mov(edx, register_location(start_reg)); // Index of start of capture
__ mov(ebx, register_location(start_reg + 1)); // Index of end of capture
Code:
masm.loadPtr(register_location(start_reg), current_character); // Index of start of capture
masm.loadPtr(register_location(start_reg + 1), temp1); // Index of end of capture
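To make the correspondence concrete, here is a small sketch that applies the register table to the pair of loads above. It assumes the `__ mov(...)` snippet is the Google version and the `masm.loadPtr(...)` snippet is ours. `namedRegister` and `translateMovLoad` are hypothetical helpers made up for this illustration and exist in neither code base. The main thing it shows is that the operand order flips: destination first in the x86 form, source first in ours.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: maps an x86 register name to the named register
// used in our macro assembler, based on the table above.
// eax/ebx are only "usually" temp0/temp1, so those two entries are a guess.
std::string namedRegister(const std::string& x86Reg) {
    static const std::map<std::string, std::string> table = {
        {"esi", "input_end_pointer"},
        {"edx", "current_character"},
        {"edi", "current_position"},
        {"ebp", "StackPointer"},
        {"eax", "temp0"},
        {"ebx", "temp1"},
    };
    const auto it = table.find(x86Reg);
    // Unknown registers are passed through unchanged.
    return it == table.end() ? x86Reg : it->second;
}

// Rewrites a load of the form `__ mov(dst, src)` (destination first)
// as `masm.loadPtr(src, dst)` (source first).
std::string translateMovLoad(const std::string& dst, const std::string& src) {
    return "masm.loadPtr(" + src + ", " + namedRegister(dst) + ");";
}
```

Calling `translateMovLoad("edx", "register_location(start_reg)")` reproduces the first `masm.loadPtr` line from the snippet above. This only covers simple loads; other instructions don't map across this mechanically.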
I've found a brief overview of this language in this blog post:
https://paul.bone.id.au/blog/2018/09/14 ... te-values/