So, about this time last month, I was looking for something to distract myself from a stressful situation in real life and keep my mind occupied. I was looking at the Pale Moon source code and noticed they'd removed Solaris support. So I was thinking to myself, "How hard would it be to add it back in and then make the program actually compile and run?" So I simply installed OpenIndiana (basically the official successor to OpenSolaris after Oracle closed that project down) in a virtual machine and got to work despite having no real experience with Solaris, Firefox, or Pale Moon. The only thing I knew about Solaris going in is that it's the "other" Unix they offered on x86 systems at my college besides Linux so that they could teach about POSIX compliance, avoiding "Linuxisms," and say that they teach Unix and not just Linux. I wasn't able to stick with my degree because of Calculus, but I always wondered what working with it would have been like.
There were five things I learned that were encouraging to me early on.
1. Oracle Solaris and the illumos distributions build Firefox with GCC now, and haven't used Sun Studio to do so in ages, so all the code that makes those assumptions is outdated. In fact, most of OpenIndiana is built with GCC 7 specifically. They do use their own linker, but I knew going in I wouldn't have to deal with any clang weirdness.
2. Most of the GNU toolchain is available, but you have to prefix commands with "g" to get the GNU version instead of the Solaris version.
3. Mozilla regards Solaris as a Tier 2 or 3 platform, and a ton of high-quality patches for it were created during or just after the Firefox 52ESR lifecycle by Mozilla at the request of an incredibly overworked Oracle employee trying to get the biggest Solaris issues fixed upstream.
4. All of the UXP project's major dependencies, like SQLite, NSS, NSPR, libevent, libffi, and other libraries are available and more or less up-to-date on Solaris. NSS and NSPR have been on it since the beginning, with Netscape getting involved with Sun/Java offerings early on to power their server products back in the day.
5. Solaris is a direct descendant of System V, and Linux follows System V conventions in a lot of places, unlike the BSDs. Solaris/illumos seriously has system headers with a 1989 AT&T copyright notice attached, because it is actually System V Unix code from Bell Labs that has barely changed in 30 years. So there's a lot of overlap in the design, and a lot of POSIX functionality to fall back on where the differences lie.
So after I got the system up and running, I tried to load a mozconfig file... and hit my first error before ever starting the build. Turns out that Solaris uses Ksh, and while Bash is available, it's hard to convince it to execute a script as a Bash script with all Bash features rather than a version limited to Ksh features. Anyway, it turned out Mozilla actually made a patch to remove the "Bash localism," and the mozconfig loader is now POSIX compliant (which it should have been in the first place). That was the first patch I applied.
From there, it was mostly a matter of applying build system patches so the build system would recognize Solaris. About 90% of the time it would take the same code as Linux, and the other 10% of the time it was basically like FreeBSD. One theme that kept coming up was that I had to replace several memory-related functions like memalign and madvise with posix_memalign and posix_madvise, because Solaris has versions of those functions that take different arguments like caddr_t. This had to be ifdefed only because apparently a few versions of Linux don't actually have posix_memalign and only have the regular version with the POSIX syntax. I would say that this was the most common unexpected compile error I kept getting caught by: some "memalign" or "madvise" call somewhere in the code that I forgot to change.
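For anyone curious what that ifdef pattern looks like, here is a minimal sketch. It is not the actual patch: the helper names portable_aligned_alloc and portable_advise_sequential are made up for illustration, and I'm assuming the compiler predefines __sun on Solaris/illumos (GCC does). The idea is just to take the POSIX-named functions on Solaris and leave the plain madvise in place everywhere else.

```c
#define _DEFAULT_SOURCE 1   /* lets glibc expose madvise() under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate `size` bytes aligned to `alignment`.
 * posix_memalign requires a power-of-two alignment that is also a
 * multiple of sizeof(void *). Returns NULL on failure. */
void *portable_aligned_alloc(size_t alignment, size_t size)
{
    void *ptr = NULL;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(alignment % sizeof(void *) == 0);

    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
}

/* Hypothetical helper: hint that a page-aligned range will be read
 * sequentially. Returns 0 on success, -1 on failure. */
int portable_advise_sequential(void *addr, size_t len)
{
#if defined(__sun)
    /* Solaris/illumos: use the POSIX variant, which takes a void pointer. */
    return posix_madvise(addr, len, POSIX_MADV_SEQUENTIAL) == 0 ? 0 : -1;
#else
    /* Elsewhere: keep the traditional call. */
    return madvise(addr, len, MADV_SEQUENTIAL) == 0 ? 0 : -1;
#endif
}
```

Calling portable_aligned_alloc(65536, 65536) and then passing the result to portable_advise_sequential should work the same on both sides of the ifdef. The real code base has many more call sites than this, which is why it was so easy to miss one.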
The build issue that consumed most of my time was figuring out why I was getting text relocations and .eh_frame issues in libxul.so. I learned everything I could about linkers and the ELF file format, and about libxul.so. Even to the point of reading Mike Hommey's blog and learning more about him, his interests, and the reasons behind his weird linker hacks and frustration with manual component registration than I really should have. I even found out that apparently on OI's official Firefox 52 build, the guy who got everything else working gave up and tried in desperation to build libxul.so with GNU LD (which usually doesn't work out) and use the Sun linker for the rest of it, and they were lucky that it worked. Turns out the reason that made it work is because Mozilla packaged a mapfile for GNU LD, and that linker actually fails even worse than the Sun linker without it.
However, it turned out that I had been trying to solve a problem I hadn't yet run into. My actual build issue was because of libffi, and it took me a while to figure out that the build was relying on an external script to configure libffi that was making incorrect assumptions about several things. The first issue was that it assumed I wanted my .eh_frame sections to be read-only just because I'm on x86. Well, that's not a safe assumption on Solaris, where you want a writable .eh_frame. Then I saw tons of text relocations, so I started researching how to avoid text relocations in PIC code (which Solaris seems to require). Then I found out you actually can't avoid them completely, because assembler code needs to access the global offset table at some point, and usually needs a PC relative relocation to do so. Then I remembered a comment I saw in a libffi source code file: "Solaris uses datarel encoding for PIC on x86." So I figured out that I had to enable that hack by changing Mozilla's libffi configuration not to use PC relative relocations on Solaris x86. So it does have a mechanism for allowing relative relocations of some kind, just not PC relative ones. That got rid of most of the text relocations, but I was still getting them in a file called win32.S, which was always included whether I wanted/needed it or not. I eventually looked at that code and found that the Solaris hack was not available there, and instead it hardcoded PC relative encoding. I was somehow able to look at that hack from sysv.S and copy it into win32.S, perform the same tests, and make it apply datarel encoding where necessary (easier than it sounds if you see the file). By this point I'd already fixed an issue that made the libxul.so modules appear out of order on Solaris with a patch from Mozilla, so everything worked.
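To make the "PC relative" versus "datarel" distinction a bit more concrete: in an .eh_frame section, the way pointers are encoded is described by a single byte that combines a base (what the value is relative to) with a format (how the value is stored). The sketch below just builds that byte. The DW_EH_PE_* values are the standard DWARF exception-handling encoding constants; the function names are made up for illustration, and this is not libffi's code, which does the equivalent selection with preprocessor tests in its assembler files.

```c
#include <assert.h>
#include <stdint.h>

/* Standard DWARF EH pointer-encoding constants. */
enum {
    DW_EH_PE_sdata4  = 0x0b,  /* format: signed 4-byte value      */
    DW_EH_PE_pcrel   = 0x10,  /* base: relative to the field's PC */
    DW_EH_PE_datarel = 0x30   /* base: relative to a data base    */
};

/* Combine a base (high nibble) with a format (low nibble). */
uint8_t fde_pointer_encoding(uint8_t base, uint8_t format)
{
    assert((base & 0x0f) == 0);    /* base lives in the high nibble  */
    assert((format & 0xf0) == 0);  /* format lives in the low nibble */
    return (uint8_t)(base | format);
}

/* Hypothetical selector: pcrel where it is usable, datarel otherwise
 * (the Solaris x86 case described above). */
uint8_t pick_fde_encoding(int use_pcrel)
{
    return fde_pointer_encoding(
        use_pcrel ? DW_EH_PE_pcrel : DW_EH_PE_datarel,
        DW_EH_PE_sdata4);
}
```

So switching from PC relative to datarel encoding comes down to emitting a different encoding byte (0x3b instead of 0x1b) and expressing the addresses relative to a different base.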
After this, I was finally able to build the browser, but it crashed almost immediately with an assertion failure on NS_IsMainThread() in NSS that only one person had ever gotten before, and in their case it was an SSL policy issue. I found a way to avoid crashing right away by sheer accident: if I specified the word "file" on the command line, it took me to a very simple HTTP page called file.com, with nothing but a single image on it advertising some kind of file storage service or something. None of the stacktraces really helped or made much sense; it appeared that the attempt to initialize NSS was itself the cause of the failure.
I compiled a debug version, took a crash course in how to read stacktraces, and tried in desperation applying several patches I didn't think were necessary and didn't really even like. I found this set of patches from Mozilla upstream that stabilized the browser and stopped the assertion failure, but only got it to work offline. It was able to load up XUL plugins and offline saved web pages in this state, as well as show about:config and such. It generated error pages saying the PSM component appeared to be broken or disabled. I could see threads in gdb spinning up and then crashing immediately every time I'd try to go online. I thought that NSS was completely busted for some reason. I even tried running the NSS test suite, but it passed and nothing seemed to be wrong.
I applied one patch that changed the way the browser looked, completely busted the interface, and kept it from saving any history, but only because I had typed it in wrong.
I had a weird feeling this might have changed or fixed something else, so I removed the temporary NSS patches and tried loading the browser again... and although the interface was still broken, I could now type in any URL I wanted, and nothing crashed. For some reason, even YouTube was working in this state. Though it took a full minute for a video to start playing, it was smooth once it started playing back. It's a feat I haven't been able to replicate since, the videos just refuse to play entirely due to a software raster feature failure or something. The only change I'd made recently that seemed like it could have fixed things was a change to compile NSS and NSPR with pthreads after seeing that the repositories for the official OS versions had added them in.
Thinking that adding pthreads had solved the problem (a suggestion my mind was vulnerable to because I remembered inexplicable segfaults on Linux 20 years ago due to things being compiled without them by default), I fixed that typo... and the browser started crashing again.
So I assumed that maybe something was wrong with SQLite, if busting the database access by accident had somehow made the browser work after resolving the NSS issue. I ended up making absolutely sure that SQLite built with -D_POSIX_PTHREAD_SEMANTICS and set it up to include a linker mapfile provided from the OI repositories to make absolutely sure it built correctly. And then everything started working again. I assumed I'd finally done it... but the next day, while trying to get YouTube to work again and making very small changes, I was getting the same problem again with every build of the browser, even with the exact same configuration that had worked before.
When I figured out why, I felt like a huge idiot. You want to know what the difference was between the browser successfully running and crashing this whole time, since getting it to build? It was which terminal window I ran ./mach run from. Why? Because I'd used one of those terminal windows to run the NSS test suite. Why would that make a difference? While I was running the test suite... I'd added the files in dist/bin in the object directory to LD_LIBRARY_PATH, because the test suite didn't know where to look for its own object files. So whenever I tried to run the browser from the terminal window where I'd added the NSS I'd just built to LD_LIBRARY_PATH, everything worked fine, and when I ran it from the other one, it crashed. And so the last several patches I'd been applying and things I'd thought I'd been doing to fix or break the browser were actually completely irrelevant. I'd probably had it working since the first time I got it built and didn't realize it had no idea where to find its own libraries in the build directory.
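In hindsight, a trivial sanity check would have saved me days. Here is a minimal sketch of one, assuming LD_LIBRARY_PATH is a plain colon-separated list of directories. The function names are made up for illustration, and the directory you would check for is whatever your own object directory's dist/bin path happens to be.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if `dir` is exactly one of the colon-separated entries
 * in `path_list`, 0 otherwise. A NULL list contains nothing. */
int path_list_contains(const char *path_list, const char *dir)
{
    const char *entry;
    size_t dir_len;

    assert(dir != NULL);
    if (path_list == NULL)
        return 0;

    dir_len = strlen(dir);
    entry = path_list;
    for (;;) {
        const char *end = strchr(entry, ':');
        size_t len = end ? (size_t)(end - entry) : strlen(entry);

        /* Compare whole entries so "/a/bin" does not match "/a/binary". */
        if (len == dir_len && strncmp(entry, dir, len) == 0)
            return 1;
        if (end == NULL)
            break;
        entry = end + 1;
    }
    return 0;
}

/* Hypothetical convenience wrapper around the current environment. */
int ld_library_path_has(const char *dir)
{
    return path_list_contains(getenv("LD_LIBRARY_PATH"), dir);
}
```

Running something like ld_library_path_has on the dist/bin path in both terminal windows would have shown immediately that the two environments were different.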
So yeah, apparently now it builds and runs on Solaris perfectly fine. VP9 videos work, YouTube videos try to work for a few frames and then stop, but I have a feeling it might work better on actual hardware rather than using a software renderer in a VM. I have to disable Libevent's use of Solaris event ports for some weird reason to stop websites from sending PHP files to me rather than trying to parse them on the server. But yeah, I somehow got this to work in just under a month, I think. It helped a lot that the browser hasn't had extensive changes to memory handling or assembler code, that there were a ton of existing patches to a code base very similar to this one for Solaris support (though there were a lot of low-quality ones and I did have to try and sift through them), and that most of the potential trouble points were in external libraries anyway.