Title: TinyCC to GCC gap is slowly closing
Date: 2024-05-02
Category:
Tags: Bootstrapping GCC in RISC-V
Slug: bootstrapGcc13
Lang: en
Summary: The sidetrack we took in the past started to give us some good news. Here are some.

In [previous episodes we talked about getting sidetracked](/bootstrapGcc11.html) and we mentioned we needed to build Musl because we had limitations in our standard library. We didn't explain them in detail and I think it's the moment to do so, as many of the changes we proposed there have been tested and upstreamed, and to explain the ramifications that process had.

#### Symptoms

TinyCC and our MeslibC are powerful enough to build Binutils, but not enough to make some of its programs, like GNU As, work.

MeslibC is supersimple, meaning it doesn't really implement some of the things you might consider obvious. One of the best examples is `fopen`. Instead of returning a fresh `FILE` structure, in MeslibC `fopen` simply returns the underlying file descriptor, as returned by the kernel's `open` call. This is not a big problem, as the `fread` and `fclose` provided with MeslibC are compatible with this behaviour, but there's a very specific case where this is a problem.

In GNU As, if no file is given as an input, it just tries to read from standard input, and it fails, saying there was no valid file descriptor. Why?

Let's read the code GNU As uses to read files (`gas/input-file.c`):

``` clike
/* Open the specified file, "" means stdin.  Filename must not be null.  */

void
input_file_open (const char *filename,
                 int pre)
{
  int c;
  char buf[80];

  preprocess = pre;

  gas_assert (filename != 0);   /* Filename may not be NULL.  */
  if (filename[0])
    {
      f_in = fopen (filename, FOPEN_RT);
      file_name = filename;
    }
  else
    {
      /* Use stdin for the input file.  */
      f_in = stdin;
      /* For error messages.  */
      file_name = _("{standard input}");
    }

  if (f_in == NULL)
    {
      as_bad (_("can't open %s for reading: %s"),
              file_name, xstrerror (errno));
      return;
    }

  c = getc (f_in);

  /* ... Continues ...*/
```

If MeslibC uses file descriptor integers as `FILE` structures, it's not hard to detect the problem in the example. When the selected filename is empty (no file to read from), `filename[0]` is false (the `\0` character) and `f_in` is set to `stdin`. Normally that means some `FILE` structure with an internal file descriptor of value `0`, the one corresponding to the standard input. As the structure is not `NULL`, the error message below won't trigger.

As I just explained, MeslibC uses the kernel's file descriptors instead of `FILE` structures, so `stdin` in MeslibC is just `0`, which is equal to `NULL` for the compiler, so the error message is triggered and the execution stops. MeslibC's clever solution for files simply fails because C has no error types, and errors are signalled in the standard library using `NULL`.

This is just a simple case to exemplify how MeslibC affects our bootstrapping chain, but there are others. For example, MeslibC can't `ungetc` more than once because that was enough for the bootstrapping as it was designed for x86, but as we moved to a more recent binutils version (the first one supporting RISC-V), that became an obstacle, and it's preventing us from running GNU As.

Of course, all of these problems could be fixed in MeslibC, but in the end the goal of MeslibC is not to be a proper C standard library implementation, but a helper for the bootstrapping of more powerful standard libraries that already exist.

These problems, and some others we also found, are just drawing the line of *when* we need to jump to a more mature C standard library in our chain. Looks like binutils is where that line is drawn.

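
To make the `stdin` failure concrete, here is a minimal sketch in portable C. Nothing in it is MeslibC or GNU As code: `fd_stream`, `fd_as_stream`, `struct obj_stream`, `obj_stdin` and `looks_like_open_failure` are hypothetical names invented for illustration, and it assumes the usual platforms where converting the integer `0` to a pointer yields the null pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the MeslibC approach: the "stream" is just
   the kernel file descriptor stored in a pointer-sized value.  */
typedef void *fd_stream;

fd_stream
fd_as_stream (int fd)
{
  assert (fd >= 0);             /* Only valid descriptors are wrapped.  */
  return (fd_stream) (intptr_t) fd;
}

/* Hypothetical stand-in for a conventional libc: the stream is a real
   object that keeps the descriptor inside.  */
struct obj_stream
{
  int fd;
};

struct obj_stream obj_stdin = { 0 };

/* The same kind of check input_file_open does: NULL means the file
   could not be opened.  */
int
looks_like_open_failure (const void *stream)
{
  return stream == NULL;
}
```

With these definitions, `looks_like_open_failure (fd_as_stream (0))` is true, while the check is false for `&obj_stdin` and for `fd_as_stream (3)`. Only descriptor `0` collides with `NULL`, which is why reading from a named file works and only the standard input case breaks.
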
#### Musl

The bootstrapping chain as conceived in Guix uses GLibC, as Guix is a GNU project, but we found Musl to be a more suitable C standard library for these initial steps, as it is simple and easy to build while keeping all the functionality you might expect from a proper C standard library.

We ran into some issues though, as upstream TinyCC's RISC-V backend wasn't ready to build it.

First of all, TinyCC's RISC-V backend had no support for Extended Asm, so I implemented it and sent it upstream. Once I did that we built Musl and we realized we had issues in some functions. The problem was that the Extended Asm implementation was not understanding the constraints properly, and those parameters marked as read and write were not handled correctly. I talked with Michael, the author of that piece of code, because I didn't understand the behaviour well. He guided me a little and I proceeded to fix it in all architectures.

Still, we couldn't build Musl because it was using some atomic instructions that were not implemented in TinyCC's RISC-V assembler, so at first we decided to avoid them by patching around them in Musl. They happened to be important for memory allocation (LOL), so I decided to implement them in TinyCC's assembler and push the changes upstream. I implemented `lr` (load reserved), `sc` (store conditional) and extended `fence`'s behaviour to match what the GNU Assembler (the reference RISC-V assembler) would do.

Still, this wasn't enough for Musl to build properly, as TinyCC's RISC-V backend was not implemented as a proper assembler but as instructions in human readable text. RISC-V is a RISC architecture and makes heavy use of pseudoinstructions to ease the development of assembly programs. Before all this work, TinyCC only implemented simple instructions and almost no pseudoinstruction expansion.
Also, its architecture couples argument parsing with relocation generation, and that doesn't really help to implement pseudoinstructions with a variable argument count or default values.

I added enough code to avoid falling into the problems this design decision had and pushed everything upstream. The list includes support for many pseudoinstructions, proper relocation use for several instruction families like `jal` and branches, and some other things.

In the end, we do not have a fully featured assembler yet, but we do have enough to build the simple code we find in a C standard library like Musl. In fact, it can now do so using the syntax that any RISC-V assembler would expect, as I explained in more detail [here](/bootstrapGcc11.html).

#### MeslibC

Once all those changes are finally applied to TinyCC, we can remove the weird split we needed to do in MeslibC to make it match the TinyCC assembly syntax, so I did that. Less code, fewer problems.

Also, my colleague Andrius added a `realpath` stub, so we are able to build upstream TinyCC without having to patch the places where `realpath` is used in it. `realpath` is not a simple function to implement, and it's not critical in TinyCC. Again, MeslibC doesn't need to be perfect, it only needs to let us start building everything.

#### TinyCC

With all those changes coming to MeslibC and the ones we upstreamed, we now don't need to patch on top of upstream TinyCC, so all our small changes on top of it are dropped now. Less code, fewer problems.

We could have kept these changes for ourselves, but sharing them is not only easier, but also better for everyone. The following is the complete list of changes I upstreamed to TinyCC, a project that we are not really part of, but this is what we do and what we believe in.

* `0aca8611` fixup! riscv: Implement large addend for global address
* `8baadb3b` riscv: asm: implement `j offset`
* `15977630` riscv: asm: Add branch to label
* `671d03f9` riscv: Add full `fence` instruction support
* `c9940681` riscv: asm: Add load-reserved and store-conditional
* `0703df1a` Fix Extended Asm ignored constraints
* `6b3cfdd0` riscv: Add extended assembly support
* `e02eec6b` riscv: fix jal: fix reloc and parsing
* `02391334` fixup! riscv: Add .option assembly directive (unimp)
* `cbe70fa6` riscv: Add .option assembly directive (unimp)
* `618c1734` riscv: libtcc1.c support some builtins for \_\_riscv
* `3782da8d` riscv: Support $ in identifiers in extended asm.
* `e2d8eb3d` riscv: jal: Add pseudo instruction support
* `409007c9` riscv: jalr: implement pseudo and parse like GAS
* `8bfef6ab` riscv: Add pseudoinstructions
* `8cbbd2b8` riscv: Use GAS syntax for loads/stores:
* `019d10fc` riscv: Move operand parsing to a separate function
* `7bc0cb5b` riscv: Implement large addend for global address

#### Bootstrappable TinyCC

During the bootstrapping process we detected new issues, and one of them was so deep it took pretty long to detect and solve.

Most of the programs we were building with our Bootstrappable TinyCC worked: GZip, Make... But we reached a point where we needed to rebuild upstream TinyCC with Musl, in order to start using Musl to build the next programs. It didn't work.

We had a really hard time finding the problem behind this because it appeared too far down the chain to be easy to track.

The process goes like this. We use Mes to build our very first Bootstrappable TinyCC, which compiles itself several times (6), until it reaches its final state. That then builds upstream TinyCC, and with that we build TinyCC again, this time using Musl as its standard library. We found this last one was unable to build simple files and we started digging.
We realized TinyCC was using sign extension in `unsigned` values, and that was messing up the next TinyCC, making it unable to build programs correctly. Researching this deeply we found the problem was in the `load` function of TinyCC, but a TinyCC built with GCC didn't have this problem. The only explanation was that the Bootstrappable TinyCC had a bug that was later affecting the compilers compiled with it.

Digging a little bit further, I found the casts in Bootstrappable TinyCC had some missing cases that I didn't backport properly, but as I wasn't able to understand them very well I decided to backport the full `gen_cast` function from upstream to the Bootstrappable TinyCC. With that, the errors from TinyCC were gone.

It feels like an accidental trusting trust attack, yes. These are the kind of things we have to deal with, and they are pretty tiring and frustrating to find.

#### The new Bootstrapping chain

So, all of this brings us to the new bootstrapping chain. We need to do things very differently from the way Guix does them right now, because we are skipping many steps (GCC 2.95, now we need Musl for Binutils...), so I started [a project](https://github.com/ekaitz-zarraga/commencement.scm) to track how we go forward in the bootstrapping chain (it's just a WIP, for our tests, take that into account).

We had good and bad news in that regard. At the moment of writing we managed to build up to the GCC 4.6.4 I added RISC-V support to, but the compiler is faulty and it's unable to build itself again with C++ support.

I'm using non-bootstrapped versions of `flex` and `bison`, but those shouldn't be hard to bootstrap either. I just didn't have the time to make them from scratch. And I'm using `bash` instead of `gash` because we found a blocking error in `gash` that is not letting us continue forward from Binutils.
In any case, this means we are close to the next milestone: building GCC 4.6.4 with TinyCC. And as we described in the previous post, we already built GCC 7.5 from GCC 4.6.4, so we have already solved the one after it. After those, we would need to clean up this new bootstrapping chain and talk with Guix about its inclusion there.

I hope we can finish all this before hitting the deadline that is silently approaching...