Diffstat (limited to 'content/bootstrapGcc')
-rw-r--r-- | content/bootstrapGcc/13_tcc_to_gcc.md | 243 |
1 files changed, 243 insertions, 0 deletions
diff --git a/content/bootstrapGcc/13_tcc_to_gcc.md b/content/bootstrapGcc/13_tcc_to_gcc.md
new file mode 100644
index 0000000..0d3bcde
--- /dev/null
+++ b/content/bootstrapGcc/13_tcc_to_gcc.md
@@ -0,0 +1,243 @@
+Title: TinyCC to GCC gap is slowly closing
+Date: 2024-05-02
+Category:
+Tags: Bootstrapping GCC in RISC-V
+Slug: bootstrapGcc13
+Lang: en
+Summary: The sidetrack we took in the past started to give us some good news.
+    Here is some of it.
+
+In [previous episodes we talked about getting
+sidetracked](/bootstrapGcc11.html) and we mentioned we needed to build Musl
+because we had limitations in our standard library. We didn't explain them in
+detail and I think it's the moment to do so, as many of the changes we proposed
+there have been tested and upstreamed, and to explain the ramifications that
+process had.
+
+#### Symptoms
+
+TinyCC and our MeslibC are powerful enough to build Binutils, but not enough to
+make some of the programs, like GNU As, work.
+
+MeslibC is supersimple, meaning it doesn't really implement some of the things
+you might consider obvious. One of the best examples is `fopen`. Instead of
+returning a fresh `FILE` structure, in MeslibC `fopen` simply returns the
+underlying file descriptor, as returned by the kernel's `open` call. This is
+not a big problem, as the `fread` and `fclose` provided with MeslibC are
+compatible with this behaviour, but there's a very specific case where this is
+a problem. In GNU As, if no file is given as an input, it just tries to read
+from standard input, and it fails, saying there was no valid file descriptor.
+Why? Let's read the code GNU As uses to read files (`gas/input-file.c`):
+
+``` clike
+/* Open the specified file, "" means stdin. Filename must not be null. */
+
+void
+input_file_open (const char *filename,
+                 int pre)
+{
+  int c;
+  char buf[80];
+
+  preprocess = pre;
+
+  gas_assert (filename != 0); /* Filename may not be NULL.
*/
+  if (filename[0])
+    {
+      f_in = fopen (filename, FOPEN_RT);
+      file_name = filename;
+    }
+  else
+    {
+      /* Use stdin for the input file. */
+      f_in = stdin;
+      /* For error messages. */
+      file_name = _("{standard input}");
+    }
+
+  if (f_in == NULL)
+    {
+      as_bad (_("can't open %s for reading: %s"),
+              file_name, xstrerror (errno));
+      return;
+    }
+
+  c = getc (f_in);
+  /* ... Continues ...*/
+```
+
+If MeslibC uses file descriptor integers as `FILE` structures, it's not hard to
+detect the problem in the example. For the cases where the selected filename is
+empty (no file to read from) `filename[0]` will be false (`\0` character), and
+`f_in` will be set to `stdin`. That should normally mean some `FILE` structure
+with an internal file descriptor of value `0`, the one corresponding to the
+standard input. As the structure is not `NULL` the error message below won't
+trigger. As I just explained, MeslibC uses the kernel's file descriptors instead
+of `FILE` structures so `stdin` in MeslibC is just `0`, which is equal to `NULL`
+for the compiler, so the error message is triggered and the execution stops.
+
+MeslibC's clever solution for files is simply failing due to the fact that
+C has no error types, and errors are signalled in the standard library using
+`NULL`.
+
+This is just a simple case to exemplify how MeslibC affects our bootstrapping
+chain, but there are others. For example, MeslibC only supports one `ungetc` at
+a time, as that was enough for the bootstrapping as it was designed for x86, but
+as we moved to a more recent binutils version (the first one supporting
+RISC-V), that became an obstacle, and it's preventing us from running GNU As.
+
+Of course, all of these problems could be fixed in MeslibC, but in the end the
+goal of MeslibC is not to be a proper C standard library implementation, but a
+helper for the bootstrapping of more powerful standard libraries that already
+exist.
These problems, and some others we also found, are just drawing the line
+of *when* we need to jump to a more mature C standard library in our chain.
+Looks like binutils is where that line is drawn.
+
+#### Musl
+
+The bootstrapping chain as conceived in Guix uses GLibC, as Guix is a GNU
+project, but we found Musl to be a more suitable C standard library for these
+initial steps as it is simple and easy to build while keeping all the
+functionality you might expect from a proper C standard library.
+
+We ran into some issues though, as upstream TinyCC's RISC-V backend wasn't
+ready to build it.
+
+First of all, TinyCC's RISC-V backend had no support for Extended Asm, so I
+implemented it and sent it upstream.
+
+Once I did that we built Musl and we realized we had issues in some functions.
+The problem was the Extended Asm implementation was not understanding the
+constraints properly and those parameters marked as read and write were not
+considered correctly. I talked with Michael, the author of that piece of code,
+because I didn't understand the behaviour well. He guided me a little and I
+proceeded to fix it in all architectures.
+
+Still, we couldn't build Musl because it was using some atomic instructions
+that were not implemented in TinyCC's RISC-V assembler and we decided to avoid
+them, patching around them in Musl. They happened to be important for memory
+allocation (LOL) so I decided to implement them in TinyCC's assembler and push
+the changes upstream. I implemented `lr` (load reserved), `sc` (store
+conditional) and extended `fence`'s behaviour to match what the GNU Assembler
+(the reference RISC-V assembler) would do.
+
+Still this wasn't enough for Musl to build properly as TinyCC's RISC-V backend
+was not implemented as a proper assembler but as instructions in human readable
+text. RISC-V is a RISC architecture and makes heavy use of pseudoinstructions
+to ease the development of assembly programs.
Before all this work, TinyCC
+only implemented simple instructions and almost no pseudoinstruction expansion.
+
+Also, its architecture couples argument parsing with relocation generation and
+it doesn't really help to implement pseudoinstructions with variable argument
+count or default values. I added enough code to avoid falling into the problems
+this design decision had and pushed everything upstream. The list includes
+support for many pseudoinstructions, proper relocation use for several
+instruction families like `jal` and branches, and some other things. In the
+end, we do not have a fully featured assembler yet, but we do have enough to
+build the simple code we find in a C standard library like Musl. In fact, it
+even uses the syntax that any RISC-V assembler would expect, as I explained in
+more detail [here](/bootstrapGcc11.html).
+
+#### MeslibC
+
+Once all those changes are finally applied to TinyCC, we can remove the weird
+split we needed to do in MeslibC to make it match the TinyCC assembly
+syntax, so I did that. Less code, fewer problems.
+
+Also my colleague Andrius added a `realpath` stub, so we are able to build
+upstream TinyCC without having to patch the places where `realpath` was used in
+it. `realpath` is not a simple function to implement, and it's not critical in
+TinyCC. Again, MeslibC doesn't need to be perfect, it only needs to let us start
+building everything.
+
+#### TinyCC
+
+With all those changes coming to MeslibC and the ones we upstreamed, we now
+don't need to patch on top of upstream TinyCC, so all our small changes on top
+of it are dropped now. Less code, fewer problems.
+
+We could have kept these changes for ourselves, but sharing them is not only
+easier, but also better for everyone. The following is the complete list of
+changes I upstreamed to TinyCC, a project that we are not really part of, but
+this is what we do and what we believe in.
+
+* `0aca8611` fixup!
riscv: Implement large addend for global address
+* `8baadb3b` riscv: asm: implement `j offset`
+* `15977630` riscv: asm: Add branch to label
+* `671d03f9` riscv: Add full `fence` instruction support
+* `c9940681` riscv: asm: Add load-reserved and store-conditional
+* `0703df1a` Fix Extended Asm ignored constraints
+* `6b3cfdd0` riscv: Add extended assembly support
+* `e02eec6b` riscv: fix jal: fix reloc and parsing
+* `02391334` fixup! riscv: Add .option assembly directive (unimp)
+* `cbe70fa6` riscv: Add .option assembly directive (unimp)
+* `618c1734` riscv: libtcc1.c support some builtins for \_\_riscv
+* `3782da8d` riscv: Support $ in identifiers in extended asm.
+* `e2d8eb3d` riscv: jal: Add pseudo instruction support
+* `409007c9` riscv: jalr: implement pseudo and parse like GAS
+* `8bfef6ab` riscv: Add pseudoinstructions
+* `8cbbd2b8` riscv: Use GAS syntax for loads/stores:
+* `019d10fc` riscv: Move operand parsing to a separate function
+* `7bc0cb5b` riscv: Implement large addend for global address
+
+#### Bootstrappable TinyCC
+
+During the bootstrapping process we detected new issues and one of them was so
+deep it took pretty long to detect and solve.
+
+Most of the programs we were building with our Bootstrappable TinyCC worked:
+GZip, Make... But we reached a point where we needed to rebuild upstream TinyCC
+with Musl, in order to start using Musl to build the next programs. It didn't
+work.
+
+We had a really hard time finding the problem behind this because it appeared
+too far down the chain to be easy to trace. The process goes like this.
+
+We use Mes to build our very first Bootstrappable TinyCC, which compiles itself
+several times (6), until it reaches its final state. That then builds upstream
+TinyCC and with that we build TinyCC again this time using Musl as its standard
+library. We found this last one was unable to build simple files and we started
+digging.
+
+We realized TinyCC was using sign extension in `unsigned` values, and that was
+messing up the next TinyCC, making it unable to build programs correctly.
+Looking deeper into this we found the problem was in the `load` function of
+TinyCC but a TinyCC built with GCC didn't have this problem. The only option
+was that the Bootstrappable TinyCC had the bug that was later affecting the
+compilers compiled with it.
+
+Digging a little bit further I found the casts from Bootstrappable TinyCC had
+some missing cases that I didn't backport properly but as I wasn't able to
+understand them very well I decided to backport the full `gen_cast` function
+from upstream to the Bootstrappable TinyCC. With that, the errors from TinyCC
+were gone.
+
+It feels like an accidental trusting trust attack, yes. These are the kind of
+things we have to deal with, and they are pretty tiring and frustrating to
+find.
+
+#### The new Bootstrapping chain
+
+So, all of this brings us to the new bootstrapping chain. We need to do
+things very differently from the way Guix does them right now, because we are
+skipping many steps (GCC 2.95, now we need Musl for Binutils...) so I started
+[a project](https://github.com/ekaitz-zarraga/commencement.scm) to track how we
+go forward in the bootstrapping chain (it's just a WIP, for our tests, take
+that into account).
+
+We had good and bad news in that regard. At the moment of writing we managed to
+build up to the GCC 4.6.4 I added RISC-V support to, but the compiler is faulty
+and it's unable to build itself again with the C++ support.
+
+I'm using non-bootstrapped versions of `flex` and `bison`, but those
+shouldn't be hard to bootstrap either. I just didn't have the time to make them
+from scratch. And I'm using `bash` instead of `gash` because we had found a
+blocking error in `gash` that is not letting us continue forward from Binutils.
+
+In any case, this means we are close to the next milestone: building GCC 4.6.4
+with TinyCC; and as we described in the previous post we already built GCC 7.5
+from GCC 4.6.4 so the step after that is already solved.
+
+After those, we would need to clean this new bootstrapping chain and talk with
+Guix for its inclusion in there. I hope we can finish all this before hitting
+the deadline that is silently approaching...