-rw-r--r-- content/bootstrapGcc/13_tcc_to_gcc.md 243
1 files changed, 243 insertions, 0 deletions
diff --git a/content/bootstrapGcc/13_tcc_to_gcc.md b/content/bootstrapGcc/13_tcc_to_gcc.md
new file mode 100644
index 0000000..0d3bcde
--- /dev/null
+++ b/content/bootstrapGcc/13_tcc_to_gcc.md
@@ -0,0 +1,243 @@
+Title: TinyCC to GCC gap is slowly closing
+Date: 2024-05-02
+Category:
+Tags: Bootstrapping GCC in RISC-V
+Slug: bootstrapGcc13
+Lang: en
+Summary: The sidetrack we took in the past has started to give us some good news.
+ Here is some of it.
+
+In [previous episodes we talked about getting
+sidetracked](/bootstrapGcc11.html) and we mentioned we needed to build Musl
+because we had limitations in our standard library. We didn't explain them in
+detail back then, and I think now is the moment to do so, as many of the changes
+we proposed there have been tested and upstreamed. I'll also explain the
+ramifications that process had.
+
+#### Symptoms
+
+TinyCC and our MeslibC are powerful enough to build Binutils, but not enough to
+make some of its programs, like GNU As, work.
+
+MeslibC is very simple, meaning it doesn't implement some of the things
+you might take for granted. One of the best examples is `fopen`. Instead of
+returning a fresh `FILE` structure, in MeslibC `fopen` simply returns the
+underlying file descriptor, as returned by the kernel's `open` call. This is
+not a big problem, as the `fread` and `fclose` provided with MeslibC are
+compatible with this behaviour, but there's a very specific case where this is
+a problem. In GNU As, if no file is given as an input, it just tries to read
+from standard input, and it fails, saying there was no valid file descriptor.
+Why? Let's read the code GNU As uses to read files (`gas/input-file.c`):
+
+``` clike
+/* Open the specified file, "" means stdin. Filename must not be null. */
+
+void
+input_file_open (const char *filename,
+                 int pre)
+{
+  int c;
+  char buf[80];
+
+  preprocess = pre;
+
+  gas_assert (filename != 0); /* Filename may not be NULL. */
+  if (filename[0])
+    {
+      f_in = fopen (filename, FOPEN_RT);
+      file_name = filename;
+    }
+  else
+    {
+      /* Use stdin for the input file. */
+      f_in = stdin;
+      /* For error messages. */
+      file_name = _("{standard input}");
+    }
+
+  if (f_in == NULL)
+    {
+      as_bad (_("can't open %s for reading: %s"),
+              file_name, xstrerror (errno));
+      return;
+    }
+
+  c = getc (f_in);
+  /* ... Continues ...*/
+```
+
+If MeslibC uses file descriptor integers as `FILE` structures, it's not hard to
+spot the problem in the example. When the selected filename is empty (no file
+to read from), `filename[0]` is false (the `\0` character) and `f_in` is set to
+`stdin`. Normally that would be a `FILE` structure with an internal file
+descriptor of value `0`, the one corresponding to standard input. As the
+pointer to that structure is not `NULL`, the error message below wouldn't
+trigger. But, as I just explained, MeslibC uses the kernel's file descriptors
+instead of `FILE` structures, so `stdin` in MeslibC is just `0`, which the
+compiler treats as equal to `NULL`. The error message is triggered and the
+execution stops.
+
+MeslibC's clever shortcut for files fails here because C has no error types,
+and the standard library signals errors using `NULL`.
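+
+A minimal sketch of that collision, assuming a hypothetical libc that stores
+the descriptor directly in the pointer (`MODEL_FILE` and all the helpers are
+made up for illustration, they are not MeslibC's actual code):
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical stand-in for FILE: an opaque type that is never allocated. */
+typedef struct model_file MODEL_FILE;
+
+/* Assumed behaviour: the "stream" is the descriptor itself, cast to a
+   pointer.  The integer-to-pointer conversion is implementation-defined in
+   C; on the usual platforms zero converts to the null pointer. */
+MODEL_FILE *
+model_stream_from_fd (int fd)
+{
+  return (MODEL_FILE *) (intptr_t) fd;
+}
+
+/* The same test GNU As applies to f_in before reading from it. */
+int
+looks_like_open_failure (MODEL_FILE *f_in)
+{
+  return f_in == NULL;
+}
+
+/* Descriptor 0 (standard input) is mistaken for a failed fopen, while any
+   other descriptor passes the check. */
+void
+check_model (void)
+{
+  assert (looks_like_open_failure (model_stream_from_fd (0)));
+  assert (!looks_like_open_failure (model_stream_from_fd (3)));
+}
+```
+
+Any descriptor other than `0` survives the check, which is why regular files
+work and only standard input is affected.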
+
+This is just a simple case to exemplify how MeslibC affects our bootstrapping
+chain, but there are others. For example, MeslibC can't `ungetc` more than
+once, because that was enough for the bootstrapping chain as it was designed
+for x86. As we moved to a more recent Binutils version (the first one
+supporting RISC-V), that became an obstacle, and it's preventing us from
+running GNU As.
+
+Of course, all of these problems could be fixed in MeslibC, but in the end the
+goal of MeslibC is not to be a proper C standard library implementation, but a
+helper for the bootstrapping of more powerful standard libraries that already
+exist. These problems, and some others we also found, just draw the line of
+*when* we need to jump to a more mature C standard library in our chain. It
+looks like Binutils is where that line is drawn.
+
+#### Musl
+
+The bootstrapping chain as conceived in Guix uses GLibC, as Guix is a GNU
+project, but we found Musl to be a more suitable C standard library for these
+initial steps as it is simple and easy to build while keeping all the
+functionality you might expect from a proper C standard library.
+
+We ran into some issues though, as upstream TinyCC's RISC-V backend wasn't
+ready to build it.
+
+First of all, TinyCC's RISC-V backend had no support for Extended ASM, so I
+implemented it and sent it upstream.
+
+Once I did that we built Musl and we realized some functions misbehaved.
+The problem was that the Extended Asm implementation did not interpret the
+constraints properly, and operands marked as both read and written were not
+handled correctly. I talked with Michael, the author of that piece of code,
+because I didn't understand the behaviour well. He guided me a little and I
+proceeded to fix it in all architectures.
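+
+To illustrate what a read and write operand is, here is a minimal sketch in
+GCC's Extended Asm syntax. The function names are hypothetical and the snippet
+is not taken from Musl or TinyCC. The `+r` constraint says that operand `%0` is
+both read and written by the instruction; a compiler that treated it as
+write-only could skip loading the initial value of `acc` into the register.
+
+```c
+#include <assert.h>
+
+/* Adds inc to acc, using a read-write ("+r") operand on RISC-V. */
+int
+add_in_place (int acc, int inc)
+{
+#if defined (__riscv)
+  /* %0 is acc (read and written), %1 is inc (only read). */
+  __asm__ ("add %0, %0, %1" : "+r" (acc) : "r" (inc));
+#else
+  acc += inc; /* Portable fallback so the sketch builds elsewhere. */
+#endif
+  return acc;
+}
+
+/* The result depends on the initial value of acc being honoured. */
+void
+check_add_in_place (void)
+{
+  assert (add_in_place (40, 2) == 42);
+}
+```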
+
+Still, we couldn't build Musl because it used some atomic instructions that
+were not implemented in TinyCC's RISC-V assembler. At first we decided to avoid
+them, patching around them in Musl, but they happened to be important for memory
+allocation (LOL), so I decided to implement them in TinyCC's assembler and push
+the changes upstream. I implemented `lr` (load reserved), `sc` (store
+conditional) and extended `fence`'s behaviour to match what the GNU Assembler
+(the reference RISC-V assembler) would do.
+
+Still, this wasn't enough for Musl to build properly, as TinyCC's RISC-V backend
+was not implemented as a proper assembler but as plain instructions in
+human-readable text. RISC-V is a RISC architecture and makes heavy use of
+pseudoinstructions to ease the development of assembly programs. Before all
+this work, TinyCC only implemented simple instructions and almost no
+pseudoinstruction expansion.
+
+Also, its architecture couples argument parsing with relocation generation,
+which makes it hard to implement pseudoinstructions with a variable argument
+count or default values. I added enough code to work around the problems this
+design decision caused and pushed everything upstream. The list includes
+support for many pseudoinstructions, proper relocation use for several
+instruction families like `jal` and branches, and some other things. In the
+end, we do not have a fully featured assembler yet, but we do have enough to
+build the simple code we find in a C standard library like Musl. In fact, we
+can even use the syntax that any RISC-V assembler would expect, as I explained
+in more detail [here](/bootstrapGcc11.html).
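+
+As a rough illustration of what pseudoinstruction expansion involves, the
+sketch below maps a few standard RISC-V pseudoinstructions to the base
+instructions they stand for. It is not TinyCC's code: the table layout and the
+`expand_pseudo` helper are hypothetical. Keying the table on the operand count
+shows why variable argument counts and default values matter: `jal offset` and
+`jal rd, offset` share a mnemonic, and only the first needs a default `rd`.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <string.h>
+
+struct pseudo_rule
+{
+  const char *mnemonic;  /* What the programmer writes. */
+  int operand_count;     /* How many operands were given. */
+  const char *expansion; /* Base instruction it stands for. */
+};
+
+const struct pseudo_rule pseudo_rules[] = {
+  { "nop", 0, "addi x0, x0, 0" },
+  { "ret", 0, "jalr x0, 0(x1)" },
+  { "j",   1, "jal x0, offset" },
+  { "jal", 1, "jal x1, offset" }, /* rd defaults to x1 (ra). */
+  { "jr",  1, "jalr x0, 0(rs)" },
+  { "mv",  2, "addi rd, rs, 0" },
+};
+
+/* Returns the expansion template, or NULL if the table has no rule for this
+   mnemonic with this number of operands. */
+const char *
+expand_pseudo (const char *mnemonic, int operand_count)
+{
+  size_t i;
+
+  for (i = 0; i < sizeof pseudo_rules / sizeof pseudo_rules[0]; i++)
+    if (strcmp (pseudo_rules[i].mnemonic, mnemonic) == 0
+        && pseudo_rules[i].operand_count == operand_count)
+      return pseudo_rules[i].expansion;
+  return NULL;
+}
+
+void
+check_expand_pseudo (void)
+{
+  assert (strcmp (expand_pseudo ("j", 1), "jal x0, offset") == 0);
+  assert (strcmp (expand_pseudo ("jal", 1), "jal x1, offset") == 0);
+  assert (expand_pseudo ("jal", 2) == NULL); /* Already a base instruction. */
+}
+```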
+
+#### Meslibc
+
+Once all those changes are finally applied to TinyCC, we can remove the weird
+split we needed in MeslibC to make it match the TinyCC assembly syntax, so I
+did that. Less code, less problems.
+
+Also, my colleague Andrius added a `realpath` stub, so we can build upstream
+TinyCC without having to patch the places where it uses `realpath`. `realpath`
+is not a simple function to implement, and it's not critical in TinyCC. Again,
+MeslibC doesn't need to be perfect; it only needs to let us start building
+everything.
+
+#### TinyCC
+
+With all those changes coming to MeslibC and the ones we upstreamed, we no
+longer need to patch upstream TinyCC, so all our small changes on top of it
+have been dropped. Less code, less problems.
+
+We could have kept these changes for ourselves, but sharing them is not only
+easier, but also better for everyone. The following is the complete list of
+changes I upstreamed to TinyCC, a project that we are not really part of, but
+this is what we do and what we believe in.
+
+* `0aca8611` fixup! riscv: Implement large addend for global address
+* `8baadb3b` riscv: asm: implement `j offset`
+* `15977630` riscv: asm: Add branch to label
+* `671d03f9` riscv: Add full `fence` instruction support
+* `c9940681` riscv: asm: Add load-reserved and store-conditional
+* `0703df1a` Fix Extended Asm ignored constraints
+* `6b3cfdd0` riscv: Add extended assembly support
+* `e02eec6b` riscv: fix jal: fix reloc and parsing
+* `02391334` fixup! riscv: Add .option assembly directive (unimp)
+* `cbe70fa6` riscv: Add .option assembly directive (unimp)
+* `618c1734` riscv: libtcc1.c support some builtins for \_\_riscv
+* `3782da8d` riscv: Support $ in identifiers in extended asm.
+* `e2d8eb3d` riscv: jal: Add pseudo instruction support
+* `409007c9` riscv: jalr: implement pseudo and parse like GAS
+* `8bfef6ab` riscv: Add pseudoinstructions
+* `8cbbd2b8` riscv: Use GAS syntax for loads/stores:
+* `019d10fc` riscv: Move operand parsing to a separate function
+* `7bc0cb5b` riscv: Implement large addend for global address
+
+#### Bootstrappable TinyCC
+
+During the bootstrapping process we detected new issues, and one of them was so
+deep that it took a long time to detect and solve.
+
+Most of the programs we were building with our Bootstrappable TinyCC worked:
+GZip, Make... But we reached a point where we needed to rebuild upstream TinyCC
+with Musl, in order to start using Musl to build the next programs. It didn't
+work.
+
+We had a really hard time finding the problem behind this because it appeared
+too far down the chain to be easy to trace. The process goes like this:
+
+We use Mes to build our very first Bootstrappable TinyCC, which compiles itself
+several times (6) until it reaches its final state. That one then builds
+upstream TinyCC, and with that we build TinyCC again, this time using Musl as
+its standard library. We found this last one was unable to build simple files
+and we started digging.
+
+We realized TinyCC was applying sign extension to `unsigned` values, and that
+was breaking the next TinyCC, making it unable to build programs correctly.
+Looking into it more deeply, we found the problem was in TinyCC's `load`
+function, but a TinyCC built with GCC didn't have this problem. The only
+explanation left was that the Bootstrappable TinyCC had a bug that was later
+affecting the compilers compiled with it.
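+
+The difference between the two kinds of extension can be sketched as below.
+The text doesn't say which widths were involved, so the 32-bit to 64-bit case
+and the helper names are assumptions chosen for illustration; this is not code
+from TinyCC.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Correct widening for an unsigned value: the upper 32 bits become zero. */
+uint64_t
+widen_zero_extend (uint32_t value)
+{
+  return (uint64_t) value;
+}
+
+/* Widening as if the value were signed: bit 31 is copied into the upper
+   32 bits. */
+uint64_t
+widen_sign_extend (uint32_t value)
+{
+  uint64_t wide = value;
+
+  if (value & UINT32_C (0x80000000))
+    wide |= UINT64_C (0xffffffff00000000);
+  return wide;
+}
+
+/* Both agree while bit 31 is clear and diverge as soon as it is set. */
+void
+check_widening (void)
+{
+  assert (widen_zero_extend (UINT32_C (0x7fffffff))
+          == widen_sign_extend (UINT32_C (0x7fffffff)));
+  assert (widen_zero_extend (UINT32_C (0x80000000))
+          == UINT64_C (0x0000000080000000));
+  assert (widen_sign_extend (UINT32_C (0x80000000))
+          == UINT64_C (0xffffffff80000000));
+}
+```
+
+A bug of this kind stays hidden for small values, which fits with most
+programs working and the failure only showing up late in the chain.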
+
+Digging a little further, I found the cast handling in Bootstrappable TinyCC
+was missing some cases that I hadn't backported properly. As I wasn't able to
+understand them very well, I decided to backport the full `gen_cast` function
+from upstream to the Bootstrappable TinyCC. With that, the errors from TinyCC
+were gone.
+
+It feels like an accidental trusting trust attack, yes. This is the kind of
+thing we have to deal with, and these bugs are pretty tiring and frustrating to
+find.
+
+#### The new Bootstrapping chain
+
+So, all of this brings us to the new bootstrapping chain. We need to do
+things very differently from the way Guix does them right now, because we are
+skipping many steps (GCC 2.95, now we need Musl for Binutils...) so I started
+[a project](https://github.com/ekaitz-zarraga/commencement.scm) to track how we
+go forward in the bootstrapping chain (it's just a WIP, for our tests; take
+that into account).
+
+We have good and bad news in that regard. At the time of writing we have
+managed to build up to the GCC 4.6.4 I added RISC-V support to, but the
+compiler is faulty and it's unable to build itself again with C++ support.
+
+I'm using non-bootstrapped versions of `flex` and `bison`, but those
+shouldn't be hard to bootstrap either. I just didn't have the time to build them
+from scratch. And I'm using `bash` instead of `gash` because we found a
+blocking error in `gash` that is not letting us continue forward from Binutils.
+
+In any case, this means we are close to the next milestone: building GCC 4.6.4
+with TinyCC. And, as we described in the previous post, we already built GCC 7.5
+from GCC 4.6.4, so the milestone after that is already solved.
+
+After those, we would need to clean up this new bootstrapping chain and talk
+with Guix about including it there. I hope we can finish all this before hitting
+the deadline that is silently approaching...