Title: Milestone — MesCC builds TinyCC and fun C errors for everyone
Date: 2023-10-30
Category:
Tags: Bootstrapping GCC in RISC-V
Slug: bootstrapGcc8
Lang: en
Summary:
We spent the last months making MesCC able to compile TinyCC and making the
result of that compilation able to compile TinyCC. Many cool problems
appeared; this is a summary of our work.
It's been a while since the latest technical update in the project and I am
fully aware that you were missing it so it's time to recap with a really cool
announcement:
**We finally made a self-hosted Bootstrappable TinyCC in RISC-V**
Most of you probably remember I [already backported](bootstrapGcc6.html) the
Bootstrappable TinyCC compiler, but I didn't test it in a proper environment.
Now, we can confidently say it is able to compile itself, a "large" program
that makes use of more complex C features than I did in the tests.
All this work was done by Andrius Štikonas and myself. Janneke helped us a lot
with Mes related parts, too. The work this time was pretty hard, honestly. Most
of the things we did here are not obvious, even for C programmers.
I'm not used to this kind of quirk of the C language. Most of them are really
specific and related to the standards, and many others are just things that
were missing. I hope the ones I chose to discuss here help you understand your
computing better, as they did for me.
This is going to be a very long post, so take a ToC to help you out:
1. [Context](#context)
1. [Why is this important?](#why-important)
2. [Problems fixed](#problems)
1. [TinyCC misses assembly instructions needed for MesLibC](#tinycc-missing-instructions)
2. [TinyCC's assembly syntax is weird](#tcc-assembly)
3. [TinyCC does not support Extended Asm in RV64](#extended-assembly)
4. [MesLibC `main` function arguments are not set properly](#main-args)
5. [TinyCC says `__global_pointer$` is not a valid symbol](#dollars)
6. [Bootstrappable TinyCC's casting issues](#tcc-casting-issues)
7. [Bootstrappable TinyCC's `long double` support was missing](#long-double)
8. [MesCC struct initialization issues](#mescc-struct-init)
9. [MesCC vs TinyCC size problems](#size-problems)
10. [MesCC add support for signed shift operation](#mes-signed-shift)
11. [MesCC switch/case falls-back to default case](#broken-case)
12. [Bootstrappable TinyCC problems with GOT](#got)
13. [Bootstrappable TinyCC generates wrong assembly in conditionals](#wrong-conditionals)
14. [Support for variable length arguments](#varargs)
15. [MesLibC use `signed char` for `int8_t`](#int8)
16. [MesLibC Implement `setjmp` and `longjmp`](#jmp)
17. [More](#more)
3. [Reproducing what we did](#reproducing)
1. [Using live-bootstrap](#live-bootstrap)
1. [Using Guix](#guix)
4. [Conclusions](#conclusions)
5. [What is next?](#next)
### Context {#context}
You have many blogposts in the series to find some context about the
project, and even a FOSDEM talk about it, but they all give a very broad
explanation, so let's focus on what we are doing right now.
Here we have Mes, a Scheme interpreter, that runs MesCC, a C compiler, that is
compiling our simplified fork of TinyCC, let's call that Bootstrappable TinyCC.
That Bootstrappable TinyCC compiler then tries to compile its own code. It
compiles its own code because its goal is to add more flags in each
compilation, so it has more features in each round[^rounds]. We do all this
because TinyCC is way faster than MesCC and it's also more complex, but MesCC
is only able to build a simple TinyCC with few features enabled.
[^rounds]: There are many rounds. Like 7 or so.
During all this process we use a standard library provided by the Mes project,
we'll call it MesLibC, because we can't build glibc at this point, and TinyCC
does not provide its own C standard library.
With all this well understood, this is the achievement:
**We made MesCC able to compile the Bootstrappable TinyCC, using MesLibC, to an
executable that is able to compile the Bootstrappable TinyCC's codebase to a
binary that works and has all the features we need enabled.**[^self-hosted]
[^self-hosted]: So it can compile itself again and again, but who would want to
do that?
The process affected all the pieces in the system. We added changes in MesCC,
MesLibC and the Bootstrappable TinyCC.
#### Why is this important? {#why-important}
We already talked long about the bootstrapping issue, the trusting trust attack
and all that. I won't repeat that here. What I'll do instead is to be specific.
This step is a big thing because this allows us to go way further in the chain.
All the steps before Mes were already ported to RISC-V mostly thanks to Andrius
Štikonas who worked on [Stage0-POSIX][stage0] and the rest of the glue projects
that are needed to reach Mes.
[stage0]: https://github.com/oriansj/stage0-posix
Mes had been ported to RISC-V (64 bit) by W. J. van der Laan, and some patches
were added on top of it by Andrius Štikonas himself before our current effort
started.
At this moment in time, Mes was unable to build our bootstrappable TinyCC in
RISC-V, the next step in the process, and the bootstrappable TinyCC itself was
unable to build itself either. This was a very limiting point, because TinyCC
is the first "proper" C compiler in the chain.
When I say "proper" I mean fast and fully featured as a C compiler. In x86,
TinyCC is able to compile old versions of GCC. If we manage to port it to
RISC-V we will eventually be able to build GCC with it and with that the world.
In summary, TinyCC is a key step in the bootstrapping chain.
### Problems fixed {#problems}
This work can be easily followed in the commits in my TCC fork's
[`riscv-mes`][tcc] branch, and in my Mes clone's [`riscv-tcc-boot`][mes]
branch. We are also identifying the contents of this blogpost in the git
history by adding the git tag `self-hosted-tcc-rv64` to both of my forks. We
will try to keep both for future reference.
In Mes the process might be a little bit harder to follow because we sent most
of the patches to Janneke and he merged them so when we were about to release
this post I continued from Janneke's branch to avoid divergences (I had some
problems with that before). In any case, the code is there and searching by
authors (Andrius and myself) would guide you to the changes we did.
[tcc]: https://github.com/ekaitz-zarraga/tcc/tree/riscv-mes
[mes]: https://github.com/ekaitz-zarraga/mes/tree/riscv-tcc-boot
Many commits have a long message you can go read there, but this post was born
to summarize the most interesting changes we did, and write them in a more
digestible way. Let's see if I manage to do that.
The following list is not ordered in any particular way, but we hope the
selection of problems we found is interesting for you. We found some more
errors, but these are the ones we consider most relevant.
#### TinyCC misses assembly instructions needed for MesLibC {#tinycc-missing-instructions}
TinyCC is not like GCC: it generates binary code directly, with no assembly
code in between. TinyCC has a separate assembler that doesn't follow the path
that C code follows.
It works the same in all architectures, but we can take RISC-V as an example:
TinyCC has `riscv64-gen.c`, which generates binary code from C, while
`riscv64-asm.c` parses assembly code and also generates binary. As you can
see, binary generation is somewhat duplicated.
In the RISC-V case, the C part had support for mostly everything since my
backport, but the assembler did not support many instructions (which, by the
way are supported by the C part).
MesLibC's `crt1.c` is written in assembly code. Its goal is to prepare the
`main` function and call it. For that it needs the `jalr` instruction and
others that were not supported by TinyCC, neither upstream nor in our
bootstrappable fork.
These changes appear in several commits because I didn't really understand how
the TinyCC assembler worked, and some instructions need to use relocations
which I didn't know how to add. The following commit can show how it feels to
work on this, and shares how relocations are done:
[lla-commit]: https://github.com/ekaitz-zarraga/tcc/commit/1e597f3d239d9119d2ea4bb3ca29b587ea594dcc
There you can see we started to understand things in TinyCC, but some other
changes came after this.
A very important note here is that upstream TinyCC does not have support for
these instructions yet, so we need to patch upstream TinyCC when we use it,
contribute the changes or find another kind of solution. Each solution has its
downsides and upsides, so we need to take a decision about this later.
#### TinyCC's assembly syntax is weird {#tcc-assembly}
Following with the previous fix, TinyCC does not support GNU-Assembler's syntax
in RISC-V. It uses a simplified assembly syntax instead.
Where in GNU Assembler syntax we would write:
``` asm
sd s1, 8(a0)
```
In TinyCC's assembly we have to do:
``` asm
sd a0, s1, 8
```
This requires changes in MesLibC, and it makes us create a separate folder for
TinyCC in MesLibC. See `lib/riscv64-mes-tcc/` and `lib/linux/riscv64-mes-tcc`
for more details.
#### TinyCC does not support Extended Asm in RV64 {#extended-assembly}
Way later in time we also found TinyCC does not support [Extended Asm][ext-asm]
in RV64. The functions that manage that are simply empty.
[ext-asm]: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
We spent some time until we realized what was going on in here for two reasons.
First, there are few cases of Extended Asm in the code we were compiling.
Second, it was failing silently.
Extended Asm is important because it lets you tell the compiler you are going
to touch some registers in the assembly block, so it can protect variables and
apply optimizations properly.
In our case, our assembly blocks were clobbering some variables that would have
been protected by the compiler if the Extended Asm support was implemented.
Andrius found all the places in MesLibC where Extended Asm was used and rewrote
the assembly code to keep variables safe in the cases it was needed.
The other option was to add Extended Asm support for TinyCC, but we would need
to add it in the Bootstrappable TinyCC and also upstream. This also means
understanding TinyCC codebase very well and making the changes without errors,
so we decided to simplify MesLibC, because that is easier to make right. We are
probably going to need to do this later on anyway, but we'll try to delay this
as much as possible.
#### MesLibC `main` function arguments are not set properly {#main-args}
Following the previous problem with assembly, we later found input arguments of
the `main` function, that come from the command line arguments, were not
properly set by our MesLibC. Andrius also took care of that in
[4f4a1174][main-ext] in Mes.
[main-ext]: https://github.com/ekaitz-zarraga/mes/commit/4f4a11745d1c7ed0995e9d31c7994abfb4a60b25
This error was easier to find than others because when we found issues with
this we already had a compiled TinyCC. So we just needed to fix simple things
around it.
#### TinyCC says `__global_pointer$` is not a valid symbol {#dollars}
This is a small issue that was a headache for a while, but it happened to be a
very simple issue.
In RISC-V there's a symbol, `__global_pointer$`, that is used to set up the
global pointer register (`gp`), as described in the ABI. But TinyCC had issues
parsing code around it and it took us some time to realize it was the dollar
sign (`$`) that was causing the issues at this point.
TinyCC does not process dollars in identifiers unless you specifically set a
flag (`-fdollars-in-identifiers`) when running it. In the RISC-V case, that
flag must be always active because if it isn't the `__global_pointer$` can't be
processed.
We tried to set that flag in the command line but we had other issues in the
command line argument parsing (we found and fixed them later) so we just
hardcoded it.
This issue is interesting because it's an extremely simple problem, but its
effect appears in weird ways and it's not always easy to know where the problem
is coming from.
#### Bootstrappable TinyCC's casting issues {#tcc-casting-issues}
This one was a really hard one to fix.
When running our Bootstrappable TinyCC to build MesLibC we found this error:
``` nothing
cannot cast from/to void
```
We managed to isolate a piece of C code that was able to replicate the
problem.[^reproducer]
``` clike
long cast_charp_to_long (char const *i)
{
return (long)i;
}
long cast_int_to_long (int i)
{
return (long)i;
}
long cast_voidp_to_long (void const *i)
{
return (long)i;
}
void main(int argc, char* argv[]){
return;
}
```
Compiling this file raised the same issue, but then I realized I could remove
two of the functions on the top and the error didn't happen. Adding one of
those functions back raised the error again.
I tried to change the order of the functions and the functions I chose to add,
and I could reproduce it: if there were two functions it failed but it could
build with only one.
Andrius found that the function type was not properly set in the RISC-V code
generation and its default value was `void`, so it only failed when it compiled
the second function.
Knowing that, we could take other architectures as a reference to fix this, and
so we did.
See [6fbd1785][tcc-casting-commit].
[tcc-casting-commit]: https://github.com/ekaitz-zarraga/tcc/commit/6fbd17852aa11a2d0bc047183efaca4ff57ab80c
[^reproducer]: This is how we managed to fix most of the problems in our code:
make a small reproducer we can test separately so we can inspect the
process and the result easily.
#### Bootstrappable TinyCC's `long double` support was missing {#long-double}
When I backported the RISC-V support to our Bootstrappable TinyCC I missed the
`long double` support and I didn't realize that because I never tested large
programs with it.
The C standard doesn't define a size for `long double` (it just says it has to
be at least as long as the `double`), but its size is normally set to 16 bytes.
All this is weird in RV64, because it doesn't have 16 byte size registers. It
needs some extra support.
Before we fixed this, the following code:
``` clike
long double f(int a){
return a;
}
```
Failed with:
``` nothing
riscv64-gen.c:449 (`assert(size == 4 || size == 8)`)
```
Because it was only expecting to use `double`s (8 bytes) or `float`s (4 bytes).
In upstream TinyCC there were some commits that added `long double` support
using, and I quote, a *mega hack*, so I just copied that support to our
Bootstrappable TinyCC.
See [a7f3da33456b][tcc-long-double].
[tcc-long-double]: https://github.com/ekaitz-zarraga/tcc/commit/a7f3da33456b4354e0cc79bb1e3f4c665937395b
After this commit, some extra problems appeared with some missing symbols. But
these errors were link-time problems, because TinyCC had the floating point
helper functions needed for RISC-V defined in `lib/lib-arm64.c`, because they
were reusing aarch64 code for them.
After this, we also compile and link `lib-arm64.c` and we have `long double`
support.
#### MesCC struct initialization issues {#mescc-struct-init}
This one was a lot of fun. Our Bootstrappable TinyCC exploded with random
issues: segfaults, weird branch decisions...
After tons of debugging Andrius found some values in `struct`s were not set
properly. As we don't know TinyCC's codebase really well, that was hard to
follow and we couldn't tell where the value was coming from.
Andrius finally realized some `struct`s were not initialized properly. Consider
this example:
``` clike
typedef struct {
int one;
int two;
} Thing;
Thing a = {0};
```
That's supposed to initialize *all* fields in the `Thing` `struct` to `0`,
according to the C standard[^cppref].
As a first solution we set struct fields manually to `0`, to make sure they
were initialized properly. See [29ac0f40a7afb][tinycc-struct-0].
[tinycc-struct-0]: https://github.com/ekaitz-zarraga/tcc/commit/29ac0f40a7afba6a2d055df23a8ee2ee2098529e
After some debugging we found that the fields that were not explicitly set were
initialized to `22`. So I decided to go to MesCC and see if the struct
initialization was broken.
This was my first dive in MesCC's code, and I have to say it's really easy to
follow. It took me some time to read through it because I'm not that used to
`match`, but I managed to find the struct initialization code.
What I found in MesCC was a `22` hardcoded in the struct initialization code,
probably coming from some debug code that was never removed. As no part of the
x86 bootstrapping used that kind of initialization, or nothing relied on it,
the error went unnoticed.
I set that to `0`, as it should be, and continued with our lives.
[^cppref]: You can see an explanation in the (1) case at
[cppreference.com](https://en.cppreference.com/w/c/language/struct_initialization)
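To make the rule concrete, here is a minimal sketch in C. `Thing` is the type from the example above; the function name `second_field_after_partial_init` is made up for this post and is not part of MesCC or TinyCC. On a conforming compiler the member that is not named in the initializer must come out as `0`, while the MesCC we started with would have left `22` there.

```c
typedef struct {
    int one;
    int two;
} Thing;

/* Hypothetical helper: initializes only the first member explicitly and
 * reports what ended up in the second one. */
static int second_field_after_partial_init(void)
{
    Thing t = {7};   /* `one` is 7, `two` is not named */
    return t.two;    /* the standard requires this to be 0 */
}

/* Same check for the `{0}` idiom used in the example above. */
static int sum_after_zero_init(void)
{
    Thing t = {0};
    return t.one + t.two;   /* both members must be 0 */
}
```

A test like this is small enough to be compiled with each compiler in the chain, which is how we like to hunt this kind of bug.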
#### MesCC vs TinyCC size problems {#size-problems}
The C standard does not set an exact size for integers. It only sets minimum
widths (8 bits for `char`, 16 for `short` and `int`, 32 for `long`) and relative
sizes: `short` has to be shorter or equal to `int`, `int` has to be shorter or
equal to a `long`, and so on. If your platform wants, `int` can have 16 bits,
or 64, and that's ok for the C standard.
TinyCC's RISC-V backend was written under the assumption that `int` is 32 bits
wide. You can see this happening in `riscv64-gen.c`, for example, here:
``` clike
EI(0x13, 0, rr, rr, (int)pi << 20 >> 20); // addi RR, RR, lo(up(fc))
```
The bit shifting there is done to keep only the lower 12 bits of the `pi`
variable, sign-extended: with a 32 bit `int`, the left shift drops the upper 20
bits and the right shift brings the lower 12 back to their place.
This code's behavior might be different from one platform to another. Taking
the example before, on a platform with 16 bit integers shifting by 20 bits is
not even defined, so this code would not give us the lower 12 bits of `pi`.
In our case, we had MesCC using the whole register width, 64 bits, for temporary
values so the lowest `44` bits were left and the next assertion, which checked
that the immediate fit in 12 bits, didn't pass.
This is a huge problem, as most of the code in the RISC-V generation is written
using this style.
There are other ways to do the same thing (`pi & 0xFFF` maybe?) in a more
portable way, but we don't know why upstream TinyCC decided to do it this way.
Probably they did because GCC (and TinyCC itself) use 32 bit integers, but they
didn't handle other possible cases, like the one we had here with MesCC.
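The effect of the integer width can be sketched in plain C. This is only a model, not TinyCC's or MesCC's code: the three function names are invented here, fixed-width types stand in for "a 32 bit `int`" and "a 64 bit temporary", and the signed right shifts rely on the arithmetic shift that mainstream compilers implement. Note that a bare `pi & 0xFFF` would not sign-extend, so the width-independent variant adds that step explicitly.

```c
#include <stdint.h>

/* Model of `(int)pi << 20 >> 20` when `int` is 32 bits wide:
 * the low 12 bits survive and get sign-extended. */
static int64_t low12_with_32bit_int(uint64_t pi)
{
    int32_t v = (int32_t)((uint32_t)pi << 20);
    return v >> 20;
}

/* Model of the same shifts done on a 64 bit temporary:
 * the low 44 bits survive instead of the low 12. */
static int64_t low12_with_64bit_temp(uint64_t pi)
{
    int64_t v = (int64_t)(pi << 20);
    return v >> 20;
}

/* Width-independent alternative: mask the low 12 bits, then sign-extend
 * from bit 11 with the xor/subtract trick. */
static int64_t low12_portable(uint64_t pi)
{
    int64_t low = (int64_t)(pi & 0xFFF);
    return (low ^ 0x800) - 0x800;
}
```

With `pi = 0x12345` the first and third functions give `0x345`, while the 64 bit model gives back `0x12345`, which is far too big for a 12 bit immediate.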
In any case, this made us rethink MesCC, dig into how its integers are
defined, how to change this to be compatible with TinyCC and so on, but I
finally decided to add casts in the middle to make sure all this was compiled
as expected.
It was a good reason to make us re-think MesCC's integers, but it took a very
long time to deal with this, time that could have been better used in something
else. Now we have all become paranoid about integers and we still think some
extra errors will arise from them in the future. Integers are hard.
#### MesCC add support for signed shifting {#mes-signed-shift}
Integers were in our minds for long, as described in the previous block, but I
didn't talk about signedness in that one.
Following one of the crazy errors we had in TinyCC, I somehow realized (I don't
remember how!) that we were missing signed shifting support in MesCC. I think
that I found this while doing some research of the code MesCC was outputting
when I spotted some bit shifts done using unsigned instructions for signed
values and I started digging in MesCC to find out why. I finally realized that
there was no support for that and the shift operation wasn't selected
depending on the signedness of the value being shifted.
Let's see this with an example:
``` clike
signed char a = 0xF0;
unsigned char b = 0xF0;
// What is this? (Answer: 0xFF, which is -1 as a signed value)
a >> 4;
// And this? (Answer: 0x0F => 15)
b >> 4;
```
In the example you can see the shifting operation works differently depending
on whether the value is signed or not. If you always use the unsigned version
of the `>>` operation, you don't get the results you expected. Signs are also
hard.
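A small sketch of the difference, using fixed-width types so the result does not depend on the platform's `char`. The function names are invented for this post; the third one is only a byte-sized model of what happens when the unsigned (logical) shift is applied to a signed value, not MesCC's actual output. The signed case assumes the arithmetic right shift that mainstream compilers use for negative values.

```c
#include <stdint.h>

/* Arithmetic shift: the sign bit is copied in from the left. */
static int shift_signed(int8_t v)
{
    return v >> 4;
}

/* Logical shift: zeros come in from the left. */
static int shift_unsigned(uint8_t v)
{
    return v >> 4;
}

/* Model of the bug: the bits of a signed value are shifted with the
 * unsigned operation, so the sign is lost. */
static int shift_signed_with_logical(int8_t v)
{
    return (uint8_t)v >> 4;
}
```

For the bit pattern `0xF0` the first function returns `-1`, while the other two return `15`.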
In this case, like in many others, the fix was easier than realizing what was
going wrong. I just added support for the signed shifting operation, not only
for RISC-V but for all architectures, and I added the correct signedness check
to the shifting operation to select the correct instruction. The patch (see
[88f24ea8][signed-rotation] in Mes) is very clean and easy to read, because
MesCC's codebase is really well ordered.
> EDIT: Some person in the web noted I called the *bit-shift* operations
> *rotation* operations. I normally use both words interchangeably but it is
> true they don't mean the exact same thing. A shift is when the values are
> lost, and a rotation when they come from the other side of the register. I
> edited the article to use the correct word.
[signed-rotation]: https://github.com/ekaitz-zarraga/mes/commit/88f24ea8661dd279c2a919f8fbd5f601bb2509ae
#### MesCC switch/case falls-back to default case {#broken-case}
In the early bootstrap runs, our Bootstrappable TinyCC did weird things.
After many debugging sessions we realized the `switch` statements in
`riscv64-gen.c`, more specifically in `gen_opil`, were broken. The fall-backs
in the `switch` were automatically directed to the `default` case. Weird!
MesCC has many tests, so I read all of those related to `switch` statements.
The ones that handled fall-backs were all falling back to the `default`
case, so our weird behavior wasn't tested.
I added tests for our case and was reading the disassembly of simple examples
when I realized what the problem was.
Each of the `case` blocks has two parts: the clause that checks if the value
of the expression is the one of the case, and the body of the case itself.
The `switch` statement generation was doing some magic to deal with `case`
blocks, but it was failing to deal with complex fall-through schemes because
the clause of the target `case` block was always run, making the code fall to
the `default` case, as the clause was always false because the one that matched
was the one that made the fall-back.
There were some problems to fix this, as NyaCC (MesCC's C parser) returns
`case` blocks as nested when they don't have a `break` statement:
``` lisp
(case testA
(case testB
(case testC BODY)))
```
Instead of doing this, I decided to flatten the `case` blocks with empty
bodies. This way we can deal with the structure in a simpler way.
``` lisp
((case testA (expr-stmt))
(case testB (expr-stmt))
(case testC BODY))
```
Once this is done, I expanded each `case` block to a jump that jumps over the
clause, the clause and then its body. Doing this, the fall-back doesn't
re-evaluate the clause, as it doesn't need to. The generated code looks like
this in pseudocode:
``` assembly
;; This doesn't have the jump because it's the first
CASE1:
testA
CASE1_BODY:
...
goto CASE2_BODY
CASE2:
testB
CASE2_BODY:
...
goto CASE3_BODY
CASE3:
testC
CASE3_BODY:
...
```
If one of the `case`s has a `break`, it's treated as part of its body, and it
will end the execution of the `switch` statement normally, no fall-back.
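The scheme can be written by hand in C with `goto`, which makes it easy to compare against a real `switch`. This is a sketch of the idea only: the function names and the toy bodies are made up, and `lowered_broken` is my reading of the old behavior (the fall-through lands on the next clause instead of the next body), not a copy of what MesCC generated.

```c
/* Reference: an ordinary switch with fall-through. */
static int with_switch(int x)
{
    int acc = 0;
    switch (x) {
    case 1: acc += 1;    /* falls through */
    case 2: acc += 10;   /* falls through */
    case 3: acc += 100; break;
    default: acc = -1;
    }
    return acc;
}

/* Model of the bug: after a body, execution reaches the next clause,
 * which fails and sends us to the default case. */
static int lowered_broken(int x)
{
    int acc = 0;
    if (x != 1) goto case2;
    acc += 1;
case2:
    if (x != 2) goto case3;
    acc += 10;
case3:
    if (x != 3) goto dflt;
    acc += 100;
    goto end;
dflt:
    acc = -1;
end:
    return acc;
}

/* Model of the fix: each body ends with a jump over the next clause. */
static int lowered_fixed(int x)
{
    int acc = 0;
    if (x != 1) goto case2;
    acc += 1;
    goto case2_body;
case2:
    if (x != 2) goto case3;
case2_body:
    acc += 10;
    goto case3_body;
case3:
    if (x != 3) goto dflt;
case3_body:
    acc += 100;
    goto end;            /* this is the `break` */
dflt:
    acc = -1;
end:
    return acc;
}
```

For `x = 1` the real `switch` and `lowered_fixed` both return `111`, while `lowered_broken` ends up in the `default` case and returns `-1`.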
This results in a simpler `case` block control. The previous approach dealt
with nested `case` blocks and tried to be clever about them, but
unsuccessfully. The best thing about this commit is most of the cleverness was
simply removed with a simple solution (flatten all the things!).
It wasn't that easy to implement, but I first built a simple prototype and
Janneke's scheme magic made my approach usable in production.
All this is added in Mes's codebase in several commits, as we needed some
iterations to make it right. [22cbf823582][cases] has the base of this change,
but there were some more iterations in Mes.
[cases]: https://github.com/ekaitz-zarraga/mes/commit/22cbf823582e3699b6a21ee0cf74c2dbf0a6a4e9
#### Bootstrappable TinyCC problems with GOT {#got}
The Global Offset Table is a table that helps with relocatable binaries. Our
Bootstrappable TinyCC segfaulted because it was generating an empty GOT.
Andrius debugged upstream TinyCC alongside ours and realized there was a
missing check in an `if` statement. He fixed it in
[f636cf3d4839d1ca][got-commit].
The problem with this kind of error is that TinyCC's codebase is really hard
to read. It's a very small compiler but it's not obvious to see how things are
done in it, so we had to spend many hours in debugging sessions that went
nowhere. If we had a compiler that was easier to read and change, this would
have been way simpler to fix and we would have had a better experience with it.
[got-commit]: https://github.com/ekaitz-zarraga/tcc/commit/f636cf3d4839d1ca3f5af9c0ad9aef43a4bfccd9
#### Bootstrappable TinyCC generates wrong assembly in conditionals {#wrong-conditionals}
We spent a long time debugging a bug I introduced during the backport when I
tried to undo some optimization upstream TinyCC applied to comparison
operations.
Consider the following code:
``` clike
if ( x < 8 )
whatever();
else
whatever_else();
```
Our Bootstrappable TinyCC was unable to compile this code correctly. Instead,
it output code that always took the same branch, regardless of the value
in `x`.
In TinyCC, a conditional like `if (x < CONSTANT)` has a special treatment, and
it's converted to something like this pseudoassembly:
``` pseudo
load x to a0
load CONSTANT to a1
set a0 if less than a1
branch if a0 not equal 0 ; Meaning it's `set`
```
This behaviour uses the `a0` register as a flag, emulating what other CPUs
use for comparisons. RISC-V doesn't need that, but it's still done here
probably for compatibility with other architectures. In RISC-V it could look
like this:
``` pseudo
load x to a0
load CONSTANT to a1
branch if a0 less than a1
```
You can easily see the `branch` "instruction" does a different comparison in
one case versus the other. In the first one it checks if `a0` is set,
and in the second it checks if `a0` is smaller than `a1`.
TinyCC handles this case in a very clever way (maybe too clever?). When they
emit the `set a0 if less than a1` instruction they replace the current
comparison operation with `not equal` and they remove the `CONSTANT` and
replace it with a `0`. That way, when the `branch` instruction is generated,
they insert the correct clause.
In my code I forgot to replace the comparison operator so the branch checked
`if a0 is less than 0` and it was always false, as the `set` operation writes
a `0` or a `1` and none of them is less than `0`.
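The bug is easy to model in C. The functions below are invented for this post and only mimic the two steps of the generated code: a "set if less than" that leaves `0` or `1` in a register, followed by a branch that compares that register against `0`.

```c
/* Model of the `set a0 if less than a1` step: the result is 0 or 1. */
static long set_if_less(long a, long b)
{
    return a < b ? 1 : 0;
}

/* What should happen: the comparison becomes `not equal` against 0. */
static int takes_branch_fixed(long x, long constant)
{
    long a0 = set_if_less(x, constant);
    return a0 != 0;
}

/* What my backport did: the constant was replaced with 0 but the
 * comparison stayed `less than`, which can never be true for 0 or 1. */
static int takes_branch_buggy(long x, long constant)
{
    long a0 = set_if_less(x, constant);
    return a0 < 0;
}
```

`takes_branch_fixed(3, 8)` is true and `takes_branch_fixed(9, 8)` is false, but `takes_branch_buggy` is false for both.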
The commit [5a0ef8d0628f719][branch-tcc] explains this in a more technical way,
using actual RISC-V instructions.
This was also hard to fix, because TinyCC's variable names (`vtop->c.i`) are
really weird and they are used for many different purposes.
[branch-tcc]: https://github.com/ekaitz-zarraga/tcc/commit/5a0ef8d0628f719ebb01c952797a86a14051228c
#### Support for variable length arguments {#varargs}
In C you can define functions with a variable number of arguments. In RISC-V,
those arguments are passed in registers, while in some other architectures
they are passed on the stack. This means the RISC-V case is a little bit more
complex to deal with, and needs special treatment.
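For those who have never written one, this is what a function with variable arguments looks like from the C side. `sum_ints` is a made-up example; the `va_start`, `va_arg` and `va_end` macros from `<stdarg.h>` are the pieces that the compiler and the C library have to get right for each architecture.

```c
#include <stdarg.h>

/* Adds up `count` integers passed after the first argument. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);              /* start after the last named argument */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);     /* fetch the next argument as an int */
    va_end(ap);

    return total;
}
```

A call like `sum_ints(3, 1, 2, 3)` returns `6`. `printf` works in the same way, using the format string to know how many arguments to fetch and of which type.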
Andrius realized that in our Bootstrappable TinyCC we had issues with variable
length arguments, especially in the most famous function that uses them:
`printf`. He also found that the problem came from the arguments not being
properly set.
Reading upstream TinyCC we found they use a really weird system for the defines
that deal with this. They have a header file, `include/tccdefs.h`, which is
included in the codebase, but also processed by a tool that generates strings
that are later injected at execution time in TinyCC.
This was too much for us so we just extracted the simplest variable arguments
definitions for RISC-V and introduced that in MesLibC and our Bootstrappable
TinyCC.
##### Extra: files generated with no permissions
The bootstrappable TinyCC built using MesCC generated files with no permissions
and Andrius found that this problem came from the variable length argument
support definitions. So he fixed that, too[^stikonas].
The macro that defined `va_start` had broken pointer arithmetic. At the
beginning he thought it was related with MesCC's internals but he tested in GCC
later and realized the problem was in the macro definition. That's why
currently the commit says "workaround" in the name, but it's more than a
workaround: it's a proper fix. We are rewording that, but that would happen
after we release this post.
[^stikonas]: He is like that.
#### MesLibC use `signed char` for `int8_t` {#int8}
We already had a running Bootstrappable TinyCC compiled using MesCC when we
stumbled upon this issue. Somehow, when assembling:
``` asm
addi a0, a0, 9
```
The code was trying to read `9` as a register name, and failed to do it (of
course). It was weird to realize that the following code (in `riscv64-asm.c`)
was always using the true branch in the `if` statement, even if
`asm_parse_regvar` returned `-1`:
``` clike
int8_t reg;
...
if ((reg = asm_parse_regvar(tok)) != -1) {
...
} else ...
```
I disassembled and saw something like this:
``` pseudoassembly
call asm_parse_regvar ;; Returns value in a0
reg = a0
a0 = a0 + 1
branch if a0 equals 0
```
This looks ok, it does some magic with the `-1` but it makes sense anyway. The
problem is that it didn't branch because `a0` was `256` even when
`asm_parse_regvar` returned `-1`.
During some of the `int` related problems someone told me in the Fediverse that
`char`'s default signedness is not defined in the C standard. I read MesLibC
and, exactly: `int8_t` was defined as an alias to `char`.
In RISC-V `char` is by default `unsigned` (don't ask me why) but we are used to
x86 where it's `signed` by default. Only saying `char` is not portable.
Replacing:
``` clike
typedef char int8_t;
```
With:
``` clike
typedef signed char int8_t;
```
Fixed the issue.
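The whole problem fits in a few lines of C. In this sketch `parse_fails` is a hypothetical stand-in for an `asm_parse_regvar` call that returns `-1`, and the explicit `unsigned char` and `signed char` types show what a plain `char` becomes on each kind of platform.

```c
/* Hypothetical stand-in for a parser that reports failure with -1. */
static int parse_fails(void)
{
    return -1;
}

/* `reg` holds 255 after the assignment, so the comparison with -1 is
 * true and the "this is a register" path is taken. */
static int takes_true_branch_unsigned(void)
{
    unsigned char reg;
    return (reg = parse_fails()) != -1;
}

/* `reg` holds -1 after the assignment, so the comparison is false. */
static int takes_true_branch_signed(void)
{
    signed char reg;
    return (reg = parse_fails()) != -1;
}
```

The first function returns `1` and the second returns `0`, which is exactly the difference between the old and the new `int8_t` definition.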
From this you can learn several things:
1. Don't assume `char`'s signedness in C
2. If you design a programming language, be consistent with your decisions. In
C `int` is always `signed int`, but `char`s don't act like that. Don't do
this.
#### MesLibC Implement `setjmp` and `longjmp` {#jmp}
Those that are not that versed in C, as I was before we found this issue, won't
know about `setjmp` and `longjmp` but they are, simplifying a lot, like a
`goto` you can use in any part of the code. `setjmp` needs a buffer and it
stores the state of the program on it, `longjmp` sets the status of the program
to the values on the buffer, so it jumps to the position stored in `setjmp`.
Both functions are part of the C standard library and they need specific
support for each architecture because they need to know which registers are
considered part of the state of the program. They need to know how to store the
program counter, the return address, and so on, and how to restore them.
In their simplest form they are a set of stores in the case of the `setjmp` and
a set of loads in the case of `longjmp`.
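From the user's side they look like this. It's a minimal sketch with made-up names (`checkpoint`, `fail_deep_inside`, `run_with_recovery`), and it only shows how the two functions are used, not how MesLibC implements them.

```c
#include <setjmp.h>

static jmp_buf checkpoint;   /* buffer where setjmp stores the state */
static int error_code;       /* carries information across the jump */

/* Jumps back to wherever setjmp(checkpoint) was called. */
static void fail_deep_inside(int code)
{
    error_code = code;
    longjmp(checkpoint, 1);
}

/* setjmp returns 0 when called directly and a nonzero value when we
 * arrive here through longjmp. */
static int run_with_recovery(int code)
{
    if (setjmp(checkpoint) == 0) {
        fail_deep_inside(code);
        return -1;           /* never reached */
    }
    return error_code;
}
```

`run_with_recovery(7)` returns `7`: the call to `fail_deep_inside` never returns normally, execution continues from the `setjmp` instead.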
In RISC-V they only need to store the `s*` registers, as they are the ones that
are not treated as temporary. It's simple, but it needs to be done, and it
wasn't done for RISC-V in MesLibC, not even for GCC.
Andrius is not convinced by our commit here, and I agree with his
concerns. We added the full `setjmp` and `longjmp` implementations directly
~~stolen from~~ inspired by the ones in Musl[^stolen] but they also have
floating point register support, using instructions that are not implemented in
TinyCC yet. This is going to be a problem in the future because later
iterations will try to execute instructions they don't actually understand.
There are two (or three) possible solutions here. The first is to remove the
floating point instructions for now (another flavor for this solution is to
hide them under an `#ifdef`). The second is to implement the floating point
instructions in TinyCC's RISC-V assembler, which sounds great but forces us to
upstream the changes, and that process may take long and we'd need to patch it
in our bootstrapping scripts until it happens.
We just added the `#ifdef`s because our code is full of them anyway and sent it
to Mes: [0e2c5569][setjmp].
[setjmp]: https://github.com/ekaitz-zarraga/mes/commit/0e2c55697df285250c8a24442f169bc52d729c31
[^stolen]: Yo, if it's free software it's not stealing! Please steal my code.
Make it better.
#### More {#more}
Those are mostly the coolest errors we needed to deal with but we stumbled upon
a lot more errors.
Before this effort started Andrius added support for 64-bit instructions in Mes
and fixed some issues 64-bit architectures had in M2.
I found a [bug in Guix shell](https://issues.guix.gnu.org/65225) (it's still
open) and had to fix some ELF headers in MesCC generated files because objdump
and gdb refused to work on them.
Andrius also found issues with weak symbols in MesLibC that were triggered
because TCC didn't have support for them. Thankfully, upstream TCC had that
issue fixed and we just cherry-picked the fix for the win.
He even had the energy to test all this on real RISC-V hardware we specifically
acquired for this task.
There are many more things to tell, but this is already getting too long and if
I continue writing we'll probably end up fixing some more stuff.
In the end, a project like this is like hitting your head against a wall until
one of them breaks. Sometimes it feels like the head did, but it's all good.
#### Reproducing what we did {#reproducing}
All we did means nothing if you can't reproduce it. We provide two ways to
reproduce this process: live-bootstrap and Guix.
Both provide something similar, but there are some high-level differences
that are worth mentioning now.
Compared with `live-bootstrap`, using Guix helps because it reuses the
previous steps if they didn't change. This results in shorter waits once Mes is
sorted out.
On the other hand, I've had issues with the failed builds in Guix (in
emulated systems). It was hard to jump inside the build container and play
around inside so the development cycle suffered a lot. In `live-bootstrap`, if
you are good with `bwrap` you can jump and tweak things with no issues.
For those who enjoy digging in the code and trying to follow the process I
recommend following `live-bootstrap`'s scripts. The directory structure is a
little bit confusing but the scripts are very plain and linear. The ones in the
Guix process come from previous bootstrap efforts and they are designed to do
many things automagically, which makes them hard to follow.
##### Using live-bootstrap {#live-bootstrap}
Andrius is part of the `live-bootstrap` effort and he's doing all the scripting
there to keep the process reproducible.
[Live-bootstrap](https://github.com/fosslinux/live-bootstrap) is...
> An attempt to provide a reproducible, automatic, complete end-to-end
> bootstrap from a minimal number of binary seeds to a supported fully
> functioning operating system.
That's the official description of the project. From a more practical
perspective, it's a set of scripts that build the whole operating system from
scratch, depending on just a few binary seeds.
That's not very different to what Guix provides from a bootstrapping
perspective. Guix is "just" an environment where you can run "scripts" (the
packages define how they are built) in a reproducible way. Of course, Guix is
way more than that, but if we focus on what we are doing right now it acts like
the exact same thing.
> NOTE: `live-bootstrap`'s project description is a little bit outdated. If you
> read its comparison with Guix, what you'll read is old information. If you
> want more up-to-date information about Guix's bootstrapping process,
> I suggest you read this page of the Guix manual:
>
Even though they are very different projects, on a practical level the main
difference between them is that `live-bootstrap` is probably easier for you to
test if you are working on any GNU/Linux distribution[^in-guix].
[^in-guix]: If you run it in Guix or in a distribution that doesn't follow FHS
you'd probably need to touch the path of your Qemu installation or be
careful with the options you send to the `rootfs.py` script.
If you want to reproduce this exact point in time you only need to use my fork
of [live-bootstrap](https://github.com/ekaitz-zarraga/live-bootstrap/), branch
`riscv-tcc-boot`. I also made a tag on it, `self-hosted-tcc-rv64`, to make it
easier to remember when this post was released. Andrius made all the magic to
set that process to take all the inputs from Mes and TinyCC from the correct
tag.
Clone the repository, set up the dependencies and run this (if you are not in a
RISC-V host you need to configure Qemu and binfmt):
``` bash
./rootfs.py --bwrap --arch riscv64 --preserve
```
That should, after a long time, reach a point where there's a properly compiled
bootstrappable TinyCC.
##### Using Guix for a reproducible environment {#guix}
I made a Guix recipe that can replicate the whole process, too. It took me a
long time to make it work but it finally does.
Reproducing this from my TCC fork should be easy for people versed in Guix.
There's a `guix` folder with some files (most of them broken, not gonna lie),
but there are two you should pay attention to:
- `channels.scm` stores the state of my Guix checkout so you can reproduce it
in the future using `guix time-machine`. At the moment it doesn't feel
necessary but if something fails when you try it, please refer to that.
- `commencement.scm` is an edited copy of the Guix bootstrapping process,
directly obtained from `gnu/packages/commencement.scm` from Guix's codebase.
I patched this to make it work for RISC-V, using some more modern commits in
the dependencies.
In order to reproduce all our work in Guix you just need to build the `tcc-boot0`
package from the `commencement.scm` file using `riscv64-linux` as your
`--system`. I'm a nice guy so I just added a command there you can use for
this, just run:
``` bash
./tcc-boot0-from-source.sh
```
And that should build the whole thing. It takes hours, you have been warned.
Also it adds `--no-grafts` (thanks Efraim), because if you keep the grafts it
compiles the world from scratch (curl, x11... not good).
If you just want to build `mes-boot` as an intermediate step, I also made a
file for that:
``` bash
./mes-boot-from-source.sh
```
Both scripts will load variables from the `commencement.scm` module
provided. The module is not complex if you are used to Guix, but it calls
some complex shell scripts in both Mes and TinyCC to build. Those contain all
the magic.
### Conclusions {#conclusions}
Of course, the problems we fixed now look easy and simple to fix. This blog
post doesn't really do justice to the countless debugging hours and all the
nights we, Andrius and I, spent thinking about where the issues could be
coming from.
The debugging setup wasn't as good as you might imagine. The early steps of the
bootstrap don't have all the debug symbols as a "normal" userspace program
would. In many cases, function names were all we had.
I have to thank my colleague Andrius here because he did a really good debugging
job, and he provided me with small reproducers that I could finally fix. Most
of the time he made the assist and I scored the goal.
He also did a great job with the testing which I couldn't do because I was
struggling with Guix from the early days, trying to make the compilers find the
header files and libraries.
On the emotional side it is also a great improvement to have someone to rely
on. Andrius, Janneke and I had good teamwork and we supported each other when
our faith started to crumble. And believe me, it does crumble when a new bug
appears after you fixed one that you needed a week for. There were times this
summer I thought we would never reach this point.
It's also worth mentioning here that the bootstrapping process is extremely slow:
it takes hours. This kills the responsiveness and makes testing way harder than
it should be. Not to mention that we are working on a foreign architecture,
which has its own problems too.
If you have to take some lessons from something like this, here you have a
list of suggestions:
- The simplest error can take ages to debug if your code is crazy enough.
- Don't be clever. It sets a very high standard for your future self and people
who will read your code in the future.
- I guess we can summarize the previous two points in one: If we could remove
TinyCC from the chain, we would. It's a source of errors and it's hard to
debug. The codebase is really hard to read for no apparent reason.
- When build times are long, small reproducers help.
- Add tests for each new case you find.
- Don't trust, disassemble and debug.
- Be careful with C and standards and undefined behavior.
- Integers are hard. Signedness makes them harder.
- Being surrounded by the correct people makes your life easier.
Also, as a personal note, I noticed I'm a better programmer since the previous
post in this series. I feel way more comfortable with complex reasoning and
even writing new programs in other languages, even if I spent almost no time
coding anything from scratch. It's like dealing with this kind of issue in
the internals gives you some level of awareness that is useful in a more general
way than it looks. Crazy stuff.
If you can, try to play with the internals of things from time to time. It
helps. At least it helped me.
### What is next? {#next}
Now that we have a fully featured Bootstrappable TinyCC we need to decide what
to do next.
In the short term, all this has to be released in the original projects: Mes,
M2, and so on. That's the easy part, as everything has proved to be ready.
In the mid term, it's not very clear what to do first. We suspect we'll need
upstream TinyCC for the next steps, because we need many different tools to
continue with the bootstrapping chain, and the bootstrappable TinyCC might not
be enough to build them. On the other hand, when we go for a standard library
we'll miss the extended assembly support we already mentioned. There's some
uncertainty in the next step.
The long term is pretty much clear though: the goal is GCC. First GCC for C and
then for C++, to make it able to build GCC 7.5, which should enable the rest of the
chain pretty easily (famous last words). I anticipate we are going to have
problems with GCC (I know this because I left them there last time) so we'll
need to fix those, too. Once that is done, we would use GCC to compile more
recent versions of GCC until we compile the world.
That's more or less the description of what we will do in the next months.
And this is pretty much it. I hope you learned something new about C, the
Bootstrapping process or at least had a good time reading this wall of text.
We'll try to work less for the next one, but we can't promise that. 😉
Take care.
---