From 57ebc576bf74ca887f9c77cb6b7771b9ff2b1843 Mon Sep 17 00:00:00 2001 From: Ekaitz Zarraga Date: Mon, 14 Mar 2022 23:52:34 +0100 Subject: NlNet project posts 0-2 --- content/bootstrapGcc/00_intro.md | 133 ++++++++ content/bootstrapGcc/01_internals.md | 621 +++++++++++++++++++++++++++++++++++ content/bootstrapGcc/02_elf.md | 595 +++++++++++++++++++++++++++++++++ content/bootstrapGcc/NGIAssure.svg | 1 + content/bootstrapGcc/nlnet.svg | 34 ++ 5 files changed, 1384 insertions(+) create mode 100644 content/bootstrapGcc/00_intro.md create mode 100644 content/bootstrapGcc/01_internals.md create mode 100644 content/bootstrapGcc/02_elf.md create mode 100644 content/bootstrapGcc/NGIAssure.svg create mode 100644 content/bootstrapGcc/nlnet.svg diff --git a/content/bootstrapGcc/00_intro.md b/content/bootstrapGcc/00_intro.md new file mode 100644 index 0000000..e3570ad --- /dev/null +++ b/content/bootstrapGcc/00_intro.md @@ -0,0 +1,133 @@ +Title: Intro to GCC bootstrap in RISC-V +Date: 2022-02-14 +Category: +Tags: Bootstrapping GCC in RISC-V +Slug: bootstrapGcc0 +Lang: en +Summary: + Introduction to my new adventure bootstrapping GCC for RISC-V. Why, how, + and who is going to pay for it. + +You probably already know about how I spent more than a year having fun with +RISC-V and software bootstrapping from source. + +As some may know from my [FOSDEM talk][fosdem22], [NLNet / NGI-Assure put the +funds][nlnet] to make me spend more time on this for this year and I decided +to work on GCC's bootstrapping process for RISC-V. + +[nlnet]: https://nlnet.nl/project/GNUMes-RISCV/ +[fosdem22]: https://fosdem.org/2022/schedule/event/riscvadventures/ + +### Why GCC + +GCC is probably the most used compiler collection, period. With GCC we can +compile the world and have a proper distribution directly from source, but who +compiles the compiler?[^1] + +[^1]: *wHo wATcHes tHE wAtchMEN?* + +Well, someone has to. + +### The bootstrap + +Bootstrapping a compiler with a long history like GCC for a new architecture +like RISC-V involves some complications, starting on the fact that the first +version of GCC that supports RISC-V needs a C++98 capable compiler in order to +build. C++98 is a really complex standard, so there's no way we can bootstrap a +C++98 compiler at the moment for RISC-V. The easiest way we can think of at +this point is to use an older version of GCC for that, one of those that are +able to build C++98 programs but they only require a C compiler to build. Older +versions of GCC, of course, don't have RISC-V support so... We need a +*backport*[^2]. + +[^2]: Insert "Back to the Future" music here. + +So that's what I'm doing right now. I'm taking an old version of GCC that only +depends on C89 and is able to compile C++98 code and I'm porting it to RISC-V +so we can build newer GCCs with it. + +Only needing C to compile it's a huge improvement because there are *Tiny C +Compilers* out there that can compile C to RISC-V, and those are written using +simple C that we can bootstrap with simpler tools of a more civilized world. + +In summary: + +- C++98 is too complex, but C89 is fine. +- GCC is the problem and also the solution. + +### What about GNU Mes? + +When *we*[^3] started with this effort we wanted to prepare GNU Mes, a small C +compiler that is able to compile a *Tiny C Compiler*, to work with RISC-V so we +could start to work in this bootstrap process from the bottom. + +[^3]: "*We*" means I shared my thoughts and plans with other people who have a + much better understanding of this than myself. + +Some random events, like someone else working on that part, made us rethink our +strategy so we decided to start from the top and try to combine both efforts at +the end. We share the same goal: full source bootstrap for RISC-V. + +### Tiny C Compilers? + +There are many small C compilers out there that are written in simple C and are +able to compile an old GCC that is written in C. Our favorite is TinyCC (Tiny C +Compiler). + +[^4]: But there are some others that are really interesting (see +[cproc](https://sr.ht/~mcf/cproc/), for example) + +GNU Mes is able to build a patched version of TinyCC, which already supports +RISC-V (RV64 only), and we can use that TinyCC to compile the GCC version I'm +backporting. + +We'd probably need to patch some things in both projects to make everything +work smoothly but that's also included in the project plan. + +### Binutils + +Binutils is also a problem mostly because GCC, as we will talk about in the +future, does not compile to binary directly. GCC generates assembly code and +coordinates calls to `as` and `ld` (the GNU Assembler and Linker) to generate +the final binaries. Thankfully, TinyCC can act as an assembler and a linker, +and there's also the chance to compile a modern binutils version because it is +written in C. + +In any case, the binary file generation and support must be taken in account, +because GCC is not the only actor in this film and RISC-V has some weird things +on the assembly and the binaries that have to be supported correctly. + +### Conclusion + +This is a very interesting project, where I need to dig in **BIG** stuff, which +is cool, but also has a huge level of uncertainty, which scares the hell out of +me. I hope everything goes well... + +In any case, I'll share all I learn here in the blog and I keep you all posted +with the news we have. + +That's all for this time. If you have any question or comment or want to share +your thoughts and feelings with me[^5] you can find my +[contact information here](https://ekaitz.elenq.tech/pages/about.html). + +[^5]: Or even hire me for some freelance IT stuff 🤓 + +--- + +> PS: Big up to NlNet / NGI-Assure for the money. + + +
+ + +
diff --git a/content/bootstrapGcc/01_internals.md b/content/bootstrapGcc/01_internals.md new file mode 100644 index 0000000..a6009e1 --- /dev/null +++ b/content/bootstrapGcc/01_internals.md @@ -0,0 +1,621 @@ +Title: GCC internals — From a porting perspective +Date: 2022-03-08 +Category: +Tags: Bootstrapping GCC in RISC-V +Slug: bootstrapGcc1 +Lang: en +Summary: + Deep diving into GCC's internals from the perspective of someone who + wants to port GCC for a new architecture. + + +In the [previous post]({filename}00_intro.md) of the +[series]({tag}Bootstrapping GCC in RISC-V) the problem of the GCC bootstrapping +was introduced. In this post we'll describe how GCC works, from the +perspective of someone who wants to port it so we understand what's the job we +have to do. + +1. [Disclaimer](#disclaimer) +2. [Overview](#intro) + 1. [The compiler generation framework](#cgf) + 2. [GCC as a coordinator](#gcc-coordinator) +3. [Source code parsing](#parsing) + 1. [GENERIC](#generic) +4. [GIMPLE](#gimple) +5. [Register Transfer Language](#rtl) + 1. [Target-dependent code](#target-dependent) + 2. [Machine description files](#md) + 1. [Machine modes](#mm) + 2. [RTL Templates](#rtl-templates) + 3. [Target description macros and functions](#target-desc) +6. [Assembly code generation](#assembly) +7. [Summary](#summary) +8. [My job in the backport](#job) +9. [Last words](#last) +10. [Learn more](#more) + +### Disclaimer {#disclaimer} + +- This post may be only valid for old GCC versions, like 4.something, because + that's the one I'm interested in. More recent versions may have different + details, but I don't expect them to be very different to what is described + here. More specifically: I'm working on GCC 4.6.4, and the first GCC with + RISC-V support is GCC 7.0.0. +- This post will focus on how GCC compiles C programs because that's the part + we care about. Some other languages have differences on how they are treated + but that's not very relevant for us, as it has no implications on the + *back-end*. + +Both of these points will get clearer later. + +### Overview {#intro} + +GCC is structured as a pipeline of several steps that run one after the other. + +1. Source code parsing +2. GIMPLE IR generation (target-independent) +3. Some GIMPLE tree optimizations +4. RTL IR generation (target-dependent) +5. RTL optimizer +6. Assembly code generator + +Before starting to analyze each of the steps independently there are a couple +of things to clarify. + +#### The Compiler Generation Framework {#cfg} + +An important point to note is GCC is a **compiler collection** meaning that it +is able to compile code from many high level languages (HLL) and for many +different targets. This has implications on how some steps are mapped to GCC's +source code. + +The most important thing of all this is to differentiate between GCC's code and +an actual `gcc` executable. The key point here is that GCC's codebase includes +what is called CGF (Compiler Generator Framework) that can generate `gcc` +executables from GCC's code. The CGF generates `gcc` executables according to +the input (target machine, host machine...) we give it, but the generated `gcc` +executables may differ one from another even if they were generated from the +same codebase. + +Any `gcc` executable is able to compile any input HLL[^language] (C, C++, +Objective-C, Ada, Fortran and Go[^java]), so GCC's code must include parsers +for each of these languages. + +[^language]: When calling `gcc` you can choose which language you are compiling + using `-x language` option or you can let `gcc` guess from the extension. + +[^java]: And also Java in the past! + +On the other hand, `gcc` executables are only able to generate code for one +target (x86, MIPS, ARM, *RISC-V*...), that must be chosen when GCC is compiled. +In order to make the porting efforts easier, GCC has a set of tools that +generate the target-dependent code from some configuration files called Machine +Descriptions (MD). + +Putting all this together, source code parsing and AST generation depend on the +input HLL, and the code that runs for each HLL is **selected** when `gcc` runs +(*steps 1*). The intermediate representation, GIMPLE, is target-independent so +everything related with that is **copied** inside the final `gcc` executable +(*steps 2 and 3*). The RTL (Register Transfer Language) representation and +assembly code generation are target-dependent and the code related to that is +**generated** from MD files when GCC is compiled (*steps 4, 5 and +6*)[^rtl-opt]. + +[^rtl-opt]: The RTL optimizer contains many steps, most of them being target + independent. That doesn't really matter here, but those are not generated but + copied from GCC's source, as GIMPLE is. + +This all means if we want to be able to read the source code of GCC we have to +have clear in mind how the source code maps to the actual executable, i.e. if +we generate a `gcc` executable for **x86** it won't contain the code for other +architectures and **it won't even check if it was correctly programmed**, +because it's not going to compile it. + + +#### GCC as a coordinator {#gcc-coordinator} + +Many GCC users or C programmers (or me, not that long ago) might think there is +something missing on the list of steps we recently reviewed. The normal usecase +for calling `gcc` like + +``` bash +$ gcc -o helloworld helloworld.c +``` + +does several steps internally that we need to separate: + +1. Preprocessing: the resolution of preprocessor macros like `#define` and + stuff like that. +2. Compiling to assembly: the generation of assembly code files per compilation + unit (a file that is the output of the preprocessor). +3. Assembly: the conversion from an assembly file to ELF object file. +4. Linking: the executable or library generation from the ELF object files + created in the previous step. + +The reality is there's more than one program involved here and `gcc` is just a +coordinator that makes other programs run if needed. + +The preprocessor is called `cpp` and it is generated from the GCC codebase. The +compiler is `gcc` itself but the assembler and the linker are generally +obtained from GNU Binutils' `as` and `ld` respectively. + +So, one of the most important things to understand is GCC only generates +assembly, but it looks like it doesn't[^tinycc-asm]. + +This means we need a proper support for our architecture on the assembler and +the linker too. But we'll keep that story for another day[^post-long]. + +[^tinycc-asm]: Other compilers have different approaches for this. For example, + TinyCC generates machine code directly, without the intermediate assembly + file generation step, and is also able to link the files by itself. + +[^post-long]: This post is already long enough and we only made the + introduction. + +--- + +### Souce code parsing {#parsing} + +> HLL dependent. Generates appropriate IR + +So the first step of the compiler is to process the input text and convert it +to the appropriate Intermediate Representation. The most used intermediate +representation is GENERIC, that was designed for C but fits other procedural +languages pretty well[^fortran]. + +[^fortran]: The case of FORTRAN is a little bit weird, as it generates its own + representation that is later converted to GENERIC, we don't really care about + this at this point. + +This parsing process not really relevant for us, as we want to add a new +target, but it's interesting to note because it gives shape to the codebase. +GCC splits the code for the different input languages in folders named like: +`gcc/$LANGUAGE`. + +#### GENERIC {#generic} + +GENERIC is just a representation, we don't need to care that much about it but +a word is not going to hurt anyone. GENERIC is a tree representation: a set of +nodes with some extra common information. Those nodes can be read at +`gcc/tree.def`. + +A simple example of this could be a function declaration, that would take the +a node of type `FUNCTION_DECL` that has some sub-nodes: one for the return +type, another for the body of the function and another for the arguments of the +function. + +It's a simple AST you could come up with yourselves, except the fact that it is +pretty complex. 😅 + +### GIMPLE {#gimple} + +> HLL- and Target- independent representation + +The next step is called *Gimplification* (see `gimplify.c`), the process of +converting to GIMPLE. Normally, representing the AST as GIMPLE is too complex +to be done in one step, so the GENERIC (+ some extensions) is used as a +previous step that is easier to create. + +GIMPLE is the central internal representation of GCC. It's target-independent +and High-Level-Language-independent. At this point some optimizations can be +applied, those related with the structure of the source code, like loop +unfolding or dead code elimination. + +From the porting perspective, this representation is important, as it's the +border line between the front-end and the back-end, and we are interested in +the latter. A really interesting part is to understand is how is this converted +to the next representation, RTL. + +### Register Transfer Language (RTL) {#rtl} + +> Target-dependent low level representation + +The next part of the compiler work is done using the RTL intermediate +representation. The RTL representation is based on LISP, so we have a reason to +love it, and it serves two purposes: + +1. Specify target properties via the Machine Descriptor files. These Machine + Descriptor files are text files that look like LISP and are processed at + compilation time. +2. Represent a compilation. Meaning that the RTL is also an intermediate + representation, a low-level one, that represents sets of instructions. + +GCC does not make any distinction between the first and the second purpose, +calling both RTL, but there some difference on the purpose and the shape of the +RTL. RTL has both, an internal form represented by structures (case 2) and an +external form represented as a text file (case 1). + +The RTL is formed by a set of objects: expression, integers, wide integers, +strings or vectors. In the textual form they are represented like in LISP, +using double quotes for strings, brackets for vectors... and a lot of +parenthesis. The internal representation you can imagine, structures for +expressions, integer types for integers, `char*` for strings, etc. + +The most interesting RTL objects are expressions, aka RTX, that are just a name +(an expression code) plus the amount of possible arguments. + +This is how a piece of RTL may look, it represents an instruction that sets the +register 0 to the result of the addition of the register 1 and the constant +integer 10 (see `rtl.def` for more information): + +``` lisp +(set (reg 0) + (plus (reg 1) + (const_int 10))) +``` + +In the example the only things that are not expressions are the numbers (0, 1 +and 10), all the rest you can find in `rtl.def` and see what they mean. + +From GIMPLE, there are two steps left to reach our target, assembly code, and +both involve RTL. The first maps the GIMPLE nodes to pattern names in a +target-independent way, generating a list of RTL `insn`s. The second matches +those `insn` lists to RTL templates described in Machine Description files and +uses those matches to generate the final assembly code. + +Those `insn`s are objects that represent code in RTL. Each function is +described with a doubly-linked list of `insn`s. You can think about them as +*instructions* in the RTL world. + +In the first step, the RTL `insn` generation step, only the names matter (and +they are hardcoded in the compiler), while in the second the structure of the +`insn` is going to be analyzed as we'll see later. + +#### Target-dependent code {#target-dependent} + +As we previously said, target-dependent steps are generated at compile time and +then inserted in the final `gcc` executable. All this code is located in one +folder per target, under `gcc/config/$TARGET`, so the CFG is able to load the +target we choose at compile time (using `--target=`) and insert that in the +final executable. + +That is done in different ways depending on the type of file we are working +with: Machine Description files are processed by the programs (`gencodes`, +`gentags`...) that generate C code files from them, while target description +macros and functions, which are C files, are inserted in the building process +as any other C file. + +I'd like to insist here on the fact that the `--target` is the only one to be +processed and loaded and the other possible targets are going to be ignored. +The build process is not going to complain if a target is broken or anything +like that if it isn't the target we chose. It just doesn't care. + +#### Machine Description files {#md} + +Machine Description files (`.md` extension) let us define `insn` patterns, +which are incomplete RTL expressions that can be matched against the `insn` +list generated from the GIMPLE, `attributes` and other interesting things we +may not try to decipher here. + +`define_insn` is a RTX we can use to define new `insn` patterns. It receives +four or five operands: + +1. An optional name. It's going to be used to match against GIMPLE. +2. An RTL template. A vector of *incomplete* RTL expressions which describe how + should the instruction look like. *Incomplete* in this context means it uses + expressions like `match_operand` or `match_operator` which are designed to + match against the RTL `insn` list and see if they are compatible or not. +3. A condition. A final condition to say if the `insn` matches this pattern or + not. +4. An output template. A string that contains the output assembly code for this + `insn`. The string can contain special characters like `%` to define where + should the arguments be inserted. If the output is very complex we can write + C code on this field too. +5. An optional list of attributes. + +This is an actual example from the RISC-V code we are backporting: + +``` lisp +(define_insn "adddi3" + [(set (match_operand:DI 0 "register_operand" "=r,r") + (plus:DI (match_operand:DI 1 "register_operand" "r,r") + (match_operand:DI 2 "arith_operand" "r,I")))] + "TARGET_64BIT" + "add\t%0,%1,%2" + [(set_attr "type" "arith") + (set_attr "mode" "DI")]) +``` + +You can see the name `adddi3` is something like: `add` + `di` + `3`. This means +it's the `add` instruction with the `di` mode and `3` input arguments. That's +the way things are named. + +Next block is a vector with the RTL template. If you try to ignore +`match_operand` expressions you can see the template is not very different to +the RTL example we gave before. In this case it's something like: + +``` lisp +(set (reg 0) + (plus (reg 1) + (reg 2))) +``` + +It's basically storing in the first register the result of the addition of the +other two. + +The next field is the condition. In this case it needs to have `TARGET_64BIT` +defined in order to work because the machine mode is `DI` (we'll explain that +soon). + +The output code is simple, just a RISC-V `add` instruction: + +``` asm +add %0,%1,%2 +``` + +Where `%N` is going to be replaced by the register numbers used as arguments +for this instruction. + +The last field are the attributes, which can be used to define the instruction +size and other kind of things. We are not going to focus on them today. + +##### Machine modes {#mm} + +Machine modes are a way to describe the size of a data object and its +representation. + +- QI: quarter integer +- HI: half integer +- SI: single integer +- DI: double integer +- SF: single floating +- DF: double floating + +And so on. + +The standard `insn` names include machine modes to describe what kind of +instruction they are. The example above is `addddi3`, meaning it uses `di` +machine mode: double integer. That's why it needs the target to be a 64 bit +RISC-V machine. + +Machine modes also appear in some RTL expressions like `plus` or +`match_operand` meaning that they operate in that machine mode, that is, with +that data size and representation. For example `(plus:SI ...)`. + +##### RTL Templates {#rtl-templates} + +`match_*` expressions are what make RTL expressions *incomplete*, because they +are designed to be compared against the `insn` list that comes from the +previous step. + +In the example above we had: + +``` lisp +(set (match_operand:DI 0 "register_operand" "=r,r") + (plus:DI (match_operand:DI 1 "register_operand" "r,r") + (match_operand:DI 2 "arith_operand" "r,I"))) +``` + +`(match_operand N predicate constraint)` is a placeholder for an operand number +`N` of the `insn`. When the `insn` is constructed, the `match_operand` will be +replaced by the corresponding operand of the `insn`. When the template is +trying to match an `insn` the `match_operand` forces the operand number `N` to +match the `predicate` in order to make the `insn` match the template. +The `match_*` expressions are what defines how `insn`s should be. + +The `predicate` is a function name to be called. The function receives two +input arguments: an expression and a machine mode. If the function returns `0` +the function does not match. + +`predicate`s can also be combined in Machine Description files like this: + +``` lisp +(define_predicate "arith_operand" + (ior (match_operand 0 "const_arith_operand") + (match_operand 0 "register_operand"))) +``` + +So the `arith_operand` shown in the example above can be a +`const_arith_operand` *or* (that's what `ior` means) a `register_operand`. +They can be more complex but this is more than enough to understand how they +are built. In the end, they always check against C functions, but you can +combine them with the convenience of the Machine Description files. + +The `constraint` allows to fine-tune the matching. They define if the argument +is in a register or the memory and stuff like that. `r`, for example, means the +operand comes from a register. + + + +There are other matching expressions too, but `match_operand` is the most used +one and it's the one that explains this concept of *incomplete* expressions the +best. + + + +#### Target description macros and functions {#target-desc} + +Apart from the machine descriptor files, there are other files involved. For +example, the constraints defined above need to be defined in code somewhere. + +The most important of these are the target description macros and functions, +normally defined in `gcc/config/$TARGET/$TARGET?(.h|.c)`. The `.c` should +initialize the `targetm` variable, which contains all the machine information +relevant to the compiler. It is initialized like this: + +``` c +struct gcc_target targetm = TARGET_INITIALIZER; +``` + +That `TARGET_INITIALIZER` is a huge macro, defined in `gcc/target.h`, that +initializes the `targetm` structure. This macro is split in smaller macros with +reasonable defaults that may be overwritten by pieces. Each target should have +a file that includes both `target.h` and `target-def.h` and overwrites any +inappropriate default by redefining new macros and ends with the initialization +line we just introduced. This is normally done in +`gcc/config/$TARGET/$TARGET.c`, while the `.h` is normally used to define some +macros that are needed in the `.c` file. + +As a reference, the RISC-V code we need to backport (see +`gcc/config/riscv/riscv.c`) uses the file to introduce the amount of registers, +the type, the size, and that kind of things, and many others. + +All this information contained in `targetm` is used by the compiler to decide +how registers have to be allocated, which ones have preference, the cost of +them, and many other things. + +### Assembly code generation {#assembly} + +Having the previous step clear is enough to understand how does the assembly +generation work. Each of the `insn`s in the list obtained from GIMPLE is going +to be compared against the RTL templates and the best match is going to be +chosen. Once the match is chosen, the corresponding assembly is going to be +generated from the corresponding field of the `define_insn` RTL expression. + +As simple as that, but also that complex. + +Why do I say it's complex? Because many things have to be considered and GCC +does consider them. Each instruction has a size, that has to be considered to +calculate addresses, but also they have some execution time associated and GCC +calculates the best matches to make the final assembly file as optimum as +possible. + +The RTL step has a lot of optimization passes, too. It's a complex step but +it's not really important for us because we just need to make a temporary +compiler that lets us compile a better one. It doesn't really matter if it's +not perfect, at least at this point. + + +### Summary {#summary} + +So, in summary, the process is the following: + +1. The HLL language is parsed to a tree, normally GENERIC. +2. GENERIC is converted to GIMPLE. +3. GIMPLE optimizations are applied. +4. GIMPLE is matched to a `insn` list using pattern names. +6. The `insn` list is matched against the RTL templates defined in the Machine + Description files. +7. RTL optimizations are applied. +8. The matches convert the RTL to assembly code also taking in account the + information obtained from the target definition macros and functions. + +From our perspective, the most important things to remember are these: + +- The front-end is not very relevant for us, from the parsing to GIMPLE we can + ignore for the moment. +- RTL step is pretty complex, and the GIMPLE->RTL conversion is too. +- GCC is a compiler collection that has a very powerful compilation process, + the Compiler Generator Framework (CFG), in order to modularize the code and + make it easier to port. +- The machine description files and the target definition macros and functions + are designed to make the porting process simpler. Those are the only files we + need to touch. + +--- + +
+If you like my job here, consider hiring ElenQ +Technology.
Even if I'm busy with this, I still have some time slots +available. +
+ +--- + + +### My job in the backport {#job} + +With all the process more or less clear, we can be more specific on the job I +need to do. I share some specifics in this section, so if you like reading code +you are going to have some fun[^examples]. + +[^examples]: I'll link to some examples of the code on RISC-V's GitHub account. + This code is already merged in GCC. + +First, I need to make sure all the used RTL expressions are compatible with the +old version of the compiler. If they are not, I have to translate them to the +old way to make them. Some examples of this are iterators like +[`(define_int_iterator ...)`][int-iterator] which is not available in old GCC +versions, so I need to unfold a couple of loops by hand and make them only use +the old constructs. + +[int-iterator]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.md?plain=1#L342 + +Second, I need to convert the target description macros and functions to the +old internal C-based API instead of the more [modern C++-based one][c++api], as +the recent port uses. These changes involve many layers and I didn't yet +analyze this in detail. They can be simple like converting from `rtx_insn` +class to `rtx`, the older way to do this. But they can also be complex, like +removing the 40% of the `#include` directives from `riscv.c`, which has many +that were not available in the past. It's going to be a lot of fun I predict. + +[c++api]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.c + +Third, as this whole compilation process is complex, I decided to make it as +accessible as possible, so other people can audit and replicate my work. For +that I'm using Guix, my package manager of choice. I added a [`guix.scm`][guix] +and [`channels.scm`][channels] file to the repository so my work can be +replicated precisely by myself in the future, or by others[^github]. + +The Guix package also provides a better interaction with the building process +of GCC, letting us replace inputs in a very simple way. I'm thinking about the +next steps of the project here, when we need to compile my backported compiler +with TinyCC, test if it works and then patch TinyCC until it does. Having the +`guix.scm` file makes it easy to replace the current compiler with a patched +TinyCC and ensures nothing is interfering in the compilation process because +the compilation is done in an isolated container. + +[^github]: I'm hosting this on GitHub at the moment because the repository is + huge. I'll probably move all this to my server and edit the post after that. + +[guix]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/guix.scm +[channels]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/channels.scm + +That's mostly the job I need to do in the backport. + +Something to keep in mind is that we don't need to make it perfect, we just +need it to work. The backported GCC is not going to be used as a production +compiler, but just as a bridge with the next GCC version, so **there's only one +program it needs to be able to compile correctly: GCC 7**. Once we make the +bridge with the next version, we can use that to compile anything we want. + + +### Last words + +I know this post is long, and the lack proper diagrams make everything a little +bit hard to understand. That's exactly how I felt reading about GCC, but the +difference was I had to read some documentation that is... About 100 times +longer than this post (see [Learn more](#more) below). Not that bad after all. + +There are many things I decided to leave out, like peephole optimizations, +instruction attributes, and some other constructs that are not that important +from my perspective. You may want to make your research on those on your own. + +In any case, if you have any question you can always contact me[^contact] and +ask me any questions you have or send me some words of support. + +In the next post I'll describe a little bit about ELF, the executable and +linkable format, just the bare minimum to understand the format, as it will be +relevant for us in the future. And you might be thinking, why is it relevant if +GCC compiles to assembly? Well, that's one of the questions that we will be +answering in the next post. + +Now I leave you with a couple of interesting links on the next section. + +Good luck with your ports! + + +[^contact]: You can find my contact info in the [About + page](/pages/about.html). + + + +### Learn more {#more} + +- [The GCC internals documentation](https://gcc.gnu.org/onlinedocs/gccint/): if + you are interested on my work you should read an older version of the + documentation. See [Disclaimer](#disclaimer). +- [The GCC source code](https://gcc.gnu.org/git.html): of course, this has + everything you need to understand GCC, but the problem is that GCC is a huge + codebase, you probably need to spend months reading it in order to understand + everything. That's why I think posts like this one are interesting, they help + you focus on the parts you are interested in. + diff --git a/content/bootstrapGcc/02_elf.md b/content/bootstrapGcc/02_elf.md new file mode 100644 index 0000000..c684c63 --- /dev/null +++ b/content/bootstrapGcc/02_elf.md @@ -0,0 +1,595 @@ +Title: ELF format — why not? +Date: 2022-03-14 +Category: +Tags: Bootstrapping GCC in RISC-V +Slug: bootstrapGcc2 +Lang: en +Summary: + Some introduction to ELF as we'll need to deal with this in the future. + +In the [previous post]({filename}01_internals.md) of the +[series]({tag}Bootstrapping GCC in RISC-V) we introduced GCC and how it +generates assembly code and we left a question unanswered: *"Why is learning +about ELF interesting if GCC generates assembly?"*. In this post we are going +to answer that question (not interesting) and maybe understand the very basics +of ELF file format (more interesting). + + +### What's ELF + +ELF is a file format with two main goals: + +- Represent an executable file +- Represent a linkable file + +Apart from that, ELF can also represent core dumps, but if you think about that +all of the possible options have something in common: they represent contents +on the memory. We can simply say ELF is a file format that acts as a picture of +the state of the memory. In the case of the executables, the state will be +loaded from the file, but in the case of the core dumps the state is obtained +from the memory and dumped in a file. + +Linkable files are those files that can be combined with others to generate +executables or shared objects, so they can also fit that definition because +they are going to end up in the memory anyway. + +For efficiency reasons, the ELF format has two separate views of the same +contents: + +- The **Linking** view is based on sections and needs a *section header*. +- The **Executable** view is based on segments and needs a *program header*. + +#### ELF header + +The ELF header is the only thing that has a fixed position in the file, at the +beginning. The ELF header has information that defines how to identify the +file, the machine, the endianness and that sort of things, but it also says +where are the headers located and identifies the size of their entries and +their entry count. + +It's not that interesting, honestly. The most important thing is it points to +the descriptions to both of the views (the headers) so we can check them. + +#### Linking view + +Based on sections, the linking view is the most detailed view of the file and +it defines how the file should be linked with others in order to create an +executable file. + +Sections, the basic unit of the linking view, are consecutive sequences of +bytes that do not overlap. + +There are [different types of sections according to their possible contents and +meaning](https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA.junk/sections.html), +the most interesting are: + +- `SYMTAB` and `DYNSYM` that hold a symbol table. The `DYNSYM` is for dynamic + linking symbols, while `SYMTAB` normally is used for static linking but may + contain both. +- `STRTAB` holds a string table. +- `RELA` contains relocation entries with addends and `REL` contains + relocations without addends. +- `NOTE` section contains some information of the file. +- `HASH` contains a symbol hash table, necessary for dynamic linking. +- `DYNAMIC` for dynamic linking information. + +Each section has also a `name`, an `address` if it is supposed to appear in the +memory of running process, an `offset` that defines where in the file do the +section's contents appear, a `size`, and some extra data fields that all +together form a section header entry. + +The section header entries are all located where the ELF header says, one after +the other (like a C array of structures), so the programs just need to access +that position in the file and read all the headers in a row. The contents of +the sections are located throughout the file, where the section headers point. + +##### String section + +The string section (`STRTAB`) is one of the simplest. It contains all the +strings of the file: the section and symbol names. It's simply a set of null +terminated strings, written one after the other (it also starts with a null +character but whatever). + +Anywhere in the file where we are supposed to get an string what we get is an +index that points to the first position in this section to read from. We should +read from that until we reach a null character. For example in the following +string section: + +``` + \0 h e l l o \0 n a m e \0 +``` + +If a name of a section says `1`, the actual name of the section is `hello` +and if it says `7` it would be `name`. Also, if it says `9` it would be `me`, +this trick could be used too. + +##### Symbol table + +The symbol table contains information needed to locate and relocate a program's +symbolic definitions and references. The symbol table is formed as an array of +symbol elements that are defined with a `name`, obviously a `value`, their +`size`, some extra `info`, the index of the section header they relate to +(`shndx`) and some `other` stuff. + +The `info` field manages symbol's type (`OBJECT` for data, `FUNC` for +function...) and binding attributes, which define the linking visibility and +behavior of the symbol (local vs global...). + +The `value` can be interpreted in several ways too, depending on the type of +the symbol you are dealing with. But that's not really relevant for us at the +moment. + +##### Relocation + +According to the ELF documentation I got from somewhere I don't really +remember: + +> The relocation is the process of connecting symbolic references with symbolic +> definitions. + +I hope it's more explanatory for you than what it is to me, but I don't have a +clue of what that is supposed to mean. The +[Wikipedia](https://en.wikipedia.org/wiki/Relocation_(computing)) does a **much +better** job in the specifics right here: + +> Relocation is the process of assigning load addresses for position-dependent +> code and data of a program and adjusting the code and data to reflect the +> assigned addresses. + +If this doesn't really help, you have a really good example later, but we can +basically say that it's a way to adjust the code to point to the correct +addresses, at linking or loading, or even execution, time. + +ELF files have, as we said, sections that let us define relocations. These will +point to some parts of the file and tell the linker or the loader that that +positions of the file must be reprocessed. + +There are two types of relocation sections and in both of them the relocation +section is an array of entries where each of them represents one relocation. +In the simple one (`REL`) each relocation only contains an `offset` and an +`info` word, which also includes the type of relocation to apply. The more +complex one (`RELA`) is mostly the same but it includes an `addend` which +includes a constant value to use in calculation of the relocation. + +The calculus of the final addresses are specific to the ISA and the relocation +type, because processors have different instruction formats and different ways +to pack addresses in instructions. RISC-V has no way to pack a full address +inside of an instruction, while x86 does, so they have to patch the +instructions in a different way. + + +##### Special sections + +Some sections have a special treatment according to their name, normally the +ones that start with a dot. These you might have found in the past in assembly +files, defined like `.data` (for data), `.rodata` (for read only data) or +`.text` (for code). + +These are interesting to have in mind because they appear the same way they do +in assembly, and we are going to disassemble some of them and play around with +them. + +Other special sections like `.got` or `.dynamic` don't appear in assembly but +they have a strong meaning in the resulting file, we are not going to deal with +those today because we want to finish this post someday. If you need to deal +with those I recommend you to read ELF's documentation on special sections and +the loading process. + + +#### Executable view + +The executable view is another way to access the same contents, but with a +different perspective. It's based on *segments* rather than *sections*. +Segments are also pieces of the file, as sections are, but segments can contain +one or more sections. + +Like in the linking view, the base unit, sections for the linking view but for +segments for the executable view, are described in a header. The header of the +executable view is called program header and it is, like the section header, a +bunch of structures piled together, each describing one of the segments. + +The program header describes the position and size in the file of each of the +segments but also some important information about them: how they are supposed +to be loaded in the memory and where (virtual address and physical address), +the type of the segment, and some info more. + +The most interesting segment types are the following: + +- `LOAD` is used for loadable segments, with the other fields of the segment + the position and the size this segment will have in memory are described. +- `DYNAMIC` are segments that have some dynamic linking information. It has to + contain the `.dynamic` section. +- `INTERP` gives the location and size of a null-terminated path name to invoke + as an *interpreter*. Interpreter in this context usually means a dynamic + linker, which will be called instead of loading this file to memory and the + dynamic linker will be the one that will load the parts of the file it + considers. + +You can see how segments are interesting for loading the file in the memory, +that is, they are mostly interesting for executable files or shared objects. + +#### Segments vs Sections + +If you want to have a clear idea about the difference between segments and +sections, you can consider a file with multiple sections: `.text`, `.rodata` +and `.data`. + +A file that contains those sections can be understood from a linking +perspective as a file that has some code (`.text`), read-only data (`.rodata`) +and read-write data (`.data`). Each of those parts must be managed in a +different way by the linker, but the reality is that the program loader doesn't +really care about some of the differences of them. + +The code and the read-only data are loaded in the memory in the same way, with +read and execute permission but no write permission, so the executable view can +put both sections in the same segment, and make the loader's life easier. + +Also, the linker doesn't really care about how is the memory loaded so the +section header does not hold that information. It does care about the section's +goals though, as it will need to put them together in order during the linking. +On the other hand, the loader is not really interested on what's the goal of +the contents of the file but only on what to do with those contents, so it only +has that information. + + +### So, why do we need to learn it? + +We don't really need to learn it very deeply, just learn how it works in a +high-level way and make sure we are able to read it with the tools we have +available. The good news for you is if the reasons I give you are not good +enough it doesn't really matter because you already learned[^gotcha]. Continue +reading and you'll realize how much you understand now. + +[^gotcha]: Ha! Gotcha! + +First, let me tell you a personal story. I have previous experience working +with assembly, but only in small devices that have two memories, one for data +and other for code (Hardvard Architecture). In those small devices you often +don't really need to think about how the code and the data is mapped to memory +because your programs are small and the separation is clear. Computers are a +different thing, and I have had issues understanding this whole assembly thing. + +Computers store both code and data in the same memory, the main memory, (Von +Neumann Architecture) and they normally have memory segmentation, pagination, +memory management units and all that kind of stuff, because there are many +processes running and they want to separate one from the other. That forces us +to think about how the code and the data are mapped to the memory. Also, modern +operating systems also use dynamic linkers, which are not available in small +devices, and we need to be able to deal with that amount of complexity. + +ELF allows us to make that all, because it was born for that. ELF is a +distillation of many of the ideas from System V Unix, that include exactly all +I mentioned. It's a great way to understand how memory, linking and processes +work in a *modern* operating system. This is why you need to learn it, at least +a little. It makes you a cultivated person, which is always good[^system-v]. + +[^system-v]: It also makes you understand the complexities of the system so you + can criticize it. Changing the world requires to learn about it first. + +#### The specifics + +As I'm sure you are not satisfied totally with the answer of being a cultivated +person[^some-of-you], let me go for some specifics. + +[^some-of-you]: For those that really are. That's the good attitude in life. + High five. You can read the whole section still, it has interesting points I + think. + +So in this project GCC is not the only software we are dealing with, GNU +Binutils and TinyCC are part of the party too, and I need to make them fit +together in the best way possible. In those I need to make sure the +relocations, formats and other things work properly, following the RISC-V ABI +specification for ELF. That might be a point of failure, so being prepared on a +high-level at least is interesting. + +Of course, GCC's output we need to analyze too, and in order to do that we need +to make sure we know what it means. We already saw that some ELF sections are +directly mentioned in the assembly, so in order to know their meanings ELF is a +good way to understand them. They are really an OS related thing and ELF only +reflects it, but learning them from the ELF perspective makes the path easier +probably. + +Relocations are a huge point in all this mess, because they are machine +specific (instructions are too, but those I expect us to know already), and +they are something I didn't need to research on all the RISC-V adventures I had +last year. I have to do it sometime. + +In general, there are many sharp edges where we can get hurt, so it's better if +we wear gloves. + +### Tools + +For all this process there are a couple of tools that were designed to help. +GNU Binutils has many of them but we are going to focus on two, as they are +more than enough for many usecases: `objdump` and `readelf`. + +The example below uses both of them to analyze a piece of code and its +compilation result. As you'll see, the main problem they have is their output: +it's not always clear, the formatting is a little bit chaotic, it's not +obvious at all to get right and it's really hard to use it procedurally. + +There is a really cool tool you should investigate though, called GNU Poke, +that is designed specifically to fight against those issues. I recommend you to +[take a look to it](https://www.gnu.org/software/poke/). + +### Example + +Starting from a very simple C file we can follow a really interesting process +and understand some of the ELF internals: + +``` c +long global_symbol; + +int main() { + return global_symbol != 0; +} +``` + +We compile it to assembly with: + +``` asdf +$ riscv64-linux-gnu-gcc -S b.c -O0 +``` + +This are the contents of the assembly file: + +``` asm + .file "b.c" + .option pic + .text + .globl global_symbol + .bss + .align 3 + .type global_symbol, @object + .size global_symbol, 8 +global_symbol: + .zero 8 + .text + .align 1 + .globl main + .type main, @function +main: + addi sp,sp,-16 + sd s0,8(sp) + addi s0,sp,16 + lla a5,global_symbol + ld a5,0(a5) + snez a5,a5 + andi a5,a5,0xff + sext.w a5,a5 + mv a0,a5 + ld s0,8(sp) + addi sp,sp,16 + jr ra + .size main, .-main + .ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110" + .section .note.GNU-stack,"",@progbits +``` + +Assemble the file with `as`: + +``` asdf +$ riscv64-linux-gnu-as b.s -o b.o +``` + +And this is what we get in `b.o`. The `.text` section contains the following: + +``` asm +$ riscv64-linux-gnu-objdump --disassemble b.o + +b.o: file format elf64-littleriscv + + +Disassembly of section .text: + +0000000000000000
: + 0: ff010113 addi sp,sp,-16 + 4: 00813423 sd s0,8(sp) + 8: 01010413 addi s0,sp,16 + c: 00000797 auipc a5,0x0 + 10: 00078793 mv a5,a5 + 14: 0007b783 ld a5,0(a5) # c + 18: 00f037b3 snez a5,a5 + 1c: 0ff7f793 andi a5,a5,255 + 20: 0007879b sext.w a5,a5 + 24: 00078513 mv a0,a5 + 28: 00813403 ld s0,8(sp) + 2c: 01010113 addi sp,sp,16 + 30: 00008067 ret +``` + +### Relocations + +There are some relocations! + +``` asdf +$ riscv64-linux-gnu-objdump b.o -r + +b.o: file format elf64-littleriscv + +RELOCATION RECORDS FOR [.text]: +OFFSET TYPE VALUE +000000000000000c R_RISCV_PCREL_HI20 global_symbol +000000000000000c R_RISCV_RELAX *ABS* +0000000000000010 R_RISCV_PCREL_LO12_I .L0 +0000000000000010 R_RISCV_RELAX *ABS* +``` + +But in order to understand those relocations properly we need to check the +value of the symbols too: + +``` asdf +$ riscv64-linux-gnu-objdump -t b.o + +b.o: file format elf64-littleriscv + +SYMBOL TABLE: +0000000000000000 l df *ABS* 0000000000000000 b.c +0000000000000000 l d .text 0000000000000000 .text +0000000000000000 l d .data 0000000000000000 .data +0000000000000000 l d .bss 0000000000000000 .bss +0000000000000000 l d .note.GNU-stack 0000000000000000 .note.GNU-stack +000000000000000c l .text 0000000000000000 .L0 +0000000000000000 l d .comment 0000000000000000 .comment +0000000000000000 g O .bss 0000000000000008 global_symbol +0000000000000000 g F .text 0000000000000034 main +``` + +If you pay attention to the offsets of those relocations (`0x0c` and `0x10`) +they exactly match the instructions `auipc a5, 0x0` and `mv a5, a5` and those +are expanded from the `lla a5, global_symbol` (load local address) +pseudoinstruction from the assembly. + +The `mv` is not really a `mv`. `mv` is a pseudoinstruction too, that should be +expanded to an `addi a5, a5, 0`. The `objdump` is playing with us, making the +opposite conversion so we can read better but in fact is tricking us. + +The `auipc` + `addi` couple in RISC-V appears pretty often, because it's the +method it has to load addresses in memory. The first instruction, `auipc` adds +a high part of an immediate to the program counter and stores the result in a +register, the `addi` adds then another, in this case low, immediate to the +register i.e. they make a `x[reg] = pc + immediate` operation in two steps: +`x[reg] = pc + hi20(immediate)` followed by `x[reg] = x[reg] + lo12(immediate)`. + +As we have relocations in both `auipc` and `addi` this means their `0` values +(the immediates) are going to be overwritten with something else at linking +time, and there's when RISC-V has something to say. All the relocations we can +see are RISC-V specific, and you can read about them in [RISC-V ABI +Specification](https://github.com/riscv-non-isa/riscv-elf-psabi-doc). + +In our case we have really some simple ones, the easiest to understand (what a +coincidence, huh?): + +> `R_RISCV_PCREL_HI20`: High 20 bits of 32-bit PC-relative reference, +> `%pcrel_hi(symbol)`. The formula is: `S+A-P` [but only obtains the highest 20 +> bits]. + +> `R_RISCV_PCREL_LO12_I`: Low 12 bits of a 32-bit PC-relative, +> `%pcrel_lo(address of %pcrel_hi)`, the addend must be 0. The formula is: +> `S-P` [but it only obtains the lowest 12 bits]. + +Both the `HI20` and the `LO12` have a similar formula, this is the meaning of +the elements on the formula: + +- `S`: Address of the symbol +- `A`: Addend of the relocation +- `P`: Position of the relocation + +If you match their formulas with the description of what we just said about how +do `auipc` + `addi` couples work, you can easily understand the formulas and +their meaning. We are not going to do it, do something yourself! + +The other relocation: + +> `R_RISCV_RELAX`: Instruction can be relaxed, paired with a normal relocation +> at the same address. + +Is an addition our example doesn't use but it could. The `R_RISCV_RELAX` +basically means that if the relocation it points at is not needed it can be +discarded. And when does that happen? Easy, when we can get `global_symbol`'s +address with only one of them, we can remove the other instruction from the +program. + +#### Relocation resolution + +If we link the file and generate an executable, we can see the final value +those zeroes get. + +``` asdf +$ riscv64-linux-gnu-gcc b.o -o b.out +``` +We link it like this because `ld` needs a lot of input fields and we don't want +to set them all by hand, but you can do it with `ld` if you feel like it. + +``` asdf +$ riscv64-linux-gnu-objdump --disassemble b.out +... +00000000000005e4
: + 5e4: ff010113 addi sp,sp,-16 + 5e8: 00813423 sd s0,8(sp) + 5ec: 01010413 addi s0,sp,16 + 5f0: 00002797 auipc a5,0x2 + 5f4: a6878793 addi a5,a5,-1432 # 2058 + 5f8: 0007b783 ld a5,0(a5) + 5fc: 00f037b3 snez a5,a5 + 600: 0ff7f793 andi a5,a5,255 + 604: 0007879b sext.w a5,a5 + 608: 00078513 mv a0,a5 + 60c: 00813403 ld s0,8(sp) + 610: 01010113 addi sp,sp,16 + 614: 00008067 ret +... +``` + +There you see the relocation was resolved (`0x5f0` and `0x5f4`) by the linker +and the final values have been added. `objdump` is intelligent enough to tell +us where are those instructions pointing (says `2058 `). Just to +make sure we can search in the symbol table for the `global_symbol`: + +``` asdf +$ riscv64-linux-gnu-objdump -t b.out | grep global_symbol +0000000000002058 g O .bss 0000000000000008 global_symbol +``` + +> NOTE: We could try to calculate the address of the `global_symbol` as the +> linker did, but it's a little bit complicated because we also linked the file +> with the standard library and the startup files, which adds the `crt` files +> on top of the file. It's really that we get more code than what we had in the +> assembly file. If you want to see that, you can see the rest of the output of +> the command, or even try with `--disassemble-all` and calculate the symbol +> address by hand. Good luck. + +#### More sections + +If you want the review some simple things, like a string section, you can use +`readelf` for that. The `-p` flag (equivalent to `--string-dump=`) displays the +contents of the section as strings. You can read the `.comment` section that +way: + +``` asdf +$ riscv64-linux-gnu-readelf -p .comment b.o + +String dump of section '.comment': + [ 1] GCC: (Debian 10.2.1-6) 10.2.1 20210110 +``` + +This is what we had inserted in `.ident` on the assembly file by the compiler. +We have it in the binary too. + +In other distros the output is a little bit different. Look the output we have +in Guix: + +``` asdf +String dump of section '.comment': + [ 1] GCC: (GNU) 11.2.0 +``` + +### Conclusion + +So this whole this just to explain that ELF files are some kind of dual files +that have two different goals at the same time. The executable one is kind of a +picture of the memory state that can be used for loading that state in the +memory, while the linking one just describes how different parts of the +contents relate to each other and has tons of funny tricks to make the files +relocatable, position independent and that kind of things. Cool. + +There are still many fields of ELF we didn't talk about but I consider this +introduction more than enough. Having a simple understanding about how is the +file organized and what kind of information it has is probably enough for the +things we are going to need. + +The proposed example shows that with the knowledge obtained by this short +introduction we can dig a little bit on the files that result from a +compilation and analyze their internals. That's mostly the work I'll need to do +when I start combining compilers in a pipeline of death and destruction. + +If I ever need to dig on something deeper, I'll do. + +Anyway, I'm still unsure if I answered the question we left in the previous +post[^cliff]: + +> Why is learning about ELF interesting if GCC generates assembly? + +Did I? + +[^cliff]: It was a good cliffhanger, though. diff --git a/content/bootstrapGcc/NGIAssure.svg b/content/bootstrapGcc/NGIAssure.svg new file mode 100644 index 0000000..62b7c12 --- /dev/null +++ b/content/bootstrapGcc/NGIAssure.svg @@ -0,0 +1 @@ +Logo-NGIAssure-tag diff --git a/content/bootstrapGcc/nlnet.svg b/content/bootstrapGcc/nlnet.svg new file mode 100644 index 0000000..373c8d8 --- /dev/null +++ b/content/bootstrapGcc/nlnet.svg @@ -0,0 +1,34 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + -- cgit v1.2.3