From 57ebc576bf74ca887f9c77cb6b7771b9ff2b1843 Mon Sep 17 00:00:00 2001 From: Ekaitz Zarraga Date: Mon, 14 Mar 2022 23:52:34 +0100 Subject: NlNet project posts 0-2 --- content/bootstrapGcc/01_internals.md | 621 +++++++++++++++++++++++++++++++++++ 1 file changed, 621 insertions(+) create mode 100644 content/bootstrapGcc/01_internals.md (limited to 'content/bootstrapGcc/01_internals.md') diff --git a/content/bootstrapGcc/01_internals.md b/content/bootstrapGcc/01_internals.md new file mode 100644 index 0000000..a6009e1 --- /dev/null +++ b/content/bootstrapGcc/01_internals.md @@ -0,0 +1,621 @@ +Title: GCC internals — From a porting perspective +Date: 2022-03-08 +Category: +Tags: Bootstrapping GCC in RISC-V +Slug: bootstrapGcc1 +Lang: en +Summary: + Deep diving into GCC's internals from the perspective of someone who + wants to port GCC for a new architecture. + + +In the [previous post]({filename}00_intro.md) of the +[series]({tag}Bootstrapping GCC in RISC-V) the problem of the GCC bootstrapping +was introduced. In this post we'll describe how GCC works, from the +perspective of someone who wants to port it so we understand what's the job we +have to do. + +1. [Disclaimer](#disclaimer) +2. [Overview](#intro) + 1. [The compiler generation framework](#cgf) + 2. [GCC as a coordinator](#gcc-coordinator) +3. [Source code parsing](#parsing) + 1. [GENERIC](#generic) +4. [GIMPLE](#gimple) +5. [Register Transfer Language](#rtl) + 1. [Target-dependent code](#target-dependent) + 2. [Machine description files](#md) + 1. [Machine modes](#mm) + 2. [RTL Templates](#rtl-templates) + 3. [Target description macros and functions](#target-desc) +6. [Assembly code generation](#assembly) +7. [Summary](#summary) +8. [My job in the backport](#job) +9. [Last words](#last) +10. [Learn more](#more) + +### Disclaimer {#disclaimer} + +- This post may be only valid for old GCC versions, like 4.something, because + that's the one I'm interested in. More recent versions may have different + details, but I don't expect them to be very different to what is described + here. More specifically: I'm working on GCC 4.6.4, and the first GCC with + RISC-V support is GCC 7.0.0. +- This post will focus on how GCC compiles C programs because that's the part + we care about. Some other languages have differences on how they are treated + but that's not very relevant for us, as it has no implications on the + *back-end*. + +Both of these points will get clearer later. + +### Overview {#intro} + +GCC is structured as a pipeline of several steps that run one after the other. + +1. Source code parsing +2. GIMPLE IR generation (target-independent) +3. Some GIMPLE tree optimizations +4. RTL IR generation (target-dependent) +5. RTL optimizer +6. Assembly code generator + +Before starting to analyze each of the steps independently there are a couple +of things to clarify. + +#### The Compiler Generation Framework {#cfg} + +An important point to note is GCC is a **compiler collection** meaning that it +is able to compile code from many high level languages (HLL) and for many +different targets. This has implications on how some steps are mapped to GCC's +source code. + +The most important thing of all this is to differentiate between GCC's code and +an actual `gcc` executable. The key point here is that GCC's codebase includes +what is called CGF (Compiler Generator Framework) that can generate `gcc` +executables from GCC's code. The CGF generates `gcc` executables according to +the input (target machine, host machine...) we give it, but the generated `gcc` +executables may differ one from another even if they were generated from the +same codebase. + +Any `gcc` executable is able to compile any input HLL[^language] (C, C++, +Objective-C, Ada, Fortran and Go[^java]), so GCC's code must include parsers +for each of these languages. + +[^language]: When calling `gcc` you can choose which language you are compiling + using `-x language` option or you can let `gcc` guess from the extension. + +[^java]: And also Java in the past! + +On the other hand, `gcc` executables are only able to generate code for one +target (x86, MIPS, ARM, *RISC-V*...), that must be chosen when GCC is compiled. +In order to make the porting efforts easier, GCC has a set of tools that +generate the target-dependent code from some configuration files called Machine +Descriptions (MD). + +Putting all this together, source code parsing and AST generation depend on the +input HLL, and the code that runs for each HLL is **selected** when `gcc` runs +(*steps 1*). The intermediate representation, GIMPLE, is target-independent so +everything related with that is **copied** inside the final `gcc` executable +(*steps 2 and 3*). The RTL (Register Transfer Language) representation and +assembly code generation are target-dependent and the code related to that is +**generated** from MD files when GCC is compiled (*steps 4, 5 and +6*)[^rtl-opt]. + +[^rtl-opt]: The RTL optimizer contains many steps, most of them being target + independent. That doesn't really matter here, but those are not generated but + copied from GCC's source, as GIMPLE is. + +This all means if we want to be able to read the source code of GCC we have to +have clear in mind how the source code maps to the actual executable, i.e. if +we generate a `gcc` executable for **x86** it won't contain the code for other +architectures and **it won't even check if it was correctly programmed**, +because it's not going to compile it. + + +#### GCC as a coordinator {#gcc-coordinator} + +Many GCC users or C programmers (or me, not that long ago) might think there is +something missing on the list of steps we recently reviewed. The normal usecase +for calling `gcc` like + +``` bash +$ gcc -o helloworld helloworld.c +``` + +does several steps internally that we need to separate: + +1. Preprocessing: the resolution of preprocessor macros like `#define` and + stuff like that. +2. Compiling to assembly: the generation of assembly code files per compilation + unit (a file that is the output of the preprocessor). +3. Assembly: the conversion from an assembly file to ELF object file. +4. Linking: the executable or library generation from the ELF object files + created in the previous step. + +The reality is there's more than one program involved here and `gcc` is just a +coordinator that makes other programs run if needed. + +The preprocessor is called `cpp` and it is generated from the GCC codebase. The +compiler is `gcc` itself but the assembler and the linker are generally +obtained from GNU Binutils' `as` and `ld` respectively. + +So, one of the most important things to understand is GCC only generates +assembly, but it looks like it doesn't[^tinycc-asm]. + +This means we need a proper support for our architecture on the assembler and +the linker too. But we'll keep that story for another day[^post-long]. + +[^tinycc-asm]: Other compilers have different approaches for this. For example, + TinyCC generates machine code directly, without the intermediate assembly + file generation step, and is also able to link the files by itself. + +[^post-long]: This post is already long enough and we only made the + introduction. + +--- + +### Souce code parsing {#parsing} + +> HLL dependent. Generates appropriate IR + +So the first step of the compiler is to process the input text and convert it +to the appropriate Intermediate Representation. The most used intermediate +representation is GENERIC, that was designed for C but fits other procedural +languages pretty well[^fortran]. + +[^fortran]: The case of FORTRAN is a little bit weird, as it generates its own + representation that is later converted to GENERIC, we don't really care about + this at this point. + +This parsing process not really relevant for us, as we want to add a new +target, but it's interesting to note because it gives shape to the codebase. +GCC splits the code for the different input languages in folders named like: +`gcc/$LANGUAGE`. + +#### GENERIC {#generic} + +GENERIC is just a representation, we don't need to care that much about it but +a word is not going to hurt anyone. GENERIC is a tree representation: a set of +nodes with some extra common information. Those nodes can be read at +`gcc/tree.def`. + +A simple example of this could be a function declaration, that would take the +a node of type `FUNCTION_DECL` that has some sub-nodes: one for the return +type, another for the body of the function and another for the arguments of the +function. + +It's a simple AST you could come up with yourselves, except the fact that it is +pretty complex. 😅 + +### GIMPLE {#gimple} + +> HLL- and Target- independent representation + +The next step is called *Gimplification* (see `gimplify.c`), the process of +converting to GIMPLE. Normally, representing the AST as GIMPLE is too complex +to be done in one step, so the GENERIC (+ some extensions) is used as a +previous step that is easier to create. + +GIMPLE is the central internal representation of GCC. It's target-independent +and High-Level-Language-independent. At this point some optimizations can be +applied, those related with the structure of the source code, like loop +unfolding or dead code elimination. + +From the porting perspective, this representation is important, as it's the +border line between the front-end and the back-end, and we are interested in +the latter. A really interesting part is to understand is how is this converted +to the next representation, RTL. + +### Register Transfer Language (RTL) {#rtl} + +> Target-dependent low level representation + +The next part of the compiler work is done using the RTL intermediate +representation. The RTL representation is based on LISP, so we have a reason to +love it, and it serves two purposes: + +1. Specify target properties via the Machine Descriptor files. These Machine + Descriptor files are text files that look like LISP and are processed at + compilation time. +2. Represent a compilation. Meaning that the RTL is also an intermediate + representation, a low-level one, that represents sets of instructions. + +GCC does not make any distinction between the first and the second purpose, +calling both RTL, but there some difference on the purpose and the shape of the +RTL. RTL has both, an internal form represented by structures (case 2) and an +external form represented as a text file (case 1). + +The RTL is formed by a set of objects: expression, integers, wide integers, +strings or vectors. In the textual form they are represented like in LISP, +using double quotes for strings, brackets for vectors... and a lot of +parenthesis. The internal representation you can imagine, structures for +expressions, integer types for integers, `char*` for strings, etc. + +The most interesting RTL objects are expressions, aka RTX, that are just a name +(an expression code) plus the amount of possible arguments. + +This is how a piece of RTL may look, it represents an instruction that sets the +register 0 to the result of the addition of the register 1 and the constant +integer 10 (see `rtl.def` for more information): + +``` lisp +(set (reg 0) + (plus (reg 1) + (const_int 10))) +``` + +In the example the only things that are not expressions are the numbers (0, 1 +and 10), all the rest you can find in `rtl.def` and see what they mean. + +From GIMPLE, there are two steps left to reach our target, assembly code, and +both involve RTL. The first maps the GIMPLE nodes to pattern names in a +target-independent way, generating a list of RTL `insn`s. The second matches +those `insn` lists to RTL templates described in Machine Description files and +uses those matches to generate the final assembly code. + +Those `insn`s are objects that represent code in RTL. Each function is +described with a doubly-linked list of `insn`s. You can think about them as +*instructions* in the RTL world. + +In the first step, the RTL `insn` generation step, only the names matter (and +they are hardcoded in the compiler), while in the second the structure of the +`insn` is going to be analyzed as we'll see later. + +#### Target-dependent code {#target-dependent} + +As we previously said, target-dependent steps are generated at compile time and +then inserted in the final `gcc` executable. All this code is located in one +folder per target, under `gcc/config/$TARGET`, so the CFG is able to load the +target we choose at compile time (using `--target=`) and insert that in the +final executable. + +That is done in different ways depending on the type of file we are working +with: Machine Description files are processed by the programs (`gencodes`, +`gentags`...) that generate C code files from them, while target description +macros and functions, which are C files, are inserted in the building process +as any other C file. + +I'd like to insist here on the fact that the `--target` is the only one to be +processed and loaded and the other possible targets are going to be ignored. +The build process is not going to complain if a target is broken or anything +like that if it isn't the target we chose. It just doesn't care. + +#### Machine Description files {#md} + +Machine Description files (`.md` extension) let us define `insn` patterns, +which are incomplete RTL expressions that can be matched against the `insn` +list generated from the GIMPLE, `attributes` and other interesting things we +may not try to decipher here. + +`define_insn` is a RTX we can use to define new `insn` patterns. It receives +four or five operands: + +1. An optional name. It's going to be used to match against GIMPLE. +2. An RTL template. A vector of *incomplete* RTL expressions which describe how + should the instruction look like. *Incomplete* in this context means it uses + expressions like `match_operand` or `match_operator` which are designed to + match against the RTL `insn` list and see if they are compatible or not. +3. A condition. A final condition to say if the `insn` matches this pattern or + not. +4. An output template. A string that contains the output assembly code for this + `insn`. The string can contain special characters like `%` to define where + should the arguments be inserted. If the output is very complex we can write + C code on this field too. +5. An optional list of attributes. + +This is an actual example from the RISC-V code we are backporting: + +``` lisp +(define_insn "adddi3" + [(set (match_operand:DI 0 "register_operand" "=r,r") + (plus:DI (match_operand:DI 1 "register_operand" "r,r") + (match_operand:DI 2 "arith_operand" "r,I")))] + "TARGET_64BIT" + "add\t%0,%1,%2" + [(set_attr "type" "arith") + (set_attr "mode" "DI")]) +``` + +You can see the name `adddi3` is something like: `add` + `di` + `3`. This means +it's the `add` instruction with the `di` mode and `3` input arguments. That's +the way things are named. + +Next block is a vector with the RTL template. If you try to ignore +`match_operand` expressions you can see the template is not very different to +the RTL example we gave before. In this case it's something like: + +``` lisp +(set (reg 0) + (plus (reg 1) + (reg 2))) +``` + +It's basically storing in the first register the result of the addition of the +other two. + +The next field is the condition. In this case it needs to have `TARGET_64BIT` +defined in order to work because the machine mode is `DI` (we'll explain that +soon). + +The output code is simple, just a RISC-V `add` instruction: + +``` asm +add %0,%1,%2 +``` + +Where `%N` is going to be replaced by the register numbers used as arguments +for this instruction. + +The last field are the attributes, which can be used to define the instruction +size and other kind of things. We are not going to focus on them today. + +##### Machine modes {#mm} + +Machine modes are a way to describe the size of a data object and its +representation. + +- QI: quarter integer +- HI: half integer +- SI: single integer +- DI: double integer +- SF: single floating +- DF: double floating + +And so on. + +The standard `insn` names include machine modes to describe what kind of +instruction they are. The example above is `addddi3`, meaning it uses `di` +machine mode: double integer. That's why it needs the target to be a 64 bit +RISC-V machine. + +Machine modes also appear in some RTL expressions like `plus` or +`match_operand` meaning that they operate in that machine mode, that is, with +that data size and representation. For example `(plus:SI ...)`. + +##### RTL Templates {#rtl-templates} + +`match_*` expressions are what make RTL expressions *incomplete*, because they +are designed to be compared against the `insn` list that comes from the +previous step. + +In the example above we had: + +``` lisp +(set (match_operand:DI 0 "register_operand" "=r,r") + (plus:DI (match_operand:DI 1 "register_operand" "r,r") + (match_operand:DI 2 "arith_operand" "r,I"))) +``` + +`(match_operand N predicate constraint)` is a placeholder for an operand number +`N` of the `insn`. When the `insn` is constructed, the `match_operand` will be +replaced by the corresponding operand of the `insn`. When the template is +trying to match an `insn` the `match_operand` forces the operand number `N` to +match the `predicate` in order to make the `insn` match the template. +The `match_*` expressions are what defines how `insn`s should be. + +The `predicate` is a function name to be called. The function receives two +input arguments: an expression and a machine mode. If the function returns `0` +the function does not match. + +`predicate`s can also be combined in Machine Description files like this: + +``` lisp +(define_predicate "arith_operand" + (ior (match_operand 0 "const_arith_operand") + (match_operand 0 "register_operand"))) +``` + +So the `arith_operand` shown in the example above can be a +`const_arith_operand` *or* (that's what `ior` means) a `register_operand`. +They can be more complex but this is more than enough to understand how they +are built. In the end, they always check against C functions, but you can +combine them with the convenience of the Machine Description files. + +The `constraint` allows to fine-tune the matching. They define if the argument +is in a register or the memory and stuff like that. `r`, for example, means the +operand comes from a register. + + + +There are other matching expressions too, but `match_operand` is the most used +one and it's the one that explains this concept of *incomplete* expressions the +best. + + + +#### Target description macros and functions {#target-desc} + +Apart from the machine descriptor files, there are other files involved. For +example, the constraints defined above need to be defined in code somewhere. + +The most important of these are the target description macros and functions, +normally defined in `gcc/config/$TARGET/$TARGET?(.h|.c)`. The `.c` should +initialize the `targetm` variable, which contains all the machine information +relevant to the compiler. It is initialized like this: + +``` c +struct gcc_target targetm = TARGET_INITIALIZER; +``` + +That `TARGET_INITIALIZER` is a huge macro, defined in `gcc/target.h`, that +initializes the `targetm` structure. This macro is split in smaller macros with +reasonable defaults that may be overwritten by pieces. Each target should have +a file that includes both `target.h` and `target-def.h` and overwrites any +inappropriate default by redefining new macros and ends with the initialization +line we just introduced. This is normally done in +`gcc/config/$TARGET/$TARGET.c`, while the `.h` is normally used to define some +macros that are needed in the `.c` file. + +As a reference, the RISC-V code we need to backport (see +`gcc/config/riscv/riscv.c`) uses the file to introduce the amount of registers, +the type, the size, and that kind of things, and many others. + +All this information contained in `targetm` is used by the compiler to decide +how registers have to be allocated, which ones have preference, the cost of +them, and many other things. + +### Assembly code generation {#assembly} + +Having the previous step clear is enough to understand how does the assembly +generation work. Each of the `insn`s in the list obtained from GIMPLE is going +to be compared against the RTL templates and the best match is going to be +chosen. Once the match is chosen, the corresponding assembly is going to be +generated from the corresponding field of the `define_insn` RTL expression. + +As simple as that, but also that complex. + +Why do I say it's complex? Because many things have to be considered and GCC +does consider them. Each instruction has a size, that has to be considered to +calculate addresses, but also they have some execution time associated and GCC +calculates the best matches to make the final assembly file as optimum as +possible. + +The RTL step has a lot of optimization passes, too. It's a complex step but +it's not really important for us because we just need to make a temporary +compiler that lets us compile a better one. It doesn't really matter if it's +not perfect, at least at this point. + + +### Summary {#summary} + +So, in summary, the process is the following: + +1. The HLL language is parsed to a tree, normally GENERIC. +2. GENERIC is converted to GIMPLE. +3. GIMPLE optimizations are applied. +4. GIMPLE is matched to a `insn` list using pattern names. +6. The `insn` list is matched against the RTL templates defined in the Machine + Description files. +7. RTL optimizations are applied. +8. The matches convert the RTL to assembly code also taking in account the + information obtained from the target definition macros and functions. + +From our perspective, the most important things to remember are these: + +- The front-end is not very relevant for us, from the parsing to GIMPLE we can + ignore for the moment. +- RTL step is pretty complex, and the GIMPLE->RTL conversion is too. +- GCC is a compiler collection that has a very powerful compilation process, + the Compiler Generator Framework (CFG), in order to modularize the code and + make it easier to port. +- The machine description files and the target definition macros and functions + are designed to make the porting process simpler. Those are the only files we + need to touch. + +--- + +
+If you like my job here, consider hiring ElenQ +Technology.
Even if I'm busy with this, I still have some time slots +available. +
+ +--- + + +### My job in the backport {#job} + +With all the process more or less clear, we can be more specific on the job I +need to do. I share some specifics in this section, so if you like reading code +you are going to have some fun[^examples]. + +[^examples]: I'll link to some examples of the code on RISC-V's GitHub account. + This code is already merged in GCC. + +First, I need to make sure all the used RTL expressions are compatible with the +old version of the compiler. If they are not, I have to translate them to the +old way to make them. Some examples of this are iterators like +[`(define_int_iterator ...)`][int-iterator] which is not available in old GCC +versions, so I need to unfold a couple of loops by hand and make them only use +the old constructs. + +[int-iterator]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.md?plain=1#L342 + +Second, I need to convert the target description macros and functions to the +old internal C-based API instead of the more [modern C++-based one][c++api], as +the recent port uses. These changes involve many layers and I didn't yet +analyze this in detail. They can be simple like converting from `rtx_insn` +class to `rtx`, the older way to do this. But they can also be complex, like +removing the 40% of the `#include` directives from `riscv.c`, which has many +that were not available in the past. It's going to be a lot of fun I predict. + +[c++api]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.c + +Third, as this whole compilation process is complex, I decided to make it as +accessible as possible, so other people can audit and replicate my work. For +that I'm using Guix, my package manager of choice. I added a [`guix.scm`][guix] +and [`channels.scm`][channels] file to the repository so my work can be +replicated precisely by myself in the future, or by others[^github]. + +The Guix package also provides a better interaction with the building process +of GCC, letting us replace inputs in a very simple way. I'm thinking about the +next steps of the project here, when we need to compile my backported compiler +with TinyCC, test if it works and then patch TinyCC until it does. Having the +`guix.scm` file makes it easy to replace the current compiler with a patched +TinyCC and ensures nothing is interfering in the compilation process because +the compilation is done in an isolated container. + +[^github]: I'm hosting this on GitHub at the moment because the repository is + huge. I'll probably move all this to my server and edit the post after that. + +[guix]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/guix.scm +[channels]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/channels.scm + +That's mostly the job I need to do in the backport. + +Something to keep in mind is that we don't need to make it perfect, we just +need it to work. The backported GCC is not going to be used as a production +compiler, but just as a bridge with the next GCC version, so **there's only one +program it needs to be able to compile correctly: GCC 7**. Once we make the +bridge with the next version, we can use that to compile anything we want. + + +### Last words + +I know this post is long, and the lack proper diagrams make everything a little +bit hard to understand. That's exactly how I felt reading about GCC, but the +difference was I had to read some documentation that is... About 100 times +longer than this post (see [Learn more](#more) below). Not that bad after all. + +There are many things I decided to leave out, like peephole optimizations, +instruction attributes, and some other constructs that are not that important +from my perspective. You may want to make your research on those on your own. + +In any case, if you have any question you can always contact me[^contact] and +ask me any questions you have or send me some words of support. + +In the next post I'll describe a little bit about ELF, the executable and +linkable format, just the bare minimum to understand the format, as it will be +relevant for us in the future. And you might be thinking, why is it relevant if +GCC compiles to assembly? Well, that's one of the questions that we will be +answering in the next post. + +Now I leave you with a couple of interesting links on the next section. + +Good luck with your ports! + + +[^contact]: You can find my contact info in the [About + page](/pages/about.html). + + + +### Learn more {#more} + +- [The GCC internals documentation](https://gcc.gnu.org/onlinedocs/gccint/): if + you are interested on my work you should read an older version of the + documentation. See [Disclaimer](#disclaimer). +- [The GCC source code](https://gcc.gnu.org/git.html): of course, this has + everything you need to understand GCC, but the problem is that GCC is a huge + codebase, you probably need to spend months reading it in order to understand + everything. That's why I think posts like this one are interesting, they help + you focus on the parts you are interested in. + -- cgit v1.2.3