Title: GCC internals — From a porting perspective Date: 2022-03-08 Category: Tags: Bootstrapping GCC in RISC-V Slug: bootstrapGcc1 Lang: en Summary: Deep diving into GCC's internals from the perspective of someone who wants to port GCC for a new architecture. In the [previous post]({filename}00_intro.md) of the [series]({tag}Bootstrapping GCC in RISC-V) the problem of the GCC bootstrapping was introduced. In this post we'll describe how GCC works, from the perspective of someone who wants to port it so we understand what's the job we have to do. 1. [Disclaimer](#disclaimer) 2. [Overview](#intro) 1. [The compiler generation framework](#cgf) 2. [GCC as a coordinator](#gcc-coordinator) 3. [Source code parsing](#parsing) 1. [GENERIC](#generic) 4. [GIMPLE](#gimple) 5. [Register Transfer Language](#rtl) 1. [Target-dependent code](#target-dependent) 2. [Machine description files](#md) 1. [Machine modes](#mm) 2. [RTL Templates](#rtl-templates) 3. [Target description macros and functions](#target-desc) 6. [Assembly code generation](#assembly) 7. [Summary](#summary) 8. [My job in the backport](#job) 9. [Last words](#last) 10. [Learn more](#more) ### Disclaimer {#disclaimer} - This post may be only valid for old GCC versions, like 4.something, because that's the one I'm interested in. More recent versions may have different details, but I don't expect them to be very different to what is described here. More specifically: I'm working on GCC 4.6.4, and the first GCC with RISC-V support is GCC 7.0.0. - This post will focus on how GCC compiles C programs because that's the part we care about. Some other languages have differences on how they are treated but that's not very relevant for us, as it has no implications on the *back-end*. Both of these points will get clearer later. ### Overview {#intro} GCC is structured as a pipeline of several steps that run one after the other. 1. Source code parsing 2. GIMPLE IR generation (target-independent) 3. Some GIMPLE tree optimizations 4. RTL IR generation (target-dependent) 5. RTL optimizer 6. Assembly code generator Before starting to analyze each of the steps independently there are a couple of things to clarify. #### The Compiler Generation Framework {#cfg} An important point to note is GCC is a **compiler collection** meaning that it is able to compile code from many high level languages (HLL) and for many different targets. This has implications on how some steps are mapped to GCC's source code. The most important thing of all this is to differentiate between GCC's code and an actual `gcc` executable. The key point here is that GCC's codebase includes what is called CGF (Compiler Generator Framework) that can generate `gcc` executables from GCC's code. The CGF generates `gcc` executables according to the input (target machine, host machine...) we give it, but the generated `gcc` executables may differ one from another even if they were generated from the same codebase. Any `gcc` executable is able to compile any input HLL[^language] (C, C++, Objective-C, Ada, Fortran and Go[^java]), so GCC's code must include parsers for each of these languages. [^language]: When calling `gcc` you can choose which language you are compiling using `-x language` option or you can let `gcc` guess from the extension. [^java]: And also Java in the past! On the other hand, `gcc` executables are only able to generate code for one target (x86, MIPS, ARM, *RISC-V*...), that must be chosen when GCC is compiled. In order to make the porting efforts easier, GCC has a set of tools that generate the target-dependent code from some configuration files called Machine Descriptions (MD). Putting all this together, source code parsing and AST generation depend on the input HLL, and the code that runs for each HLL is **selected** when `gcc` runs (*steps 1*). The intermediate representation, GIMPLE, is target-independent so everything related with that is **copied** inside the final `gcc` executable (*steps 2 and 3*). The RTL (Register Transfer Language) representation and assembly code generation are target-dependent and the code related to that is **generated** from MD files when GCC is compiled (*steps 4, 5 and 6*)[^rtl-opt]. [^rtl-opt]: The RTL optimizer contains many steps, most of them being target independent. That doesn't really matter here, but those are not generated but copied from GCC's source, as GIMPLE is. This all means if we want to be able to read the source code of GCC we have to have clear in mind how the source code maps to the actual executable, i.e. if we generate a `gcc` executable for **x86** it won't contain the code for other architectures and **it won't even check if it was correctly programmed**, because it's not going to compile it. #### GCC as a coordinator {#gcc-coordinator} Many GCC users or C programmers (or me, not that long ago) might think there is something missing on the list of steps we recently reviewed. The normal usecase for calling `gcc` like ``` bash $ gcc -o helloworld helloworld.c ``` does several steps internally that we need to separate: 1. Preprocessing: the resolution of preprocessor macros like `#define` and stuff like that. 2. Compiling to assembly: the generation of assembly code files per compilation unit (a file that is the output of the preprocessor). 3. Assembly: the conversion from an assembly file to ELF object file. 4. Linking: the executable or library generation from the ELF object files created in the previous step. The reality is there's more than one program involved here and `gcc` is just a coordinator that makes other programs run if needed. The preprocessor is called `cpp` and it is generated from the GCC codebase. The compiler is `gcc` itself but the assembler and the linker are generally obtained from GNU Binutils' `as` and `ld` respectively. So, one of the most important things to understand is GCC only generates assembly, but it looks like it doesn't[^tinycc-asm]. This means we need a proper support for our architecture on the assembler and the linker too. But we'll keep that story for another day[^post-long]. [^tinycc-asm]: Other compilers have different approaches for this. For example, TinyCC generates machine code directly, without the intermediate assembly file generation step, and is also able to link the files by itself. [^post-long]: This post is already long enough and we only made the introduction. --- ### Souce code parsing {#parsing} > HLL dependent. Generates appropriate IR So the first step of the compiler is to process the input text and convert it to the appropriate Intermediate Representation. The most used intermediate representation is GENERIC, that was designed for C but fits other procedural languages pretty well[^fortran]. [^fortran]: The case of FORTRAN is a little bit weird, as it generates its own representation that is later converted to GENERIC, we don't really care about this at this point. This parsing process not really relevant for us, as we want to add a new target, but it's interesting to note because it gives shape to the codebase. GCC splits the code for the different input languages in folders named like: `gcc/$LANGUAGE`. #### GENERIC {#generic} GENERIC is just a representation, we don't need to care that much about it but a word is not going to hurt anyone. GENERIC is a tree representation: a set of nodes with some extra common information. Those nodes can be read at `gcc/tree.def`. A simple example of this could be a function declaration, that would take the a node of type `FUNCTION_DECL` that has some sub-nodes: one for the return type, another for the body of the function and another for the arguments of the function. It's a simple AST you could come up with yourselves, except the fact that it is pretty complex. 😅 ### GIMPLE {#gimple} > HLL- and Target- independent representation The next step is called *Gimplification* (see `gimplify.c`), the process of converting to GIMPLE. Normally, representing the AST as GIMPLE is too complex to be done in one step, so the GENERIC (+ some extensions) is used as a previous step that is easier to create. GIMPLE is the central internal representation of GCC. It's target-independent and High-Level-Language-independent. At this point some optimizations can be applied, those related with the structure of the source code, like loop unfolding or dead code elimination. From the porting perspective, this representation is important, as it's the border line between the front-end and the back-end, and we are interested in the latter. A really interesting part is to understand is how is this converted to the next representation, RTL. ### Register Transfer Language (RTL) {#rtl} > Target-dependent low level representation The next part of the compiler work is done using the RTL intermediate representation. The RTL representation is based on LISP, so we have a reason to love it, and it serves two purposes: 1. Specify target properties via the Machine Descriptor files. These Machine Descriptor files are text files that look like LISP and are processed at compilation time. 2. Represent a compilation. Meaning that the RTL is also an intermediate representation, a low-level one, that represents sets of instructions. GCC does not make any distinction between the first and the second purpose, calling both RTL, but there some difference on the purpose and the shape of the RTL. RTL has both, an internal form represented by structures (case 2) and an external form represented as a text file (case 1). The RTL is formed by a set of objects: expression, integers, wide integers, strings or vectors. In the textual form they are represented like in LISP, using double quotes for strings, brackets for vectors... and a lot of parenthesis. The internal representation you can imagine, structures for expressions, integer types for integers, `char*` for strings, etc. The most interesting RTL objects are expressions, aka RTX, that are just a name (an expression code) plus the amount of possible arguments. This is how a piece of RTL may look, it represents an instruction that sets the register 0 to the result of the addition of the register 1 and the constant integer 10 (see `rtl.def` for more information): ``` lisp (set (reg 0) (plus (reg 1) (const_int 10))) ``` In the example the only things that are not expressions are the numbers (0, 1 and 10), all the rest you can find in `rtl.def` and see what they mean. From GIMPLE, there are two steps left to reach our target, assembly code, and both involve RTL. The first maps the GIMPLE nodes to pattern names in a target-independent way, generating a list of RTL `insn`s. The second matches those `insn` lists to RTL templates described in Machine Description files and uses those matches to generate the final assembly code. Those `insn`s are objects that represent code in RTL. Each function is described with a doubly-linked list of `insn`s. You can think about them as *instructions* in the RTL world. In the first step, the RTL `insn` generation step, only the names matter (and they are hardcoded in the compiler), while in the second the structure of the `insn` is going to be analyzed as we'll see later. #### Target-dependent code {#target-dependent} As we previously said, target-dependent steps are generated at compile time and then inserted in the final `gcc` executable. All this code is located in one folder per target, under `gcc/config/$TARGET`, so the CFG is able to load the target we choose at compile time (using `--target=`) and insert that in the final executable. That is done in different ways depending on the type of file we are working with: Machine Description files are processed by the programs (`gencodes`, `gentags`...) that generate C code files from them, while target description macros and functions, which are C files, are inserted in the building process as any other C file. I'd like to insist here on the fact that the `--target` is the only one to be processed and loaded and the other possible targets are going to be ignored. The build process is not going to complain if a target is broken or anything like that if it isn't the target we chose. It just doesn't care. #### Machine Description files {#md} Machine Description files (`.md` extension) let us define `insn` patterns, which are incomplete RTL expressions that can be matched against the `insn` list generated from the GIMPLE, `attributes` and other interesting things we may not try to decipher here. `define_insn` is a RTX we can use to define new `insn` patterns. It receives four or five operands: 1. An optional name. It's going to be used to match against GIMPLE. 2. An RTL template. A vector of *incomplete* RTL expressions which describe how should the instruction look like. *Incomplete* in this context means it uses expressions like `match_operand` or `match_operator` which are designed to match against the RTL `insn` list and see if they are compatible or not. 3. A condition. A final condition to say if the `insn` matches this pattern or not. 4. An output template. A string that contains the output assembly code for this `insn`. The string can contain special characters like `%` to define where should the arguments be inserted. If the output is very complex we can write C code on this field too. 5. An optional list of attributes. This is an actual example from the RISC-V code we are backporting: ``` lisp (define_insn "adddi3" [(set (match_operand:DI 0 "register_operand" "=r,r") (plus:DI (match_operand:DI 1 "register_operand" "r,r") (match_operand:DI 2 "arith_operand" "r,I")))] "TARGET_64BIT" "add\t%0,%1,%2" [(set_attr "type" "arith") (set_attr "mode" "DI")]) ``` You can see the name `adddi3` is something like: `add` + `di` + `3`. This means it's the `add` instruction with the `di` mode and `3` input arguments. That's the way things are named. Next block is a vector with the RTL template. If you try to ignore `match_operand` expressions you can see the template is not very different to the RTL example we gave before. In this case it's something like: ``` lisp (set (reg 0) (plus (reg 1) (reg 2))) ``` It's basically storing in the first register the result of the addition of the other two. The next field is the condition. In this case it needs to have `TARGET_64BIT` defined in order to work because the machine mode is `DI` (we'll explain that soon). The output code is simple, just a RISC-V `add` instruction: ``` asm add %0,%1,%2 ``` Where `%N` is going to be replaced by the register numbers used as arguments for this instruction. The last field are the attributes, which can be used to define the instruction size and other kind of things. We are not going to focus on them today. ##### Machine modes {#mm} Machine modes are a way to describe the size of a data object and its representation. - QI: quarter integer - HI: half integer - SI: single integer - DI: double integer - SF: single floating - DF: double floating And so on. The standard `insn` names include machine modes to describe what kind of instruction they are. The example above is `addddi3`, meaning it uses `di` machine mode: double integer. That's why it needs the target to be a 64 bit RISC-V machine. Machine modes also appear in some RTL expressions like `plus` or `match_operand` meaning that they operate in that machine mode, that is, with that data size and representation. For example `(plus:SI ...)`. ##### RTL Templates {#rtl-templates} `match_*` expressions are what make RTL expressions *incomplete*, because they are designed to be compared against the `insn` list that comes from the previous step. In the example above we had: ``` lisp (set (match_operand:DI 0 "register_operand" "=r,r") (plus:DI (match_operand:DI 1 "register_operand" "r,r") (match_operand:DI 2 "arith_operand" "r,I"))) ``` `(match_operand N predicate constraint)` is a placeholder for an operand number `N` of the `insn`. When the `insn` is constructed, the `match_operand` will be replaced by the corresponding operand of the `insn`. When the template is trying to match an `insn` the `match_operand` forces the operand number `N` to match the `predicate` in order to make the `insn` match the template. The `match_*` expressions are what defines how `insn`s should be. The `predicate` is a function name to be called. The function receives two input arguments: an expression and a machine mode. If the function returns `0` the function does not match. `predicate`s can also be combined in Machine Description files like this: ``` lisp (define_predicate "arith_operand" (ior (match_operand 0 "const_arith_operand") (match_operand 0 "register_operand"))) ``` So the `arith_operand` shown in the example above can be a `const_arith_operand` *or* (that's what `ior` means) a `register_operand`. They can be more complex but this is more than enough to understand how they are built. In the end, they always check against C functions, but you can combine them with the convenience of the Machine Description files. The `constraint` allows to fine-tune the matching. They define if the argument is in a register or the memory and stuff like that. `r`, for example, means the operand comes from a register. There are other matching expressions too, but `match_operand` is the most used one and it's the one that explains this concept of *incomplete* expressions the best. #### Target description macros and functions {#target-desc} Apart from the machine descriptor files, there are other files involved. For example, the constraints defined above need to be defined in code somewhere. The most important of these are the target description macros and functions, normally defined in `gcc/config/$TARGET/$TARGET?(.h|.c)`. The `.c` should initialize the `targetm` variable, which contains all the machine information relevant to the compiler. It is initialized like this: ``` c struct gcc_target targetm = TARGET_INITIALIZER; ``` That `TARGET_INITIALIZER` is a huge macro, defined in `gcc/target.h`, that initializes the `targetm` structure. This macro is split in smaller macros with reasonable defaults that may be overwritten by pieces. Each target should have a file that includes both `target.h` and `target-def.h` and overwrites any inappropriate default by redefining new macros and ends with the initialization line we just introduced. This is normally done in `gcc/config/$TARGET/$TARGET.c`, while the `.h` is normally used to define some macros that are needed in the `.c` file. As a reference, the RISC-V code we need to backport (see `gcc/config/riscv/riscv.c`) uses the file to introduce the amount of registers, the type, the size, and that kind of things, and many others. All this information contained in `targetm` is used by the compiler to decide how registers have to be allocated, which ones have preference, the cost of them, and many other things. ### Assembly code generation {#assembly} Having the previous step clear is enough to understand how does the assembly generation work. Each of the `insn`s in the list obtained from GIMPLE is going to be compared against the RTL templates and the best match is going to be chosen. Once the match is chosen, the corresponding assembly is going to be generated from the corresponding field of the `define_insn` RTL expression. As simple as that, but also that complex. Why do I say it's complex? Because many things have to be considered and GCC does consider them. Each instruction has a size, that has to be considered to calculate addresses, but also they have some execution time associated and GCC calculates the best matches to make the final assembly file as optimum as possible. The RTL step has a lot of optimization passes, too. It's a complex step but it's not really important for us because we just need to make a temporary compiler that lets us compile a better one. It doesn't really matter if it's not perfect, at least at this point. ### Summary {#summary} So, in summary, the process is the following: 1. The HLL language is parsed to a tree, normally GENERIC. 2. GENERIC is converted to GIMPLE. 3. GIMPLE optimizations are applied. 4. GIMPLE is matched to a `insn` list using pattern names. 6. The `insn` list is matched against the RTL templates defined in the Machine Description files. 7. RTL optimizations are applied. 8. The matches convert the RTL to assembly code also taking in account the information obtained from the target definition macros and functions. From our perspective, the most important things to remember are these: - The front-end is not very relevant for us, from the parsing to GIMPLE we can ignore for the moment. - RTL step is pretty complex, and the GIMPLE->RTL conversion is too. - GCC is a compiler collection that has a very powerful compilation process, the Compiler Generator Framework (CFG), in order to modularize the code and make it easier to port. - The machine description files and the target definition macros and functions are designed to make the porting process simpler. Those are the only files we need to touch. ---