summaryrefslogtreecommitdiff
path: root/content/bootstrapGcc/01_internals.md
diff options
context:
space:
mode:
Diffstat (limited to 'content/bootstrapGcc/01_internals.md')
-rw-r--r--content/bootstrapGcc/01_internals.md621
1 files changed, 621 insertions, 0 deletions
diff --git a/content/bootstrapGcc/01_internals.md b/content/bootstrapGcc/01_internals.md
new file mode 100644
index 0000000..a6009e1
--- /dev/null
+++ b/content/bootstrapGcc/01_internals.md
@@ -0,0 +1,621 @@
+Title: GCC internals — From a porting perspective
+Date: 2022-03-08
+Category:
+Tags: Bootstrapping GCC in RISC-V
+Slug: bootstrapGcc1
+Lang: en
+Summary:
+ Deep diving into GCC's internals from the perspective of someone who
+ wants to port GCC for a new architecture.
+
+
+In the [previous post]({filename}00_intro.md) of the
+[series]({tag}Bootstrapping GCC in RISC-V) the problem of the GCC bootstrapping
+was introduced. In this post we'll describe how GCC works, from the
+perspective of someone who wants to port it so we understand what's the job we
+have to do.
+
+1. [Disclaimer](#disclaimer)
+2. [Overview](#intro)
+ 1. [The compiler generation framework](#cgf)
+ 2. [GCC as a coordinator](#gcc-coordinator)
+3. [Source code parsing](#parsing)
+ 1. [GENERIC](#generic)
+4. [GIMPLE](#gimple)
+5. [Register Transfer Language](#rtl)
+ 1. [Target-dependent code](#target-dependent)
+ 2. [Machine description files](#md)
+ 1. [Machine modes](#mm)
+ 2. [RTL Templates](#rtl-templates)
+ 3. [Target description macros and functions](#target-desc)
+6. [Assembly code generation](#assembly)
+7. [Summary](#summary)
+8. [My job in the backport](#job)
+9. [Last words](#last)
+10. [Learn more](#more)
+
+### Disclaimer {#disclaimer}
+
+- This post may be only valid for old GCC versions, like 4.something, because
+ that's the one I'm interested in. More recent versions may have different
+ details, but I don't expect them to be very different to what is described
+ here. More specifically: I'm working on GCC 4.6.4, and the first GCC with
+ RISC-V support is GCC 7.0.0.
+- This post will focus on how GCC compiles C programs because that's the part
+ we care about. Some other languages have differences on how they are treated
+ but that's not very relevant for us, as it has no implications on the
+ *back-end*.
+
+Both of these points will get clearer later.
+
+### Overview {#intro}
+
+GCC is structured as a pipeline of several steps that run one after the other.
+
+1. Source code parsing
+2. GIMPLE IR generation (target-independent)
+3. Some GIMPLE tree optimizations
+4. RTL IR generation (target-dependent)
+5. RTL optimizer
+6. Assembly code generator
+
+Before starting to analyze each of the steps independently there are a couple
+of things to clarify.
+
+#### The Compiler Generation Framework {#cfg}
+
+An important point to note is GCC is a **compiler collection** meaning that it
+is able to compile code from many high level languages (HLL) and for many
+different targets. This has implications on how some steps are mapped to GCC's
+source code.
+
+The most important thing of all this is to differentiate between GCC's code and
+an actual `gcc` executable. The key point here is that GCC's codebase includes
+what is called CGF (Compiler Generator Framework) that can generate `gcc`
+executables from GCC's code. The CGF generates `gcc` executables according to
+the input (target machine, host machine...) we give it, but the generated `gcc`
+executables may differ one from another even if they were generated from the
+same codebase.
+
+Any `gcc` executable is able to compile any input HLL[^language] (C, C++,
+Objective-C, Ada, Fortran and Go[^java]), so GCC's code must include parsers
+for each of these languages.
+
+[^language]: When calling `gcc` you can choose which language you are compiling
+ using `-x language` option or you can let `gcc` guess from the extension.
+
+[^java]: And also Java in the past!
+
+On the other hand, `gcc` executables are only able to generate code for one
+target (x86, MIPS, ARM, *RISC-V*...), that must be chosen when GCC is compiled.
+In order to make the porting efforts easier, GCC has a set of tools that
+generate the target-dependent code from some configuration files called Machine
+Descriptions (MD).
+
+Putting all this together, source code parsing and AST generation depend on the
+input HLL, and the code that runs for each HLL is **selected** when `gcc` runs
+(*steps 1*). The intermediate representation, GIMPLE, is target-independent so
+everything related with that is **copied** inside the final `gcc` executable
+(*steps 2 and 3*). The RTL (Register Transfer Language) representation and
+assembly code generation are target-dependent and the code related to that is
+**generated** from MD files when GCC is compiled (*steps 4, 5 and
+6*)[^rtl-opt].
+
+[^rtl-opt]: The RTL optimizer contains many steps, most of them being target
+ independent. That doesn't really matter here, but those are not generated but
+ copied from GCC's source, as GIMPLE is.
+
+This all means if we want to be able to read the source code of GCC we have to
+have clear in mind how the source code maps to the actual executable, i.e. if
+we generate a `gcc` executable for **x86** it won't contain the code for other
+architectures and **it won't even check if it was correctly programmed**,
+because it's not going to compile it.
+
+
+#### GCC as a coordinator {#gcc-coordinator}
+
+Many GCC users or C programmers (or me, not that long ago) might think there is
+something missing on the list of steps we recently reviewed. The normal usecase
+for calling `gcc` like
+
+``` bash
+$ gcc -o helloworld helloworld.c
+```
+
+does several steps internally that we need to separate:
+
+1. Preprocessing: the resolution of preprocessor macros like `#define` and
+ stuff like that.
+2. Compiling to assembly: the generation of assembly code files per compilation
+ unit (a file that is the output of the preprocessor).
+3. Assembly: the conversion from an assembly file to ELF object file.
+4. Linking: the executable or library generation from the ELF object files
+ created in the previous step.
+
+The reality is there's more than one program involved here and `gcc` is just a
+coordinator that makes other programs run if needed.
+
+The preprocessor is called `cpp` and it is generated from the GCC codebase. The
+compiler is `gcc` itself but the assembler and the linker are generally
+obtained from GNU Binutils' `as` and `ld` respectively.
+
+So, one of the most important things to understand is GCC only generates
+assembly, but it looks like it doesn't[^tinycc-asm].
+
+This means we need a proper support for our architecture on the assembler and
+the linker too. But we'll keep that story for another day[^post-long].
+
+[^tinycc-asm]: Other compilers have different approaches for this. For example,
+ TinyCC generates machine code directly, without the intermediate assembly
+ file generation step, and is also able to link the files by itself.
+
+[^post-long]: This post is already long enough and we only made the
+ introduction.
+
+---
+
+### Souce code parsing {#parsing}
+
+> HLL dependent. Generates appropriate IR
+
+So the first step of the compiler is to process the input text and convert it
+to the appropriate Intermediate Representation. The most used intermediate
+representation is GENERIC, that was designed for C but fits other procedural
+languages pretty well[^fortran].
+
+[^fortran]: The case of FORTRAN is a little bit weird, as it generates its own
+ representation that is later converted to GENERIC, we don't really care about
+ this at this point.
+
+This parsing process not really relevant for us, as we want to add a new
+target, but it's interesting to note because it gives shape to the codebase.
+GCC splits the code for the different input languages in folders named like:
+`gcc/$LANGUAGE`.
+
+#### GENERIC {#generic}
+
+GENERIC is just a representation, we don't need to care that much about it but
+a word is not going to hurt anyone. GENERIC is a tree representation: a set of
+nodes with some extra common information. Those nodes can be read at
+`gcc/tree.def`.
+
+A simple example of this could be a function declaration, that would take the
+a node of type `FUNCTION_DECL` that has some sub-nodes: one for the return
+type, another for the body of the function and another for the arguments of the
+function.
+
+It's a simple AST you could come up with yourselves, except the fact that it is
+pretty complex. 😅
+
+### GIMPLE {#gimple}
+
+> HLL- and Target- independent representation
+
+The next step is called *Gimplification* (see `gimplify.c`), the process of
+converting to GIMPLE. Normally, representing the AST as GIMPLE is too complex
+to be done in one step, so the GENERIC (+ some extensions) is used as a
+previous step that is easier to create.
+
+GIMPLE is the central internal representation of GCC. It's target-independent
+and High-Level-Language-independent. At this point some optimizations can be
+applied, those related with the structure of the source code, like loop
+unfolding or dead code elimination.
+
+From the porting perspective, this representation is important, as it's the
+border line between the front-end and the back-end, and we are interested in
+the latter. A really interesting part is to understand is how is this converted
+to the next representation, RTL.
+
+### Register Transfer Language (RTL) {#rtl}
+
+> Target-dependent low level representation
+
+The next part of the compiler work is done using the RTL intermediate
+representation. The RTL representation is based on LISP, so we have a reason to
+love it, and it serves two purposes:
+
+1. Specify target properties via the Machine Descriptor files. These Machine
+ Descriptor files are text files that look like LISP and are processed at
+ compilation time.
+2. Represent a compilation. Meaning that the RTL is also an intermediate
+ representation, a low-level one, that represents sets of instructions.
+
+GCC does not make any distinction between the first and the second purpose,
+calling both RTL, but there some difference on the purpose and the shape of the
+RTL. RTL has both, an internal form represented by structures (case 2) and an
+external form represented as a text file (case 1).
+
+The RTL is formed by a set of objects: expression, integers, wide integers,
+strings or vectors. In the textual form they are represented like in LISP,
+using double quotes for strings, brackets for vectors... and a lot of
+parenthesis. The internal representation you can imagine, structures for
+expressions, integer types for integers, `char*` for strings, etc.
+
+The most interesting RTL objects are expressions, aka RTX, that are just a name
+(an expression code) plus the amount of possible arguments.
+
+This is how a piece of RTL may look, it represents an instruction that sets the
+register 0 to the result of the addition of the register 1 and the constant
+integer 10 (see `rtl.def` for more information):
+
+``` lisp
+(set (reg 0)
+ (plus (reg 1)
+ (const_int 10)))
+```
+
+In the example the only things that are not expressions are the numbers (0, 1
+and 10), all the rest you can find in `rtl.def` and see what they mean.
+
+From GIMPLE, there are two steps left to reach our target, assembly code, and
+both involve RTL. The first maps the GIMPLE nodes to pattern names in a
+target-independent way, generating a list of RTL `insn`s. The second matches
+those `insn` lists to RTL templates described in Machine Description files and
+uses those matches to generate the final assembly code.
+
+Those `insn`s are objects that represent code in RTL. Each function is
+described with a doubly-linked list of `insn`s. You can think about them as
+*instructions* in the RTL world.
+
+In the first step, the RTL `insn` generation step, only the names matter (and
+they are hardcoded in the compiler), while in the second the structure of the
+`insn` is going to be analyzed as we'll see later.
+
+#### Target-dependent code {#target-dependent}
+
+As we previously said, target-dependent steps are generated at compile time and
+then inserted in the final `gcc` executable. All this code is located in one
+folder per target, under `gcc/config/$TARGET`, so the CFG is able to load the
+target we choose at compile time (using `--target=`) and insert that in the
+final executable.
+
+That is done in different ways depending on the type of file we are working
+with: Machine Description files are processed by the programs (`gencodes`,
+`gentags`...) that generate C code files from them, while target description
+macros and functions, which are C files, are inserted in the building process
+as any other C file.
+
+I'd like to insist here on the fact that the `--target` is the only one to be
+processed and loaded and the other possible targets are going to be ignored.
+The build process is not going to complain if a target is broken or anything
+like that if it isn't the target we chose. It just doesn't care.
+
+#### Machine Description files {#md}
+
+Machine Description files (`.md` extension) let us define `insn` patterns,
+which are incomplete RTL expressions that can be matched against the `insn`
+list generated from the GIMPLE, `attributes` and other interesting things we
+may not try to decipher here.
+
+`define_insn` is a RTX we can use to define new `insn` patterns. It receives
+four or five operands:
+
+1. An optional name. It's going to be used to match against GIMPLE.
+2. An RTL template. A vector of *incomplete* RTL expressions which describe how
+ should the instruction look like. *Incomplete* in this context means it uses
+ expressions like `match_operand` or `match_operator` which are designed to
+ match against the RTL `insn` list and see if they are compatible or not.
+3. A condition. A final condition to say if the `insn` matches this pattern or
+ not.
+4. An output template. A string that contains the output assembly code for this
+ `insn`. The string can contain special characters like `%` to define where
+ should the arguments be inserted. If the output is very complex we can write
+ C code on this field too.
+5. An optional list of attributes.
+
+This is an actual example from the RISC-V code we are backporting:
+
+``` lisp
+(define_insn "adddi3"
+ [(set (match_operand:DI 0 "register_operand" "=r,r")
+ (plus:DI (match_operand:DI 1 "register_operand" "r,r")
+ (match_operand:DI 2 "arith_operand" "r,I")))]
+ "TARGET_64BIT"
+ "add\t%0,%1,%2"
+ [(set_attr "type" "arith")
+ (set_attr "mode" "DI")])
+```
+
+You can see the name `adddi3` is something like: `add` + `di` + `3`. This means
+it's the `add` instruction with the `di` mode and `3` input arguments. That's
+the way things are named.
+
+Next block is a vector with the RTL template. If you try to ignore
+`match_operand` expressions you can see the template is not very different to
+the RTL example we gave before. In this case it's something like:
+
+``` lisp
+(set (reg 0)
+ (plus (reg 1)
+ (reg 2)))
+```
+
+It's basically storing in the first register the result of the addition of the
+other two.
+
+The next field is the condition. In this case it needs to have `TARGET_64BIT`
+defined in order to work because the machine mode is `DI` (we'll explain that
+soon).
+
+The output code is simple, just a RISC-V `add` instruction:
+
+``` asm
+add %0,%1,%2
+```
+
+Where `%N` is going to be replaced by the register numbers used as arguments
+for this instruction.
+
+The last field are the attributes, which can be used to define the instruction
+size and other kind of things. We are not going to focus on them today.
+
+##### Machine modes {#mm}
+
+Machine modes are a way to describe the size of a data object and its
+representation.
+
+- QI: quarter integer
+- HI: half integer
+- SI: single integer
+- DI: double integer
+- SF: single floating
+- DF: double floating
+
+And so on.
+
+The standard `insn` names include machine modes to describe what kind of
+instruction they are. The example above is `addddi3`, meaning it uses `di`
+machine mode: double integer. That's why it needs the target to be a 64 bit
+RISC-V machine.
+
+Machine modes also appear in some RTL expressions like `plus` or
+`match_operand` meaning that they operate in that machine mode, that is, with
+that data size and representation. For example `(plus:SI ...)`.
+
+##### RTL Templates {#rtl-templates}
+
+`match_*` expressions are what make RTL expressions *incomplete*, because they
+are designed to be compared against the `insn` list that comes from the
+previous step.
+
+In the example above we had:
+
+``` lisp
+(set (match_operand:DI 0 "register_operand" "=r,r")
+ (plus:DI (match_operand:DI 1 "register_operand" "r,r")
+ (match_operand:DI 2 "arith_operand" "r,I")))
+```
+
+`(match_operand N predicate constraint)` is a placeholder for an operand number
+`N` of the `insn`. When the `insn` is constructed, the `match_operand` will be
+replaced by the corresponding operand of the `insn`. When the template is
+trying to match an `insn` the `match_operand` forces the operand number `N` to
+match the `predicate` in order to make the `insn` match the template.
+The `match_*` expressions are what defines how `insn`s should be.
+
+The `predicate` is a function name to be called. The function receives two
+input arguments: an expression and a machine mode. If the function returns `0`
+the function does not match.
+
+`predicate`s can also be combined in Machine Description files like this:
+
+``` lisp
+(define_predicate "arith_operand"
+ (ior (match_operand 0 "const_arith_operand")
+ (match_operand 0 "register_operand")))
+```
+
+So the `arith_operand` shown in the example above can be a
+`const_arith_operand` *or* (that's what `ior` means) a `register_operand`.
+They can be more complex but this is more than enough to understand how they
+are built. In the end, they always check against C functions, but you can
+combine them with the convenience of the Machine Description files.
+
+The `constraint` allows to fine-tune the matching. They define if the argument
+is in a register or the memory and stuff like that. `r`, for example, means the
+operand comes from a register.
+
+
+
+There are other matching expressions too, but `match_operand` is the most used
+one and it's the one that explains this concept of *incomplete* expressions the
+best.
+
+
+
+#### Target description macros and functions {#target-desc}
+
+Apart from the machine descriptor files, there are other files involved. For
+example, the constraints defined above need to be defined in code somewhere.
+
+The most important of these are the target description macros and functions,
+normally defined in `gcc/config/$TARGET/$TARGET?(.h|.c)`. The `.c` should
+initialize the `targetm` variable, which contains all the machine information
+relevant to the compiler. It is initialized like this:
+
+``` c
+struct gcc_target targetm = TARGET_INITIALIZER;
+```
+
+That `TARGET_INITIALIZER` is a huge macro, defined in `gcc/target.h`, that
+initializes the `targetm` structure. This macro is split in smaller macros with
+reasonable defaults that may be overwritten by pieces. Each target should have
+a file that includes both `target.h` and `target-def.h` and overwrites any
+inappropriate default by redefining new macros and ends with the initialization
+line we just introduced. This is normally done in
+`gcc/config/$TARGET/$TARGET.c`, while the `.h` is normally used to define some
+macros that are needed in the `.c` file.
+
+As a reference, the RISC-V code we need to backport (see
+`gcc/config/riscv/riscv.c`) uses the file to introduce the amount of registers,
+the type, the size, and that kind of things, and many others.
+
+All this information contained in `targetm` is used by the compiler to decide
+how registers have to be allocated, which ones have preference, the cost of
+them, and many other things.
+
+### Assembly code generation {#assembly}
+
+Having the previous step clear is enough to understand how does the assembly
+generation work. Each of the `insn`s in the list obtained from GIMPLE is going
+to be compared against the RTL templates and the best match is going to be
+chosen. Once the match is chosen, the corresponding assembly is going to be
+generated from the corresponding field of the `define_insn` RTL expression.
+
+As simple as that, but also that complex.
+
+Why do I say it's complex? Because many things have to be considered and GCC
+does consider them. Each instruction has a size, that has to be considered to
+calculate addresses, but also they have some execution time associated and GCC
+calculates the best matches to make the final assembly file as optimum as
+possible.
+
+The RTL step has a lot of optimization passes, too. It's a complex step but
+it's not really important for us because we just need to make a temporary
+compiler that lets us compile a better one. It doesn't really matter if it's
+not perfect, at least at this point.
+
+
+### Summary {#summary}
+
+So, in summary, the process is the following:
+
+1. The HLL language is parsed to a tree, normally GENERIC.
+2. GENERIC is converted to GIMPLE.
+3. GIMPLE optimizations are applied.
+4. GIMPLE is matched to a `insn` list using pattern names.
+6. The `insn` list is matched against the RTL templates defined in the Machine
+ Description files.
+7. RTL optimizations are applied.
+8. The matches convert the RTL to assembly code also taking in account the
+ information obtained from the target definition macros and functions.
+
+From our perspective, the most important things to remember are these:
+
+- The front-end is not very relevant for us, from the parsing to GIMPLE we can
+ ignore for the moment.
+- RTL step is pretty complex, and the GIMPLE->RTL conversion is too.
+- GCC is a compiler collection that has a very powerful compilation process,
+ the Compiler Generator Framework (CFG), in order to modularize the code and
+ make it easier to port.
+- The machine description files and the target definition macros and functions
+ are designed to make the porting process simpler. Those are the only files we
+ need to touch.
+
+---
+
+<div style="
+ text-align: center;
+ font-size: smaller;
+ padding-left: 3em;
+ padding-right: 3em;
+ padding-top: 1em;
+ padding-bottom: 1em;
+ border-top: 1px solid var(--border-color);
+ border-bottom: 1px solid var(--border-color)">
+If you like my job here, consider hiring <a href="https://elenq.tech">ElenQ
+Technology</a>. <br> Even if I'm busy with this, I still have some time slots
+available.
+</div>
+
+---
+
+
+### My job in the backport {#job}
+
+With all the process more or less clear, we can be more specific on the job I
+need to do. I share some specifics in this section, so if you like reading code
+you are going to have some fun[^examples].
+
+[^examples]: I'll link to some examples of the code on RISC-V's GitHub account.
+ This code is already merged in GCC.
+
+First, I need to make sure all the used RTL expressions are compatible with the
+old version of the compiler. If they are not, I have to translate them to the
+old way to make them. Some examples of this are iterators like
+[`(define_int_iterator ...)`][int-iterator] which is not available in old GCC
+versions, so I need to unfold a couple of loops by hand and make them only use
+the old constructs.
+
+[int-iterator]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.md?plain=1#L342
+
+Second, I need to convert the target description macros and functions to the
+old internal C-based API instead of the more [modern C++-based one][c++api], as
+the recent port uses. These changes involve many layers and I didn't yet
+analyze this in detail. They can be simple like converting from `rtx_insn`
+class to `rtx`, the older way to do this. But they can also be complex, like
+removing the 40% of the `#include` directives from `riscv.c`, which has many
+that were not available in the past. It's going to be a lot of fun I predict.
+
+[c++api]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.c
+
+Third, as this whole compilation process is complex, I decided to make it as
+accessible as possible, so other people can audit and replicate my work. For
+that I'm using Guix, my package manager of choice. I added a [`guix.scm`][guix]
+and [`channels.scm`][channels] file to the repository so my work can be
+replicated precisely by myself in the future, or by others[^github].
+
+The Guix package also provides a better interaction with the building process
+of GCC, letting us replace inputs in a very simple way. I'm thinking about the
+next steps of the project here, when we need to compile my backported compiler
+with TinyCC, test if it works and then patch TinyCC until it does. Having the
+`guix.scm` file makes it easy to replace the current compiler with a patched
+TinyCC and ensures nothing is interfering in the compilation process because
+the compilation is done in an isolated container.
+
+[^github]: I'm hosting this on GitHub at the moment because the repository is
+ huge. I'll probably move all this to my server and edit the post after that.
+
+[guix]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/guix.scm
+[channels]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/channels.scm
+
+That's mostly the job I need to do in the backport.
+
+Something to keep in mind is that we don't need to make it perfect, we just
+need it to work. The backported GCC is not going to be used as a production
+compiler, but just as a bridge with the next GCC version, so **there's only one
+program it needs to be able to compile correctly: GCC 7**. Once we make the
+bridge with the next version, we can use that to compile anything we want.
+
+
+### Last words
+
+I know this post is long, and the lack proper diagrams make everything a little
+bit hard to understand. That's exactly how I felt reading about GCC, but the
+difference was I had to read some documentation that is... About 100 times
+longer than this post (see [Learn more](#more) below). Not that bad after all.
+
+There are many things I decided to leave out, like peephole optimizations,
+instruction attributes, and some other constructs that are not that important
+from my perspective. You may want to make your research on those on your own.
+
+In any case, if you have any question you can always contact me[^contact] and
+ask me any questions you have or send me some words of support.
+
+In the next post I'll describe a little bit about ELF, the executable and
+linkable format, just the bare minimum to understand the format, as it will be
+relevant for us in the future. And you might be thinking, why is it relevant if
+GCC compiles to assembly? Well, that's one of the questions that we will be
+answering in the next post.
+
+Now I leave you with a couple of interesting links on the next section.
+
+Good luck with your ports!
+
+
+[^contact]: You can find my contact info in the [About
+ page](/pages/about.html).
+
+
+
+### Learn more {#more}
+
+- [The GCC internals documentation](https://gcc.gnu.org/onlinedocs/gccint/): if
+ you are interested on my work you should read an older version of the
+ documentation. See [Disclaimer](#disclaimer).
+- [The GCC source code](https://gcc.gnu.org/git.html): of course, this has
+ everything you need to understand GCC, but the problem is that GCC is a huge
+ codebase, you probably need to spend months reading it in order to understand
+ everything. That's why I think posts like this one are interesting, they help
+ you focus on the parts you are interested in.
+