Title: GCC internals — From a porting perspective
Date: 2022-03-08
Category:
Tags: Bootstrapping GCC in RISC-V
Slug: bootstrapGcc1
Lang: en
Summary:
    Deep diving into GCC's internals from the perspective of someone who
    wants to port GCC for a new architecture.


In the [previous post]({filename}00_intro.md) of the
[series]({tag}Bootstrapping GCC in RISC-V) the problem of the GCC bootstrapping
was introduced. In this post we'll describe how GCC works, from the
perspective of someone who wants to port it so we understand what's the job we
have to do.

1. [Disclaimer](#disclaimer)
2. [Overview](#intro)
    1. [The compiler generation framework](#cgf)
    2. [GCC as a coordinator](#gcc-coordinator)
3. [Source code parsing](#parsing)
    1. [GENERIC](#generic)
4. [GIMPLE](#gimple)
5. [Register Transfer Language](#rtl)
    1. [Target-dependent code](#target-dependent)
    2. [Machine description files](#md)
        1. [Machine modes](#mm)
        2. [RTL Templates](#rtl-templates)
    3. [Target description macros and functions](#target-desc)
6. [Assembly code generation](#assembly)
7. [Summary](#summary)
8. [My job in the backport](#job)
9. [Last words](#last)
10. [Learn more](#more)

### Disclaimer {#disclaimer}

- This post may be only valid for old GCC versions, like 4.something, because
  that's the one I'm interested in. More recent versions may have different
  details, but I don't expect them to be very different to what is described
  here. More specifically: I'm working on GCC 4.6.4, and the first GCC with
  RISC-V support is GCC 7.0.0.
- This post will focus on how GCC compiles C programs because that's the part
  we care about. Some other languages have differences on how they are treated
  but that's not very relevant for us, as it has no implications on the
  *back-end*.

Both of these points will get clearer later.

### Overview {#intro}

GCC is structured as a pipeline of several steps that run one after the other.

1. Source code parsing
2. GIMPLE IR generation (target-independent)
3. Some GIMPLE tree optimizations
4. RTL IR generation (target-dependent)
5. RTL optimizer
6. Assembly code generator

Before starting to analyze each of the steps independently there are a couple
of things to clarify.

#### The Compiler Generation Framework {#cfg}

An important point to note is GCC is a **compiler collection** meaning that it
is able to compile code from many high level languages (HLL) and for many
different targets. This has implications on how some steps are mapped to GCC's
source code.

The most important thing of all this is to differentiate between GCC's code and
an actual `gcc` executable. The key point here is that GCC's codebase includes
what is called CGF (Compiler Generator Framework) that can generate `gcc`
executables from GCC's code. The CGF generates `gcc` executables according to
the input (target machine, host machine...) we give it, but the generated `gcc`
executables may differ one from another even if they were generated from the
same codebase.

Any `gcc` executable is able to compile any input HLL[^language] (C, C++,
Objective-C, Ada, Fortran and Go[^java]), so GCC's code must include parsers
for each of these languages.

[^language]: When calling `gcc` you can choose which language you are compiling
  using `-x language` option or you can let `gcc` guess from the extension.

[^java]: And also Java in the past!

On the other hand, `gcc` executables are only able to generate code for one
target (x86, MIPS, ARM, *RISC-V*...), that must be chosen when GCC is compiled.
In order to make the porting efforts easier, GCC has a set of tools that
generate the target-dependent code from some configuration files called Machine
Descriptions (MD).

Putting all this together, source code parsing and AST generation depend on the
input HLL, and the code that runs for each HLL is **selected** when `gcc` runs
(*steps 1*). The intermediate representation, GIMPLE, is target-independent so
everything related with that is **copied** inside the final `gcc` executable
(*steps 2 and 3*). The RTL (Register Transfer Language) representation and
assembly code generation are target-dependent and the code related to that is
**generated** from MD files when GCC is compiled (*steps 4, 5 and
6*)[^rtl-opt].

[^rtl-opt]: The RTL optimizer contains many steps, most of them being target
  independent. That doesn't really matter here, but those are not generated but
  copied from GCC's source, as GIMPLE is.

This all means if we want to be able to read the source code of GCC we have to
have clear in mind how the source code maps to the actual executable, i.e. if
we generate a `gcc` executable for **x86** it won't contain the code for other
architectures and **it won't even check if it was correctly programmed**,
because it's not going to compile it.


#### GCC as a coordinator {#gcc-coordinator}

Many GCC users or C programmers (or me, not that long ago) might think there is
something missing on the list of steps we recently reviewed. The normal usecase
for calling `gcc` like

``` bash
$ gcc -o helloworld helloworld.c
```

does several steps internally that we need to separate:

1. Preprocessing: the resolution of preprocessor macros like `#define` and
   stuff like that.
2. Compiling to assembly: the generation of assembly code files per compilation
   unit (a file that is the output of the preprocessor).
3. Assembly: the conversion from an assembly file to ELF object file.
4. Linking: the executable or library generation from the ELF object files
   created in the previous step.

The reality is there's more than one program involved here and `gcc` is just a
coordinator that makes other programs run if needed.

The preprocessor is called `cpp` and it is generated from the GCC codebase. The
compiler is `gcc` itself but the assembler and the linker are generally
obtained from GNU Binutils' `as` and `ld` respectively.

So, one of the most important things to understand is GCC only generates
assembly, but it looks like it doesn't[^tinycc-asm].

This means we need a proper support for our architecture on the assembler and
the linker too. But we'll keep that story for another day[^post-long].

[^tinycc-asm]: Other compilers have different approaches for this. For example,
  TinyCC generates machine code directly, without the intermediate assembly
  file generation step, and is also able to link the files by itself.

[^post-long]: This post is already long enough and we only made the
  introduction.

---

### Souce code parsing {#parsing}

> HLL dependent. Generates appropriate IR

So the first step of the compiler is to process the input text and convert it
to the appropriate Intermediate Representation. The most used intermediate
representation is GENERIC, that was designed for C but fits other procedural
languages pretty well[^fortran].

[^fortran]: The case of FORTRAN is a little bit weird, as it generates its own
  representation that is later converted to GENERIC, we don't really care about
  this at this point.

This parsing process not really relevant for us, as we want to add a new
target, but it's interesting to note because it gives shape to the codebase.
GCC splits the code for the different input languages in folders named like:
`gcc/$LANGUAGE`.

#### GENERIC {#generic}

GENERIC is just a representation, we don't need to care that much about it but
a word is not going to hurt anyone. GENERIC is a tree representation: a set of
nodes with some extra common information. Those nodes can be read at
`gcc/tree.def`.

A simple example of this could be a function declaration, that would take the
a node of type `FUNCTION_DECL` that has some sub-nodes: one for the return
type, another for the body of the function and another for the arguments of the
function.

It's a simple AST you could come up with yourselves, except the fact that it is
pretty complex. 😅

### GIMPLE {#gimple}

> HLL- and Target- independent representation

The next step is called *Gimplification* (see `gimplify.c`), the process of
converting to GIMPLE.  Normally, representing the AST as GIMPLE is too complex
to be done in one step, so the GENERIC (+ some extensions) is used as a
previous step that is easier to create.

GIMPLE is the central internal representation of GCC. It's target-independent
and High-Level-Language-independent. At this point some optimizations can be
applied, those related with the structure of the source code, like loop
unfolding or dead code elimination.

From the porting perspective, this representation is important, as it's the
border line between the front-end and the back-end, and we are interested in
the latter. A really interesting part is to understand is how is this converted
to the next representation, RTL.

### Register Transfer Language (RTL) {#rtl}

> Target-dependent low level representation

The next part of the compiler work is done using the RTL intermediate
representation. The RTL representation is based on LISP, so we have a reason to
love it, and it serves two purposes:

1. Specify target properties via the Machine Descriptor files. These Machine
   Descriptor files are text files that look like LISP and are processed at
   compilation time.
2. Represent a compilation. Meaning that the RTL is also an intermediate
   representation, a low-level one, that represents sets of instructions.

GCC does not make any distinction between the first and the second purpose,
calling both RTL, but there some difference on the purpose and the shape of the
RTL. RTL has both, an internal form represented by structures (case 2) and an
external form represented as a text file (case 1).

The RTL is formed by a set of objects: expression, integers, wide integers,
strings or vectors. In the textual form they are represented like in LISP,
using double quotes for strings, brackets for vectors... and a lot of
parenthesis. The internal representation you can imagine, structures for
expressions, integer types for integers, `char*` for strings, etc.

The most interesting RTL objects are expressions, aka RTX, that are just a name
(an expression code) plus the amount of possible arguments.

This is how a piece of RTL may look, it represents an instruction that sets the
register 0 to the result of the addition of the register 1 and the constant
integer 10 (see `rtl.def` for more information):

``` lisp
(set (reg  0)
     (plus (reg 1)
           (const_int 10)))
```

In the example the only things that are not expressions are the numbers (0, 1
and 10), all the rest you can find in `rtl.def` and see what they mean.

From GIMPLE, there are two steps left to reach our target, assembly code, and
both involve RTL. The first maps the GIMPLE nodes to pattern names in a
target-independent way, generating a list of RTL `insn`s. The second matches
those `insn` lists to RTL templates described in Machine Description files and
uses those matches to generate the final assembly code.

Those `insn`s are objects that represent code in RTL. Each function is
described with a doubly-linked list of `insn`s. You can think about them as
*instructions* in the RTL world.

In the first step, the RTL `insn` generation step, only the names matter (and
they are hardcoded in the compiler), while in the second the structure of the
`insn` is going to be analyzed as we'll see later.

#### Target-dependent code {#target-dependent}

As we previously said, target-dependent steps are generated at compile time and
then inserted in the final `gcc` executable. All this code is located in one
folder per target, under `gcc/config/$TARGET`, so the CFG is able to load the
target we choose at compile time (using `--target=`) and insert that in the
final executable.

That is done in different ways depending on the type of file we are working
with: Machine Description files are processed by the programs (`gencodes`,
`gentags`...) that generate C code files from them, while target description
macros and functions, which are C files, are inserted in the building process
as any other C file.

I'd like to insist here on the fact that the `--target` is the only one to be
processed and loaded and the other possible targets are going to be ignored.
The build process is not going to complain if a target is broken or anything
like that if it isn't the target we chose. It just doesn't care.

#### Machine Description files {#md}

Machine Description files (`.md` extension) let us define `insn` patterns,
which are incomplete RTL expressions that can be matched against the `insn`
list generated from the GIMPLE, `attributes` and other interesting things we
may not try to decipher here.

`define_insn` is a RTX we can use to define new `insn` patterns. It receives
four or five operands:

1. An optional name. It's going to be used to match against GIMPLE.
2. An RTL template. A vector of *incomplete* RTL expressions which describe how
   should the instruction look like. *Incomplete* in this context means it uses
   expressions like `match_operand` or `match_operator` which are designed to
   match against the RTL `insn` list and see if they are compatible or not.
3. A condition. A final condition to say if the `insn` matches this pattern or
   not.
4. An output template. A string that contains the output assembly code for this
   `insn`. The string can contain special characters like `%` to define where
   should the arguments be inserted. If the output is very complex we can write
   C code on this field too.
5. An optional list of attributes.

This is an actual example from the RISC-V code we are backporting:

``` lisp
(define_insn "adddi3"
  [(set (match_operand:DI 0 "register_operand" "=r,r")
        (plus:DI (match_operand:DI 1 "register_operand" "r,r")
                 (match_operand:DI 2 "arith_operand"    "r,I")))]
  "TARGET_64BIT"
  "add\t%0,%1,%2"
  [(set_attr "type" "arith")
   (set_attr "mode" "DI")])
```

You can see the name `adddi3` is something like: `add` + `di` + `3`. This means
it's the `add` instruction with the `di` mode and `3` input arguments. That's
the way things are named.

Next block is a vector with the RTL template. If you try to ignore
`match_operand` expressions you can see the template is not very different to
the RTL example we gave before. In this case it's something like:

``` lisp
(set (reg 0)
     (plus (reg 1)
           (reg 2)))
```

It's basically storing in the first register the result of the addition of the
other two.

The next field is the condition. In this case it needs to have `TARGET_64BIT`
defined in order to work because the machine mode is `DI` (we'll explain that
soon).

The output code is simple, just a RISC-V `add` instruction:

``` asm
add %0,%1,%2
```

Where `%N` is going to be replaced by the register numbers used as arguments
for this instruction.

The last field are the attributes, which can be used to define the instruction
size and other kind of things. We are not going to focus on them today.

##### Machine modes {#mm}

Machine modes are a way to describe the size of a data object and its
representation.

- QI: quarter integer
- HI: half integer
- SI: single integer
- DI: double integer
- SF: single floating
- DF: double floating

And so on.

The standard `insn` names include machine modes to describe what kind of
instruction they are. The example above is `addddi3`, meaning it uses `di`
machine mode: double integer. That's why it needs the target to be a 64 bit
RISC-V machine.

Machine modes also appear in some RTL expressions like `plus` or
`match_operand` meaning that they operate in that machine mode, that is, with
that data size and representation. For example `(plus:SI ...)`.

##### RTL Templates {#rtl-templates}

`match_*` expressions are what make RTL expressions *incomplete*, because they
are designed to be compared against the `insn` list that comes from the
previous step.

In the example above we had:

``` lisp
(set (match_operand:DI 0 "register_operand" "=r,r")
     (plus:DI (match_operand:DI 1 "register_operand" "r,r")
              (match_operand:DI 2 "arith_operand" "r,I")))
```

`(match_operand N predicate constraint)` is a placeholder for an operand number
`N` of the `insn`. When the `insn` is constructed, the `match_operand` will be
replaced by the corresponding operand of the `insn`. When the template is
trying to match an `insn` the `match_operand` forces the operand number `N` to
match the `predicate` in order to make the `insn` match the template.
The `match_*` expressions are what defines how `insn`s should be.

The `predicate` is a function name to be called. The function receives two
input arguments: an expression and a machine mode. If the function returns `0`
the function does not match.

`predicate`s can also be combined in Machine Description files like this:

``` lisp
(define_predicate "arith_operand"
  (ior (match_operand 0 "const_arith_operand")
       (match_operand 0 "register_operand")))
```

So the `arith_operand` shown in the example above can be a
`const_arith_operand` *or* (that's what `ior` means) a `register_operand`.
They can be more complex but this is more than enough to understand how they
are built. In the end, they always check against C functions, but you can
combine them with the convenience of the Machine Description files.

The `constraint` allows to fine-tune the matching. They define if the argument
is in a register or the memory and stuff like that. `r`, for example, means the
operand comes from a register.


There are other matching expressions too, but `match_operand` is the most used
one and it's the one that explains this concept of *incomplete* expressions the
best.


#### Target description macros and functions {#target-desc}

Apart from the machine descriptor files, there are other files involved. For
example, the constraints defined above need to be defined in code somewhere.

The most important of these are the target description macros and functions,
normally defined in `gcc/config/$TARGET/$TARGET?(.h|.c)`. The `.c` should
initialize the `targetm` variable, which contains all the machine information
relevant to the compiler. It is initialized like this:

``` c
struct gcc_target targetm = TARGET_INITIALIZER;
```

That `TARGET_INITIALIZER` is a huge macro, defined in `gcc/target.h`, that
initializes the `targetm` structure. This macro is split in smaller macros with
reasonable defaults that may be overwritten by pieces. Each target should have
a file that includes both `target.h` and `target-def.h` and overwrites any
inappropriate default by redefining new macros and ends with the initialization
line we just introduced. This is normally done in
`gcc/config/$TARGET/$TARGET.c`, while the `.h` is normally used to define some
macros that are needed in the `.c` file.

As a reference, the RISC-V code we need to backport (see
`gcc/config/riscv/riscv.c`) uses the file to introduce the amount of registers,
the type, the size, and that kind of things, and many others.

All this information contained in `targetm` is used by the compiler to decide
how registers have to be allocated, which ones have preference, the cost of
them, and many other things.

### Assembly code generation {#assembly}

Having the previous step clear is enough to understand how does the assembly
generation work. Each of the `insn`s in the list obtained from GIMPLE is going
to be compared against the RTL templates and the best match is going to be
chosen. Once the match is chosen, the corresponding assembly is going to be
generated from the corresponding field of the `define_insn` RTL expression.

As simple as that, but also that complex.

Why do I say it's complex? Because many things have to be considered and GCC
does consider them. Each instruction has a size, that has to be considered to
calculate addresses, but also they have some execution time associated and GCC
calculates the best matches to make the final assembly file as optimum as
possible.

The RTL step has a lot of optimization passes, too. It's a complex step but
it's not really important for us because we just need to make a temporary
compiler that lets us compile a better one. It doesn't really matter if it's
not perfect, at least at this point.


### Summary {#summary}

So, in summary, the process is the following:

1. The HLL language is parsed to a tree, normally GENERIC.
2. GENERIC is converted to GIMPLE.
3. GIMPLE optimizations are applied.
4. GIMPLE is matched to a `insn` list using pattern names.
6. The `insn` list is matched against the RTL templates defined in the Machine
   Description files. 
7. RTL optimizations are applied.
8. The matches convert the RTL to assembly code also taking in account the
   information obtained from the target definition macros and functions.

From our perspective, the most important things to remember are these:

- The front-end is not very relevant for us, from the parsing to GIMPLE we can
  ignore for the moment.
- RTL step is pretty complex, and the GIMPLE->RTL conversion is too.
- GCC is a compiler collection that has a very powerful compilation process,
  the Compiler Generator Framework (CFG), in order to modularize the code and
  make it easier to port.
- The machine description files and the target definition macros and functions
  are designed to make the porting process simpler. Those are the only files we
  need to touch.

---

<div style="
    text-align: center;
    font-size: smaller;
    padding-left: 3em;
    padding-right: 3em;
    padding-top: 1em;
    padding-bottom: 1em;
    border-top: 1px solid var(--border-color);
    border-bottom: 1px solid var(--border-color)">
If you like my job here, consider hiring <a href="https://elenq.tech">ElenQ
Technology</a>. <br> Even if I'm busy with this, I still have some time slots
available.
</div>

---


### My job in the backport {#job}

With all the process more or less clear, we can be more specific on the job I
need to do. I share some specifics in this section, so if you like reading code 
you are going to have some fun[^examples].

[^examples]: I'll link to some examples of the code on RISC-V's GitHub account.
  This code is already merged in GCC.

First, I need to make sure all the used RTL expressions are compatible with the
old version of the compiler. If they are not, I have to translate them to the
old way to make them. Some examples of this are iterators like
[`(define_int_iterator ...)`][int-iterator] which is not available in old GCC
versions, so I need to unfold a couple of loops by hand and make them only use
the old constructs.

[int-iterator]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.md?plain=1#L342

Second, I need to convert the target description macros and functions to the
old internal C-based API instead of the more [modern C++-based one][c++api], as
the recent port uses. These changes involve many layers and I didn't yet
analyze this in detail. They can be simple like converting from `rtx_insn`
class to `rtx`, the older way to do this. But they can also be complex, like
removing the 40% of the `#include` directives from `riscv.c`, which has many
that were not available in the past.  It's going to be a lot of fun I predict.

[c++api]: https://github.com/riscv-collab/riscv-gcc/blob/ca312387ab141060c20c388d83d6fc4b2099af1d/gcc/config/riscv/riscv.c

Third, as this whole compilation process is complex, I decided to make it as
accessible as possible, so other people can audit and replicate my work. For
that I'm using Guix, my package manager of choice. I added a [`guix.scm`][guix]
and [`channels.scm`][channels] file to the repository so my work can be
replicated precisely by myself in the future, or by others[^github].

The Guix package also provides a better interaction with the building process
of GCC, letting us replace inputs in a very simple way. I'm thinking about the
next steps of the project here, when we need to compile my backported compiler
with TinyCC, test if it works and then patch TinyCC until it does. Having the
`guix.scm` file makes it easy to replace the current compiler with a patched
TinyCC and ensures nothing is interfering in the compilation process because
the compilation is done in an isolated container.

[^github]: I'm hosting this on GitHub at the moment because the repository is
  huge. I'll probably move all this to my server and edit the post after that.

[guix]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/guix.scm
[channels]: https://github.com/ekaitz-zarraga/gcc/blob/guix_package/channels.scm

That's mostly the job I need to do in the backport.

Something to keep in mind is that we don't need to make it perfect, we just
need it to work. The backported GCC is not going to be used as a production
compiler, but just as a bridge with the next GCC version, so **there's only one
program it needs to be able to compile correctly: GCC 7**. Once we make the
bridge with the next version, we can use that to compile anything we want.


### Last words

I know this post is long, and the lack proper diagrams make everything a little
bit hard to understand. That's exactly how I felt reading about GCC, but the
difference was I had to read some documentation that is... About 100 times
longer than this post (see [Learn more](#more) below). Not that bad after all.

There are many things I decided to leave out, like peephole optimizations,
instruction attributes, and some other constructs that are not that important
from my perspective. You may want to make your research on those on your own.

In any case, if you have any question you can always contact me[^contact] and
ask me any questions you have or send me some words of support.

In the next post I'll describe a little bit about ELF, the executable and
linkable format, just the bare minimum to understand the format, as it will be
relevant for us in the future. And you might be thinking, why is it relevant if
GCC compiles to assembly?  Well, that's one of the questions that we will be
answering in the next post.

Now I leave you with a couple of interesting links on the next section.

Good luck with your ports!


[^contact]: You can find my contact info in the [About
  page](/pages/about.html).


### Learn more {#more}

- [The GCC internals documentation](https://gcc.gnu.org/onlinedocs/gccint/): if
  you are interested on my work you should read an older version of the
  documentation. See [Disclaimer](#disclaimer).
- [The GCC source code](https://gcc.gnu.org/git.html): of course, this has
  everything you need to understand GCC, but the problem is that GCC is a huge
  codebase, you probably need to spend months reading it in order to understand
  everything. That's why I think posts like this one are interesting, they help
  you focus on the parts you are interested in.