Title: ELF format — why not?
Date: 2022-03-14
Category:
Tags: Bootstrapping GCC in RISC-V
Slug: bootstrapGcc2
Lang: en
Summary:
    Some introduction to ELF as we'll need to deal with this in the future.

In the [previous post]({filename}01_internals.md) of the
[series]({tag}Bootstrapping GCC in RISC-V) we introduced GCC and how it
generates assembly code and we left a question unanswered: *"Why is learning
about ELF interesting if GCC generates assembly?"*. In this post we are going
to answer that question (not interesting) and maybe understand the very basics
of the ELF file format (more interesting).


### What's ELF

ELF is a file format with two main goals:

- Represent an executable file
- Represent a linkable file

Apart from that, ELF can also represent core dumps, but if you think about it,
all of the possible options have something in common: they represent the
contents of memory. We can simply say ELF is a file format that acts as a
picture of the state of the memory. In the case of executables, the state will
be loaded from the file, but in the case of core dumps the state is obtained
from the memory and dumped to a file.

Linkable files are those files that can be combined with others to generate
executables or shared objects, so they can also fit that definition because
they are going to end up in the memory anyway.

For efficiency reasons, the ELF format has two separate views of the same
contents:

- The **Linking** view is based on sections and needs a *section header*.
- The **Executable** view is based on segments and needs a *program header*.

#### ELF header

The ELF header is the only thing that has a fixed position in the file: the
very beginning. The ELF header has information that defines how to identify the
file, the machine, the endianness and that sort of thing, but it also says
where the section header and the program header are located, the size of their
entries and their entry count.

It's not that interesting, honestly. The most important thing is that it points
to the descriptions of both views (the headers) so we can check them.
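To make the "how to identify the file" part a bit more concrete, here is a minimal sketch of how the first bytes of the ELF header could be checked. The helper names (`is_elf`, `elf_bits`, `elf_is_little_endian`) are made up for this post; the byte positions and values come from the identification array the ELF specification calls `e_ident`.

```c
#include <stddef.h>
#include <stdint.h>

/* Positions inside the 16-byte e_ident array that opens every ELF file. */
enum { EI_CLASS = 4, EI_DATA = 5 };

/* Returns 1 if the buffer starts with the ELF magic number, 0 otherwise. */
int is_elf(const uint8_t *buf, size_t len) {
    return len >= 16 && buf[0] == 0x7f && buf[1] == 'E' &&
           buf[2] == 'L' && buf[3] == 'F';
}

/* Returns 32 or 64 for the known ELF classes, 0 for anything else. */
int elf_bits(const uint8_t *buf, size_t len) {
    if (!is_elf(buf, len)) return 0;
    if (buf[EI_CLASS] == 1) return 32;   /* ELFCLASS32 */
    if (buf[EI_CLASS] == 2) return 64;   /* ELFCLASS64 */
    return 0;
}

/* Returns 1 for little-endian data (ELFDATA2LSB), 0 otherwise. */
int elf_is_little_endian(const uint8_t *buf, size_t len) {
    return is_elf(buf, len) && buf[EI_DATA] == 1;
}
```

Feeding it the first 16 bytes of a 64-bit little-endian RISC-V object should report 64 bits and little-endian data. The rest of the header (machine, header offsets, entry sizes and counts) comes right after those bytes.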

#### Linking view

Based on sections, the linking view is the most detailed view of the file and
it defines how the file should be linked with others in order to create an
executable file.

Sections, the basic unit of the linking view, are consecutive sequences of
bytes that do not overlap.

There are [different types of sections according to their possible contents and
meaning](https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA.junk/sections.html),
the most interesting are:

- `SYMTAB` and `DYNSYM` that hold a symbol table. The `DYNSYM` is for dynamic
  linking symbols, while `SYMTAB` normally is used for static linking but may
  contain both.
- `STRTAB` holds a string table.
- `RELA` contains relocation entries with addends and `REL` contains
  relocations without addends.
- `NOTE` contains some information about the file.
- `HASH` contains a symbol hash table, necessary for dynamic linking.
- `DYNAMIC` for dynamic linking information.

Each section also has a `name`, an `address` if it is supposed to appear in the
memory of a running process, an `offset` that defines where in the file the
section's contents appear, a `size`, and some extra data fields that all
together form a section header entry.

The section header entries are all located where the ELF header says, one after
the other (like a C array of structures), so the programs just need to access
that position in the file and read all the headers in a row. The contents of
the sections are located throughout the file, where the section headers point.
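The "C array of structures" idea can be sketched like this. The field names follow the 64-bit layout of the ELF specification, but the type name `SectionHeader64` and the helper `section_header_offset` are just illustrative names for this post. `e_shoff` and `e_shentsize` are assumed to have been read from the ELF header already.

```c
#include <stdint.h>

/* One 64-bit section header entry (Elf64_Shdr in the specification). */
typedef struct {
    uint32_t sh_name;      /* index into the section-name string table */
    uint32_t sh_type;      /* SYMTAB, STRTAB, RELA, ... */
    uint64_t sh_flags;
    uint64_t sh_addr;      /* address in memory, 0 if it is not loaded */
    uint64_t sh_offset;    /* where the contents start in the file */
    uint64_t sh_size;      /* size of the contents in bytes */
    uint32_t sh_link;
    uint32_t sh_info;
    uint64_t sh_addralign;
    uint64_t sh_entsize;
} SectionHeader64;

/* File offset of entry `index`, given the e_shoff and e_shentsize
   fields taken from the ELF header. */
uint64_t section_header_offset(uint64_t e_shoff, uint16_t e_shentsize,
                               uint16_t index) {
    return e_shoff + (uint64_t)e_shentsize * index;
}
```

With 64-byte entries starting at a hypothetical offset `0x1000`, entry number 3 would be found at `0x10c0`.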

##### String section

The string section (`STRTAB`) is one of the simplest. It contains all the
strings of the file: the section and symbol names. It's simply a set of null
terminated strings, written one after the other (it also starts with a null
character but whatever).

Anywhere in the file where we are supposed to get a string, what we get is an
index that points to the first position in this section to read from. We should
read from there until we reach a null character. For example, in the following
string section:

```
    \0 h e l l o \0 n a m e \0
```

If the name of a section says `1`, the actual name of the section is `hello`,
and if it says `7` it would be `name`. Also, if it says `9` it would be `me`:
strings can share a suffix this way, and this trick is allowed too.
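The lookup described above is small enough to sketch in a few lines. `strtab_get` is a hypothetical helper name, and it assumes the whole string section has already been read into memory.

```c
#include <stddef.h>

/* Returns the string that starts at `index` inside the table, or NULL
   if the index is out of range or the string is never terminated. */
const char *strtab_get(const char *table, size_t table_size, size_t index) {
    if (index >= table_size) return NULL;
    for (size_t i = index; i < table_size; i++) {
        if (table[i] == '\0') return table + index;
    }
    return NULL;
}
```

Using the table from the example, index `1` gives `hello`, `7` gives `name` and `9` gives `me`. Index `0` gives the empty string, which is why the section starts with a null character.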

##### Symbol table

The symbol table contains information needed to locate and relocate a program's
symbolic definitions and references. The symbol table is formed as an array of
symbol elements that are defined with a `name` (an index into a string table),
a `value`, their `size`, some extra `info`, the index of the section header
they relate to (`shndx`) and some `other` stuff.

The `info` field encodes the symbol's type (`OBJECT` for data, `FUNC` for
functions...) and its binding attributes, which define the linking visibility
and behavior of the symbol (local vs global...).
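As a rough sketch of how both things fit in one field: in the ELF specification the `info` byte keeps the binding in its upper four bits and the type in its lower four bits. The function names below are made up for illustration; the constants are the standard values.

```c
#include <stdint.h>

/* Binding values, stored in the upper 4 bits of st_info. */
enum { STB_LOCAL = 0, STB_GLOBAL = 1, STB_WEAK = 2 };
/* Type values, stored in the lower 4 bits of st_info. */
enum { STT_NOTYPE = 0, STT_OBJECT = 1, STT_FUNC = 2,
       STT_SECTION = 3, STT_FILE = 4 };

/* Extract the binding from an st_info byte. */
unsigned sym_bind(uint8_t st_info) { return st_info >> 4; }

/* Extract the type from an st_info byte. */
unsigned sym_type(uint8_t st_info) { return st_info & 0x0f; }

/* Pack a binding and a type back into one st_info byte. */
uint8_t sym_info(unsigned bind, unsigned type) {
    return (uint8_t)((bind << 4) | (type & 0x0f));
}
```

A global function would carry `0x12` in its `info` byte, and a global data object `0x11`.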

The `value` can be interpreted in several ways too, depending on the type of
the symbol you are dealing with. But that's not really relevant for us at the
moment.

##### Relocation

According to the ELF documentation I got from somewhere I don't really
remember:

> The relocation is the process of connecting symbolic references with symbolic
> definitions. 

I hope it's more explanatory for you than what it is to me, but I don't have a
clue of what that is supposed to mean. The
[Wikipedia](https://en.wikipedia.org/wiki/Relocation_(computing)) does a **much
better** job in the specifics right here:

> Relocation is the process of assigning load addresses for position-dependent
> code and data of a program and adjusting the code and data to reflect the
> assigned addresses.

If this doesn't really help, you have a really good example later, but we can
basically say that it's a way to adjust the code to point to the correct
addresses, at linking or loading, or even execution, time.

ELF files have, as we said, sections that let us define relocations. These
point to some parts of the file and tell the linker or the loader that those
positions of the file must be reprocessed.

There are two types of relocation sections and in both of them the relocation
section is an array of entries where each of them represents one relocation.
In the simple one (`REL`) each relocation only contains an `offset` and an
`info` word, which also includes the type of relocation to apply. The more
complex one (`RELA`) is mostly the same but it includes an `addend` which
includes a constant value to use in calculation of the relocation.
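Written as C structures, the two kinds of entries could look like the sketch below. The field names follow the 64-bit ELF layout, while the type names `Rel64` and `Rela64` and the two helpers are illustrative names, not anything taken from a real header file.

```c
#include <stdint.h>

/* Relocation entry without addend (REL). */
typedef struct {
    uint64_t r_offset;   /* where in the section the patch is applied */
    uint64_t r_info;     /* symbol index and relocation type, packed */
} Rel64;

/* Relocation entry with addend (RELA). */
typedef struct {
    uint64_t r_offset;
    uint64_t r_info;
    int64_t  r_addend;   /* constant used when computing the value */
} Rela64;

/* In 64-bit ELF the symbol index lives in the upper 32 bits of r_info
   and the relocation type in the lower 32 bits. */
uint32_t rel_symbol_index(uint64_t r_info) {
    return (uint32_t)(r_info >> 32);
}

uint32_t rel_type(uint64_t r_info) {
    return (uint32_t)(r_info & 0xffffffffu);
}
```

So an `info` word of `0x0000000900000017` would mean "relocation type 23 against symbol number 9". What type 23 actually does is up to each architecture's ABI document.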

The calculation of the final addresses is specific to the ISA and the
relocation type, because processors have different instruction formats and
different ways to pack addresses in instructions. RISC-V has no way to pack a
full address inside a single instruction, while x86 does, so they have to patch
the instructions in a different way.


##### Special sections

Some sections have a special treatment according to their name, normally the
ones that start with a dot. These you might have found in the past in assembly
files, defined like `.data` (for data), `.rodata` (for read only data) or
`.text` (for code).

These are interesting to have in mind because they appear the same way they do
in assembly, and we are going to disassemble some of them and play around with
them.

Other special sections like `.got` or `.dynamic` don't appear in assembly but
they have a strong meaning in the resulting file. We are not going to deal with
those today because we want to finish this post someday. If you need to deal
with those I recommend reading ELF's documentation on special sections and
the loading process.


#### Executable view

The executable view is another way to access the same contents, but with a
different perspective. It's based on *segments* rather than *sections*.
Segments are also pieces of the file, as sections are, but segments can contain
one or more sections.

Like in the linking view, the base units (sections for the linking view,
segments for the executable view) are described in a header. The header of the
executable view is called the program header and it is, like the section
header, a bunch of structures piled together, each describing one of the
segments.

The program header describes the position and size in the file of each of the
segments but also some important information about them: how they are supposed
to be loaded in the memory and where (virtual address and physical address),
the type of the segment, and some more info.

The most interesting segment types are the following:

- `LOAD` is used for loadable segments. The other fields of the entry describe
  the position and the size this segment will have in memory.
- `DYNAMIC` segments hold dynamic linking information. They have to
  contain the `.dynamic` section.
- `INTERP` gives the location and size of a null-terminated path name to invoke
  as an *interpreter*. Interpreter in this context usually means a dynamic
  linker, which is given control first and takes care of loading whatever else
  the program needs (such as shared libraries) before running it.

You can see how segments are interesting for loading the file in the memory,
that is, they are mostly interesting for executable files or shared objects.
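A program header entry can be sketched as the structure below. The field names and constant values follow the 64-bit ELF layout; the type name `ProgramHeader64` and the helpers `segment_perms` and `segment_zero_fill` are hypothetical names chosen for this post.

```c
#include <stdint.h>

enum { PT_LOAD = 1, PT_DYNAMIC = 2, PT_INTERP = 3 };   /* p_type values */
enum { PF_X = 1, PF_W = 2, PF_R = 4 };                 /* p_flags bits  */

/* One 64-bit program header entry (Elf64_Phdr in the specification). */
typedef struct {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;  /* position of the segment in the file */
    uint64_t p_vaddr;   /* virtual address it is loaded at */
    uint64_t p_paddr;   /* physical address, where that matters */
    uint64_t p_filesz;  /* bytes taken from the file */
    uint64_t p_memsz;   /* bytes occupied in memory */
    uint64_t p_align;
} ProgramHeader64;

/* Writes "rwx"-style permissions into out, which must hold 4 bytes. */
void segment_perms(uint32_t p_flags, char out[4]) {
    out[0] = (p_flags & PF_R) ? 'r' : '-';
    out[1] = (p_flags & PF_W) ? 'w' : '-';
    out[2] = (p_flags & PF_X) ? 'x' : '-';
    out[3] = '\0';
}

/* Bytes that exist in memory but not in the file; the loader zero-fills
   them, which is how sections like .bss take no space on disk. */
uint64_t segment_zero_fill(const ProgramHeader64 *ph) {
    return ph->p_memsz > ph->p_filesz ? ph->p_memsz - ph->p_filesz : 0;
}
```

Note how the entry has both a size in the file and a size in memory: a `LOAD` segment may ask for more memory than the bytes it provides.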

#### Segments vs Sections

If you want to have a clear idea about the difference between segments and
sections, you can consider a file with multiple sections: `.text`, `.rodata`
and `.data`.

A file that contains those sections can be understood from a linking
perspective as a file that has some code (`.text`), read-only data (`.rodata`)
and read-write data (`.data`). Each of those parts must be managed in a
different way by the linker, but the reality is that the program loader doesn't
really care about some of the differences of them.

The code and the read-only data are loaded in the memory in the same way, with
read and execute permission but no write permission, so the executable view can
put both sections in the same segment, and make the loader's life easier.

Also, the linker doesn't really care about how the memory is loaded so the
section header does not hold that information. It does care about the sections'
goals though, as it will need to put them together in order during the linking.
On the other hand, the loader is not really interested in what the goal of
the contents of the file is but only in what to do with those contents, so the
program header only has that information.
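One way to picture the relationship is a containment check: a section belongs to a segment when its address range falls inside the segment's memory range. This is a simplified sketch (the helper name `section_in_segment` is made up, and it ignores overflow and sections that are never loaded), but it is roughly how the section-to-segment mapping can be reasoned about.

```c
#include <stdint.h>

/* Returns 1 when the address range of a section lies entirely inside
   the memory range of a segment, 0 otherwise. */
int section_in_segment(uint64_t sh_addr, uint64_t sh_size,
                       uint64_t p_vaddr, uint64_t p_memsz) {
    return sh_addr >= p_vaddr &&
           sh_addr + sh_size <= p_vaddr + p_memsz;
}
```

With hypothetical numbers, a segment covering `0x0` to `0x700` would contain a `.text` section at `0x5e4` of size `0x34`, but not a `.data` section sitting at `0x2000`.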


### So, why do we need to learn it?

We don't really need to learn it very deeply, just learn how it works in a
high-level way and make sure we are able to read it with the tools we have
available. The good news for you is if the reasons I give you are not good
enough it doesn't really matter because you already learned[^gotcha]. Continue
reading and you'll realize how much you understand now.

[^gotcha]: Ha! Gotcha!

First, let me tell you a personal story. I have previous experience working
with assembly, but only in small devices that have two memories, one for data
and another for code (Harvard architecture). In those small devices you often
don't really need to think about how the code and the data are mapped to memory
because your programs are small and the separation is clear. Computers are a
different thing, and I have had issues understanding this whole assembly thing.

Computers store both code and data in the same memory, the main memory (Von
Neumann architecture), and they normally have memory segmentation, paging,
memory management units and all that kind of stuff, because there are many
processes running and they want to separate one from the other. That forces us
to think about how the code and the data are mapped to the memory. Modern
operating systems also use dynamic linkers, which are not available in small
devices, and we need to be able to deal with that amount of complexity.

ELF allows us to do all that, because it was born for that. ELF is a
distillation of many of the ideas from System V Unix, which include exactly all
I mentioned. It's a great way to understand how memory, linking and processes
work in a *modern* operating system. This is why you need to learn it, at least
a little. It makes you a cultivated person, which is always good[^system-v].

[^system-v]: It also makes you understand the complexities of the system so you
  can criticize it. Changing the world requires learning about it first.

#### The specifics

As I'm sure you are not satisfied totally with the answer of being a cultivated
person[^some-of-you], let me go for some specifics.

[^some-of-you]: For those that really are. That's the good attitude in life.
  High five. You can read the whole section still, it has interesting points I
  think.

So in this project GCC is not the only software we are dealing with: GNU
Binutils and TinyCC are part of the party too, and I need to make them fit
together in the best way possible. In those I need to make sure the
relocations, formats and other things work properly, following the RISC-V ABI
specification for ELF. That might be a point of failure, so being prepared, at
least at a high level, is interesting.

Of course, GCC's output we need to analyze too, and in order to do that we need
to make sure we know what it means. We already saw that some ELF sections are
directly mentioned in the assembly, so in order to know their meanings ELF is a
good way to understand them. They are really an OS related thing and ELF only
reflects it, but learning them from the ELF perspective makes the path easier
probably.

Relocations are a huge point in all this mess, because they are machine
specific (instructions are too, but those I expect us to know already), and
they are something I didn't need to research in all the RISC-V adventures I had
last year. I have to do it sometime.

In general, there are many sharp edges where we can get hurt, so it's better if
we wear gloves.

### Tools

For all this process there are a couple of tools that were designed to help.
GNU Binutils has many of them but we are going to focus on two, as they are
more than enough for many use cases: `objdump` and `readelf`.

The example below uses both of them to analyze a piece of code and its
compilation result. As you'll see, the main problem they have is their output:
it's not always clear, the formatting is a little bit chaotic, it's not
obvious at all to get right and it's really hard to use it procedurally.

There is a really cool tool you should investigate though, called GNU Poke,
that is designed specifically to fight against those issues. I recommend you
[take a look at it](https://www.gnu.org/software/poke/).

### Example

Starting from a very simple C file we can follow a really interesting process
and understand some of the ELF internals:

``` c
long global_symbol;

int main() {
  return global_symbol != 0;
}
```

We compile it to assembly with:

``` asdf
$ riscv64-linux-gnu-gcc -S b.c -O0
```

These are the contents of the assembly file:

``` asm
        .file   "b.c"
        .option pic
        .text
        .globl  global_symbol
        .bss
        .align  3
        .type   global_symbol, @object
        .size   global_symbol, 8
global_symbol:
        .zero   8
        .text
        .align  1
        .globl  main
        .type   main, @function
main:
        addi    sp,sp,-16
        sd      s0,8(sp)
        addi    s0,sp,16
        lla     a5,global_symbol
        ld      a5,0(a5)
        snez    a5,a5
        andi    a5,a5,0xff
        sext.w  a5,a5
        mv      a0,a5
        ld      s0,8(sp)
        addi    sp,sp,16
        jr      ra
        .size   main, .-main
        .ident  "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
        .section        .note.GNU-stack,"",@progbits
```

Assemble the file with `as`:

``` asdf
$ riscv64-linux-gnu-as b.s -o b.o
```

And this is what we get in `b.o`. The `.text` section contains the following:

``` asm
$ riscv64-linux-gnu-objdump --disassemble b.o

b.o:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <main>:
   0:   ff010113        addi    sp,sp,-16
   4:   00813423        sd      s0,8(sp)
   8:   01010413        addi    s0,sp,16
   c:   00000797        auipc   a5,0x0
  10:   00078793        mv      a5,a5
  14:   0007b783        ld      a5,0(a5) # c <main+0xc>
  18:   00f037b3        snez    a5,a5
  1c:   0ff7f793        andi    a5,a5,255
  20:   0007879b        sext.w  a5,a5
  24:   00078513        mv      a0,a5
  28:   00813403        ld      s0,8(sp)
  2c:   01010113        addi    sp,sp,16
  30:   00008067        ret
```

### Relocations

There are some relocations!

``` asdf
$ riscv64-linux-gnu-objdump b.o -r

b.o:     file format elf64-littleriscv

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE                  VALUE
000000000000000c R_RISCV_PCREL_HI20    global_symbol
000000000000000c R_RISCV_RELAX         *ABS*
0000000000000010 R_RISCV_PCREL_LO12_I  .L0
0000000000000010 R_RISCV_RELAX         *ABS*
```

But in order to understand those relocations properly we need to check the
value of the symbols too:

``` asdf
$ riscv64-linux-gnu-objdump -t b.o

b.o:     file format elf64-littleriscv

SYMBOL TABLE:
0000000000000000 l    df *ABS*            0000000000000000 b.c
0000000000000000 l    d  .text            0000000000000000 .text
0000000000000000 l    d  .data            0000000000000000 .data
0000000000000000 l    d  .bss             0000000000000000 .bss
0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
000000000000000c l       .text            0000000000000000 .L0 
0000000000000000 l    d  .comment         0000000000000000 .comment
0000000000000000 g     O .bss             0000000000000008 global_symbol
0000000000000000 g     F .text            0000000000000034 main
```

If you pay attention to the offsets of those relocations (`0x0c` and `0x10`)
they exactly match the instructions `auipc a5, 0x0` and `mv a5, a5` and those
are expanded from the `lla a5, global_symbol` (load local address)
pseudoinstruction from the assembly.

The `mv` is not really a `mv`. `mv` is a pseudoinstruction too, one that
expands to `addi a5, a5, 0`. `objdump` is playing with us, making the
opposite conversion so we can read it better, but in fact it is tricking us.
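If you don't trust me (or `objdump`), the instruction word can be taken apart by hand. The following is a small sketch that splits a 32-bit word following the RISC-V I-type format, which is the format `addi` uses; `ITypeFields` and `decode_itype` are names invented for this example.

```c
#include <stdint.h>

/* Fields of a RISC-V I-type instruction. */
typedef struct {
    unsigned opcode;  /* bits 6..0 */
    unsigned rd;      /* bits 11..7 */
    unsigned funct3;  /* bits 14..12 */
    unsigned rs1;     /* bits 19..15 */
    int32_t  imm;     /* bits 31..20, sign-extended */
} ITypeFields;

ITypeFields decode_itype(uint32_t insn) {
    ITypeFields f;
    f.opcode = insn & 0x7f;
    f.rd     = (insn >> 7) & 0x1f;
    f.funct3 = (insn >> 12) & 0x07;
    f.rs1    = (insn >> 15) & 0x1f;
    /* Sign-extend the 12-bit immediate without relying on signed shifts. */
    uint32_t raw = insn >> 20;
    f.imm = (raw & 0x800) ? (int32_t)raw - 0x1000 : (int32_t)raw;
    return f;
}
```

Decoding `0x00078793`, the word at offset `0x10`, gives opcode `0x13` with `funct3` equal to `0` (that is `addi`), `rd` and `rs1` both `15` (register `x15`, also known as `a5`) and an immediate of `0`: an `addi a5, a5, 0`.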

The `auipc` + `addi` pair appears pretty often in RISC-V, because it's the
method it has to load an address into a register. The first instruction,
`auipc`, adds the high part of an immediate to the program counter and stores
the result in a register; the `addi` then adds another immediate, in this case
the low part, to that register. That is, they make a
`x[reg] = pc + immediate` operation in two steps:
`x[reg] = pc + hi20(immediate)` followed by `x[reg] = x[reg] + lo12(immediate)`.
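There is one detail hiding in that split: `addi` sign-extends its 12-bit immediate, so the low part is a number between -2048 and 2047, and the high part has to compensate when the low part turns out negative. A minimal sketch of the split, assuming a 32-bit offset and using the same `hi20` and `lo12` names as above (`rebuild` is an extra helper made up to check the result):

```c
#include <stdint.h>

/* Low 12 bits of the offset, read as a signed value (-2048..2047),
   which is how addi interprets its immediate. */
int32_t lo12(int32_t offset) {
    int32_t low = (int32_t)((uint32_t)offset & 0xfff);
    return (low ^ 0x800) - 0x800;
}

/* High part for auipc: whatever is left once the signed low part is
   taken out, counted in units of 4096. */
int32_t hi20(int32_t offset) {
    return (int32_t)(((int64_t)offset - lo12(offset)) / 4096);
}

/* Puts both halves back together, the way auipc + addi do. */
int64_t rebuild(int32_t hi, int32_t lo) {
    return (int64_t)hi * 4096 + lo;
}
```

For instance, an offset of `0x800` does not split into `0` and `0x800` but into a high part of `1` and a low part of `-2048`, and `1 * 4096 - 2048` is `0x800` again.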

As we have relocations in both `auipc` and `addi`, their `0` values
(the immediates) are going to be overwritten with something else at linking
time, and that's where RISC-V has something to say. All the relocations we can
see are RISC-V specific, and you can read about them in the [RISC-V ABI
Specification](https://github.com/riscv-non-isa/riscv-elf-psabi-doc).

In our case we have really some simple ones, the easiest to understand (what a
coincidence, huh?):

> `R_RISCV_PCREL_HI20`: High 20 bits of 32-bit PC-relative reference,
> `%pcrel_hi(symbol)`. The formula is: `S+A-P` [but only obtains the highest 20
> bits].

> `R_RISCV_PCREL_LO12_I`: Low 12 bits of a 32-bit PC-relative,
> `%pcrel_lo(address of %pcrel_hi)`, the addend must be 0. The formula is:
> `S-P` [but it only obtains the lowest 12 bits].

Both the `HI20` and the `LO12` have a similar formula, this is the meaning of
the elements on the formula:

- `S`: Address of the symbol
- `A`: Addend of the relocation
- `P`: Position of the relocation

If you match their formulas with what we just said about how `auipc` + `addi`
pairs work, you can easily understand the formulas and
their meaning. We are not going to do it, do something yourself!

The other relocation:

> `R_RISCV_RELAX`: Instruction can be relaxed, paired with a normal relocation
>  at the same address.

This one is an addition our example doesn't take advantage of, but it could.
`R_RISCV_RELAX` basically means that the instruction sequence the paired
relocation points at can be simplified by the linker if part of it turns out
not to be needed. And when does that happen? Easy: when we can get
`global_symbol`'s address with only one of the instructions, the other one can
be removed from the program.

#### Relocation resolution

If we link the file and generate an executable, we can see the final value
those zeroes get.

``` asdf
$ riscv64-linux-gnu-gcc b.o -o b.out
```
We link it through `gcc` because `ld` needs a lot of arguments and we don't
want to set them all by hand, but you can do it with `ld` if you feel like it.

``` asdf
$ riscv64-linux-gnu-objdump --disassemble b.out
...
00000000000005e4 <main>:
 5e4:   ff010113        addi    sp,sp,-16
 5e8:   00813423        sd      s0,8(sp)
 5ec:   01010413        addi    s0,sp,16
 5f0:   00002797        auipc   a5,0x2
 5f4:   a6878793        addi    a5,a5,-1432 # 2058 <global_symbol>
 5f8:   0007b783        ld      a5,0(a5)
 5fc:   00f037b3        snez    a5,a5
 600:   0ff7f793        andi    a5,a5,255
 604:   0007879b        sext.w  a5,a5
 608:   00078513        mv      a0,a5
 60c:   00813403        ld      s0,8(sp)
 610:   01010113        addi    sp,sp,16
 614:   00008067        ret
...
```

There you see the relocations were resolved (at `0x5f0` and `0x5f4`) by the
linker and the final values have been filled in. `objdump` is intelligent
enough to tell us where those instructions are pointing (it says
`2058 <global_symbol>`). Just to make sure, we can search the symbol table for
`global_symbol`:

``` asdf
$ riscv64-linux-gnu-objdump -t b.out | grep global_symbol
0000000000002058 g     O .bss   0000000000000008              global_symbol
```
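We can also double-check the numbers ourselves. The sketch below patches the two instruction words we saw in `b.o` with the immediates the linker chose and recomputes the address they produce. The function names are invented for this post, and it only covers the two instruction formats involved here (U-type for `auipc`, I-type for `addi`).

```c
#include <stdint.h>

/* Place a 20-bit value into bits 31..12 (U-type, used by auipc). */
uint32_t patch_utype(uint32_t insn, int32_t hi) {
    return (insn & 0x00000fffu) | (((uint32_t)hi & 0xfffffu) << 12);
}

/* Place a 12-bit value into bits 31..20 (I-type, used by addi). */
uint32_t patch_itype(uint32_t insn, int32_t lo) {
    return (insn & 0x000fffffu) | (((uint32_t)lo & 0xfffu) << 20);
}

/* Address computed by an auipc located at `pc` followed by an addi. */
uint64_t pcrel_target(uint64_t pc, int32_t hi, int32_t lo) {
    return pc + (uint64_t)((int64_t)hi * 4096 + lo);
}
```

Patching `0x00000797` with `2` gives `0x00002797`, and patching `0x00078793` with `-1432` gives `0xa6878793`, the exact words in the linked disassembly. And `0x5f0 + 2 * 4096 - 1432` is `0x2058`, the address of `global_symbol`.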

> NOTE: We could try to calculate the address of the `global_symbol` as the
> linker did, but it's a little bit complicated because we also linked the file
> with the standard library and the startup files, which adds the `crt` files
> on top of the file. It's really that we get more code than what we had in the
> assembly file. If you want to see that, you can see the rest of the output of
> the command, or even try with `--disassemble-all` and calculate the symbol
> address by hand. Good luck.

#### More sections

If you want to review some simple things, like a string section, you can use
`readelf` for that. The `-p` flag (equivalent to `--string-dump=`) displays the
contents of the section as strings. You can read the `.comment` section that
way:

``` asdf
$ riscv64-linux-gnu-readelf -p .comment b.o

String dump of section '.comment':
  [     1]  GCC: (Debian 10.2.1-6) 10.2.1 20210110
```

This is what the compiler inserted with `.ident` in the assembly file.
We have it in the binary too.

In other distros the output is a little bit different. Look at the output we
have in Guix:

``` asdf
String dump of section '.comment':
  [     1]  GCC: (GNU) 11.2.0
```

### Conclusion

So this whole thing was just to explain that ELF files are some kind of dual
files that have two different goals at the same time. The executable one is
kind of a picture of the memory state that can be used for loading that state
in the memory, while the linking one just describes how different parts of the
contents relate to each other and has tons of funny tricks to make the files
relocatable, position independent and that kind of thing. Cool.

There are still many fields of ELF we didn't talk about but I consider this
introduction more than enough. Having a simple understanding of how the
file is organized and what kind of information it has is probably enough for
the things we are going to need.

The proposed example shows that with the knowledge obtained from this short
introduction we can dig a little bit into the files that result from a
compilation and analyze their internals. That's mostly the work I'll need to do
when I start combining compilers in a pipeline of death and destruction.

If I ever need to dig into something deeper, I will.

Anyway, I'm still unsure if I answered the question we left in the previous
post[^cliff]:

> Why is learning about ELF interesting if GCC generates assembly?

Did I?

[^cliff]: It was a good cliffhanger, though.