From 57ebc576bf74ca887f9c77cb6b7771b9ff2b1843 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Mon, 14 Mar 2022 23:52:34 +0100
Subject: NlNet project posts 0-2

---
 content/bootstrapGcc/02_elf.md | 595 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 595 insertions(+)
 create mode 100644 content/bootstrapGcc/02_elf.md

diff --git a/content/bootstrapGcc/02_elf.md b/content/bootstrapGcc/02_elf.md
new file mode 100644
index 0000000..c684c63
--- /dev/null
+++ b/content/bootstrapGcc/02_elf.md
@@ -0,0 +1,595 @@
+Title: ELF format — why not?
+Date: 2022-03-14
+Category:
+Tags: Bootstrapping GCC in RISC-V
+Slug: bootstrapGcc2
+Lang: en
+Summary:
+    Some introduction to ELF as we'll need to deal with this in the future.
+
+In the [previous post]({filename}01_internals.md) of the
+[series]({tag}Bootstrapping GCC in RISC-V) we introduced GCC and how it
+generates assembly code, and we left a question unanswered: *"Why is learning
+about ELF interesting if GCC generates assembly?"*. In this post we are going
+to answer that question (not interesting) and maybe understand the very basics
+of the ELF file format (more interesting).
+
+
+### What's ELF
+
+ELF is a file format with two main goals:
+
+- Represent an executable file
+- Represent a linkable file
+
+Apart from that, ELF can also represent core dumps, but if you think about it,
+all of these options have something in common: they represent the contents of
+memory. We can simply say ELF is a file format that acts as a picture of the
+state of the memory. In the case of executables, the state will be loaded from
+the file, but in the case of core dumps the state is obtained from the memory
+and dumped to a file.
+
+Linkable files are those files that can be combined with others to generate
+executables or shared objects, so they can also fit that definition because
+they are going to end up in the memory anyway.
+
+For efficiency reasons, the ELF format has two separate views of the same
+contents:
+
+- The **Linking** view is based on sections and needs a *section header*.
+- The **Executable** view is based on segments and needs a *program header*.
+
+#### ELF header
+
+The ELF header is the only thing that has a fixed position in the file: the
+very beginning. It holds the information needed to identify the file, the
+machine, the endianness and that sort of thing, but it also says where the
+headers are located, how big their entries are and how many entries they
+have.
+
+It's not that interesting, honestly. The most important thing is that it points
+to the descriptions of both views (the headers) so we can check them.
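+
+To make that "it points to both headers" idea a bit more tangible, here is a
+minimal Python sketch that reads those fields. It assumes a 64-bit
+little-endian file and follows the header layout from the ELF specification.
+The function name and the sample values are made up for the illustration:
+
+``` python
+import struct
+
+# ELF64 header layout from the ELF specification: 64 bytes, no padding.
+EHDR64 = struct.Struct("<16sHHIQQQIHHHHHH")
+
+
+def parse_elf64_header(data):
+    """Return the header fields that locate both views of the file."""
+    (ident, e_type, machine, _version, _entry, phoff, shoff, _flags, _ehsize,
+     phentsize, phnum, shentsize, shnum, shstrndx) = EHDR64.unpack_from(data)
+    if ident[:4] != b"\x7fELF":
+        raise ValueError("not an ELF file")
+    if ident[4] != 2 or ident[5] != 1:
+        raise ValueError("this sketch only handles 64-bit little-endian")
+    return {
+        "type": e_type,
+        "machine": machine,
+        # Program header (executable view): offset, entry size, entry count.
+        "phoff": phoff, "phentsize": phentsize, "phnum": phnum,
+        # Section header (linking view): offset, entry size, entry count.
+        "shoff": shoff, "shentsize": shentsize, "shnum": shnum,
+        # Index of the section that stores the section names.
+        "shstrndx": shstrndx,
+    }
+
+
+# Hand-made header that mimics a relocatable object (all values invented).
+ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
+fake = EHDR64.pack(ident, 1, 243, 1, 0, 0, 0x400, 0, 64, 0, 0, 64, 11, 10)
+header = parse_elf64_header(fake)
+print(header["machine"], hex(header["shoff"]), header["shnum"])
+```
+
+With a real file you would pass the first 64 bytes read in binary mode instead
+of the hand-made `fake` value.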
+
+#### Linking view
+
+Based on sections, the linking view is the most detailed view of the file and
+it defines how the file should be linked with others in order to create an
+executable file.
+
+Sections, the basic unit of the linking view, are consecutive sequences of
+bytes that do not overlap.
+
+There are [different types of sections according to their possible contents and
+meaning](https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA.junk/sections.html),
+the most interesting are:
+
+- `SYMTAB` and `DYNSYM` that hold a symbol table. The `DYNSYM` is for dynamic
+  linking symbols, while `SYMTAB` normally is used for static linking but may
+  contain both.
+- `STRTAB` holds a string table.
+- `RELA` contains relocation entries with addends and `REL` contains
+  relocations without addends.
+- `NOTE` contains some extra information about the file.
+- `HASH` contains a symbol hash table, necessary for dynamic linking.
+- `DYNAMIC` for dynamic linking information.
+
+Each section also has a `name`, an `address` if it is supposed to appear in the
+memory of a running process, an `offset` that defines where in the file the
+section's contents appear, a `size`, and some extra data fields that all
+together form a section header entry.
+
+The section header entries are all located where the ELF header says, one after
+the other (like a C array of structures), so the programs just need to access
+that position in the file and read all the headers in a row. The contents of
+the sections are located throughout the file, where the section headers point.
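+
+The "C array of structures" can be walked in a few lines. This is only a
+sketch for 64-bit little-endian files, with the entry layout taken from the
+ELF specification. The helper name and the two sample entries are invented,
+and a careful reader would use the entry size announced by the ELF header
+instead of assuming it:
+
+``` python
+import struct
+
+# One ELF64 section header entry: 64 bytes, per the ELF specification.
+SHDR64 = struct.Struct("<IIQQQQIIQQ")
+FIELDS = ("name", "type", "flags", "addr", "offset", "size",
+          "link", "info", "addralign", "entsize")
+
+
+def read_section_headers(data, shoff, shnum):
+    """Read `shnum` consecutive entries starting at file offset `shoff`."""
+    entries = []
+    for i in range(shnum):
+        raw = SHDR64.unpack_from(data, shoff + i * SHDR64.size)
+        entries.append(dict(zip(FIELDS, raw)))
+    return entries
+
+
+# Two made-up entries placed at offset 16 of a fake file.
+null_entry = SHDR64.pack(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
+code_entry = SHDR64.pack(1, 1, 6, 0, 0x40, 0x34, 0, 0, 4, 0)
+fake_file = bytes(16) + null_entry + code_entry
+
+sections = read_section_headers(fake_file, shoff=16, shnum=2)
+for sh in sections:
+    # `name` is not a string yet, it is an index into a string section.
+    print(sh["name"], sh["type"], hex(sh["offset"]), hex(sh["size"]))
+```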
+
+##### String section
+
+The string section (`STRTAB`) is one of the simplest. It contains the strings
+of the file: the section and symbol names. It's simply a set of null-terminated
+strings, written one after the other (it also starts with a null character but
+whatever).
+
+Anywhere in the file where we are supposed to get a string, what we get instead
+is an index that points to the first position in this section to read from. We
+read from that position until we reach a null character. For example, in the
+following string section:
+
+```
+    \0 h e l l o \0 n a m e \0
+```
+
+If the name of a section says `1`, the actual name of the section is `hello`,
+and if it says `7` it would be `name`. Also, if it says `9` it would be `me`:
+an index can point to the middle of a string, a trick that can be used too.
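+
+The lookup is simple enough to write down. A minimal Python sketch using the
+same sample table (the function name is made up):
+
+``` python
+def strtab_lookup(table, index):
+    """Return the null-terminated string that starts at `index`."""
+    end = table.index(b"\0", index)
+    return table[index:end].decode("ascii")
+
+
+table = b"\0hello\0name\0"
+print(strtab_lookup(table, 1))  # hello
+print(strtab_lookup(table, 7))  # name
+print(strtab_lookup(table, 9))  # me
+```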
+
+##### Symbol table
+
+The symbol table contains information needed to locate and relocate a program's
+symbolic definitions and references. The symbol table is formed as an array of
+symbol elements that are defined with a `name`, obviously a `value`, their
+`size`, some extra `info`, the index of the section header they relate to
+(`shndx`) and some `other` stuff.
+
+The `info` field encodes the symbol's type (`OBJECT` for data, `FUNC` for
+function...) and its binding attributes, which define the linking visibility
+and behavior of the symbol (local vs global...).
+
+The `value` can be interpreted in several ways too, depending on the type of
+the symbol you are dealing with. But that's not really relevant for us at the
+moment.
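+
+As a rough illustration of how those fields are packed, this Python sketch
+decodes one 64-bit little-endian symbol entry following the layout in the ELF
+specification: the binding lives in the high four bits of `info` and the type
+in the low four. The function name and the sample entry are invented:
+
+``` python
+import struct
+
+# One ELF64 symbol entry: name, info, other, shndx, value, size (24 bytes).
+SYM64 = struct.Struct("<IBBHQQ")
+BINDINGS = {0: "LOCAL", 1: "GLOBAL", 2: "WEAK"}
+TYPES = {0: "NOTYPE", 1: "OBJECT", 2: "FUNC", 3: "SECTION", 4: "FILE"}
+
+
+def parse_symbol(raw):
+    name, info, other, shndx, value, size = SYM64.unpack(raw)
+    return {
+        "name": name,                       # index into a string section
+        "bind": BINDINGS.get(info >> 4, "?"),
+        "type": TYPES.get(info & 0xF, "?"),
+        "other": other,
+        "shndx": shndx,
+        "value": value,
+        "size": size,
+    }
+
+
+# Made-up entry: a global data object of 8 bytes that lives in section 4.
+raw = SYM64.pack(1, (1 << 4) | 1, 0, 4, 0, 8)
+symbol = parse_symbol(raw)
+print(symbol["bind"], symbol["type"], symbol["size"])
+```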
+
+##### Relocation
+
+According to the ELF documentation I got from somewhere I don't really
+remember:
+
+> The relocation is the process of connecting symbolic references with symbolic
+> definitions. 
+
+I hope it explains more to you than it does to me, because I don't have a clue
+of what that is supposed to mean.
+[Wikipedia](https://en.wikipedia.org/wiki/Relocation_(computing)) does a **much
+better** job with the specifics:
+
+> Relocation is the process of assigning load addresses for position-dependent
+> code and data of a program and adjusting the code and data to reflect the
+> assigned addresses.
+
+If this doesn't really help, there is a really good example later, but we can
+basically say that it's a way to adjust the code to point to the correct
+addresses at link time, load time or even execution time.
+
+ELF files have, as we said, sections that let us define relocations. These
+point to some parts of the file and tell the linker or the loader that those
+positions of the file must be reprocessed.
+
+There are two types of relocation sections, and both of them are an array of
+entries where each entry represents one relocation. In the simple one (`REL`)
+each relocation only contains an `offset` and an `info` word, which encodes
+the symbol the relocation refers to and the type of relocation to apply. The
+more complex one (`RELA`) is mostly the same but adds an `addend`: a constant
+value to use in the calculation of the relocation.
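+
+A small Python sketch of a `RELA` entry may help to picture it. It assumes a
+64-bit little-endian file, where the ELF specification stores the symbol index
+in the upper 32 bits of `info` and the relocation type in the lower 32. The
+function name and the sample numbers are invented:
+
+``` python
+import struct
+
+# One ELF64 RELA entry: offset, info and a signed addend (24 bytes).
+RELA64 = struct.Struct("<QQq")
+
+
+def parse_rela(raw):
+    offset, info, addend = RELA64.unpack(raw)
+    return {
+        "offset": offset,             # where to patch
+        "sym": info >> 32,            # index in the symbol table
+        "type": info & 0xFFFFFFFF,    # machine-specific relocation type
+        "addend": addend,
+    }
+
+
+# Made-up entry: patch offset 0x0c using symbol 9 and relocation type 23.
+raw = RELA64.pack(0x0C, (9 << 32) | 23, 0)
+rela = parse_rela(raw)
+print(hex(rela["offset"]), rela["sym"], rela["type"], rela["addend"])
+```
+
+A `REL` entry would be the same thing without the last field.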
+
+The calculation of the final addresses is specific to the ISA and the
+relocation type, because processors have different instruction formats and
+different ways to pack addresses in instructions. RISC-V has no way to pack a
+full address inside a single instruction, while x86 does, so they have to patch
+the instructions in a different way.
+
+
+##### Special sections
+
+Some sections get special treatment according to their name, normally the
+ones that start with a dot. You might have found these in the past in assembly
+files, defined like `.data` (for data), `.rodata` (for read-only data) or
+`.text` (for code).
+
+These are interesting to keep in mind because they appear the same way they do
+in assembly, and we are going to disassemble some of them and play around with
+them.
+
+Other special sections like `.got` or `.dynamic` don't appear in assembly but
+they have a strong meaning in the resulting file. We are not going to deal with
+those today because we want to finish this post someday. If you need to deal
+with those, I recommend reading ELF's documentation on special sections and
+the loading process.
+
+
+#### Executable view
+
+The executable view is another way to access the same contents, but with a
+different perspective. It's based on *segments* rather than *sections*.
+Segments are also pieces of the file, as sections are, but segments can contain
+one or more sections.
+
+As in the linking view, the base units (sections for the linking view,
+segments for the executable view) are described in a header. The header of the
+executable view is called the program header and it is, like the section
+header, a bunch of structures piled together, each describing one of the
+segments.
+
+The program header describes the position and size in the file of each of the
+segments but also some important information about them: how and where they
+are supposed to be loaded in memory (virtual address and physical address),
+the type of the segment, and some more information.
+
+The most interesting segment types are the following:
+
+- `LOAD` is used for loadable segments. The other fields of the entry describe
+  the position and the size this segment will have in memory.
+- `DYNAMIC` segments hold dynamic linking information. They have to contain
+  the `.dynamic` section.
+- `INTERP` gives the location and size of a null-terminated path name to invoke
+  as an *interpreter*. Interpreter in this context usually means a dynamic
+  linker: it is called instead of loading this file to memory directly, and
+  the dynamic linker is the one that loads the parts of the file it considers
+  necessary.
+
+You can see how segments are interesting for loading the file in the memory,
+that is, they are mostly interesting for executable files or shared objects.
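+
+To see what a loader has to look at, here is a minimal Python sketch that
+decodes one program header entry. It assumes a 64-bit little-endian file and
+uses the layout, segment type numbers and permission bits from the ELF
+specification. The function name and the sample segment are made up:
+
+``` python
+import struct
+
+# One ELF64 program header entry (56 bytes): type, flags, offset, vaddr,
+# paddr, size in the file, size in memory and alignment.
+PHDR64 = struct.Struct("<IIQQQQQQ")
+SEGMENT_TYPES = {1: "LOAD", 2: "DYNAMIC", 3: "INTERP"}
+
+
+def describe_segment(raw):
+    p_type, flags, offset, vaddr, _paddr, filesz, memsz, _align = \
+        PHDR64.unpack(raw)
+    perms = "".join((
+        "r" if flags & 4 else "-",
+        "w" if flags & 2 else "-",
+        "x" if flags & 1 else "-",
+    ))
+    name = SEGMENT_TYPES.get(p_type, hex(p_type))
+    return name, perms, offset, vaddr, filesz, memsz
+
+
+# Made-up loadable segment with read and execute permissions.
+raw = PHDR64.pack(1, 5, 0, 0, 0, 0x6E4, 0x6E4, 0x1000)
+segment = describe_segment(raw)
+print(segment)
+```
+
+Note there is no section name anywhere in the entry: type, permissions,
+positions and sizes are all the loader gets.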
+
+#### Segments vs Sections
+
+If you want to have a clear idea about the difference between segments and
+sections, you can consider a file with multiple sections: `.text`, `.rodata`
+and `.data`.
+
+A file that contains those sections can be understood from a linking
+perspective as a file that has some code (`.text`), read-only data (`.rodata`)
+and read-write data (`.data`). Each of those parts must be managed in a
+different way by the linker, but the reality is that the program loader doesn't
+really care about some of the differences between them.
+
+The code and the read-only data are loaded in memory in the same way, with
+read and execute permission but no write permission, so the executable view can
+put both sections in the same segment and make the loader's life easier.
+
+Also, the linker doesn't really care about how the memory is loaded, so the
+section header does not hold that information. It does care about the purpose
+of each section though, as it will need to put them together in order during
+the linking. On the other hand, the loader is not really interested in the
+purpose of the contents of the file but only in what to do with them, so the
+program header only holds that information.
+
+
+### So, why do we need to learn it?
+
+We don't really need to learn it very deeply, just learn how it works in a
+high-level way and make sure we are able to read it with the tools we have
+available. The good news for you is that if the reasons I give you are not good
+enough, it doesn't really matter, because you already learned it[^gotcha].
+Continue reading and you'll realize how much you understand now.
+
+[^gotcha]: Ha! Gotcha!
+
+First, let me tell you a personal story. I have previous experience working
+with assembly, but only in small devices that have two memories, one for data
+and another for code (Harvard Architecture). In those small devices you often
+don't really need to think about how the code and the data are mapped to memory
+because your programs are small and the separation is clear. Computers are a
+different thing, and I have had issues understanding this whole assembly thing.
+
+Computers store both code and data in the same memory, the main memory (Von
+Neumann Architecture), and they normally have memory segmentation, paging,
+memory management units and all that kind of stuff, because there are many
+processes running and they want to separate one from the other. That forces us
+to think about how the code and the data are mapped to the memory. Modern
+operating systems also use dynamic linkers, which are not available in small
+devices, and we need to be able to deal with that amount of complexity.
+
+ELF allows us to do all that, because it was born for that. ELF is a
+distillation of many of the ideas from System V Unix, which include everything
+I mentioned. It's a great way to understand how memory, linking and processes
+work in a *modern* operating system. This is why you need to learn it, at least
+a little. It makes you a cultivated person, which is always good[^system-v].
+
+[^system-v]: It also makes you understand the complexities of the system so you
+  can criticize it. Changing the world requires learning about it first.
+
+#### The specifics
+
+As I'm sure you are not totally satisfied with the answer of being a cultivated
+person[^some-of-you], let me go for some specifics.
+
+[^some-of-you]: For those that really are. That's the good attitude in life.
+  High five. You can read the whole section still, it has interesting points I
+  think.
+
+So in this project GCC is not the only software we are dealing with: GNU
+Binutils and TinyCC are part of the party too, and I need to make them fit
+together in the best way possible. In those I need to make sure the
+relocations, formats and other things work properly, following the RISC-V ABI
+specification for ELF. That might be a point of failure, so being prepared, at
+least at a high level, is interesting.
+
+Of course, we need to analyze GCC's output too, and in order to do that we need
+to know what it means. We already saw that some ELF sections are directly
+mentioned in the assembly, so ELF is a good way to understand their meaning.
+They are really an OS-related thing and ELF only reflects it, but learning
+them from the ELF perspective probably makes the path easier.
+
+Relocations are a huge point in all this mess, because they are machine
+specific (instructions are too, but those I expect us to know already), and
+they are something I didn't need to research on all the RISC-V adventures I had
+last year. I have to do it sometime.
+
+In general, there are many sharp edges where we can get hurt, so it's better if
+we wear gloves.
+
+### Tools
+
+For this whole process there are a couple of tools that were designed to help.
+GNU Binutils has many of them but we are going to focus on two, as they are
+more than enough for many use cases: `objdump` and `readelf`.
+
+The example below uses both of them to analyze a piece of code and its
+compilation result. As you'll see, the main problem they have is their output:
+it's not always clear, the formatting is a little bit chaotic, it's not
+obvious at all how to read it right and it's really hard to process it from
+other programs.
+
+There is a really cool tool you should investigate though, called GNU Poke,
+that is designed specifically to fight against those issues. I recommend you
+[take a look at it](https://www.gnu.org/software/poke/).
+
+### Example
+
+Starting from a very simple C file we can follow a really interesting process
+and understand some of the ELF internals:
+
+``` c
+long global_symbol;
+
+int main() {
+  return global_symbol != 0;
+}
+```
+
+We compile it to assembly with:
+
+``` asdf
+$ riscv64-linux-gnu-gcc -S b.c -O0
+```
+
+These are the contents of the assembly file:
+
+``` asm
+        .file   "b.c"
+        .option pic
+        .text
+        .globl  global_symbol
+        .bss
+        .align  3
+        .type   global_symbol, @object
+        .size   global_symbol, 8
+global_symbol:
+        .zero   8
+        .text
+        .align  1
+        .globl  main
+        .type   main, @function
+main:
+        addi    sp,sp,-16
+        sd      s0,8(sp)
+        addi    s0,sp,16
+        lla     a5,global_symbol
+        ld      a5,0(a5)
+        snez    a5,a5
+        andi    a5,a5,0xff
+        sext.w  a5,a5
+        mv      a0,a5
+        ld      s0,8(sp)
+        addi    sp,sp,16
+        jr      ra
+        .size   main, .-main
+        .ident  "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
+        .section        .note.GNU-stack,"",@progbits
+```
+
+Assemble the file with `as`:
+
+``` asdf
+$ riscv64-linux-gnu-as b.s -o b.o
+```
+
+And this is what we get in `b.o`. The `.text` section contains the following:
+
+``` asm
+$ riscv64-linux-gnu-objdump --disassemble b.o
+
+b.o:     file format elf64-littleriscv
+
+
+Disassembly of section .text:
+
+0000000000000000 <main>:
+   0:   ff010113        addi    sp,sp,-16
+   4:   00813423        sd      s0,8(sp)
+   8:   01010413        addi    s0,sp,16
+   c:   00000797        auipc   a5,0x0
+  10:   00078793        mv      a5,a5
+  14:   0007b783        ld      a5,0(a5) # c <main+0xc>
+  18:   00f037b3        snez    a5,a5
+  1c:   0ff7f793        andi    a5,a5,255
+  20:   0007879b        sext.w  a5,a5
+  24:   00078513        mv      a0,a5
+  28:   00813403        ld      s0,8(sp)
+  2c:   01010113        addi    sp,sp,16
+  30:   00008067        ret
+```
+
+### Relocations
+
+There are some relocations!
+
+``` asdf
+$ riscv64-linux-gnu-objdump b.o -r
+
+b.o:     file format elf64-littleriscv
+
+RELOCATION RECORDS FOR [.text]:
+OFFSET           TYPE                  VALUE
+000000000000000c R_RISCV_PCREL_HI20    global_symbol
+000000000000000c R_RISCV_RELAX         *ABS*
+0000000000000010 R_RISCV_PCREL_LO12_I  .L0
+0000000000000010 R_RISCV_RELAX         *ABS*
+```
+
+But in order to understand those relocations properly we need to check the
+value of the symbols too:
+
+``` asdf
+$ riscv64-linux-gnu-objdump -t b.o
+
+b.o:     file format elf64-littleriscv
+
+SYMBOL TABLE:
+0000000000000000 l    df *ABS*            0000000000000000 b.c
+0000000000000000 l    d  .text            0000000000000000 .text
+0000000000000000 l    d  .data            0000000000000000 .data
+0000000000000000 l    d  .bss             0000000000000000 .bss
+0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
+000000000000000c l       .text            0000000000000000 .L0 
+0000000000000000 l    d  .comment         0000000000000000 .comment
+0000000000000000 g     O .bss             0000000000000008 global_symbol
+0000000000000000 g     F .text            0000000000000034 main
+```
+
+If you pay attention to the offsets of those relocations (`0x0c` and `0x10`),
+they exactly match the instructions `auipc a5, 0x0` and `mv a5, a5`, which
+are expanded from the `lla a5, global_symbol` (load local address)
+pseudoinstruction in the assembly.
+
+The `mv` is not really a `mv`. `mv` is a pseudoinstruction too, and it is
+encoded as `addi a5, a5, 0`. `objdump` is playing with us: it makes the
+opposite conversion so the output reads better, but in fact it is hiding the
+real instruction.
+
+The `auipc` + `addi` couple appears pretty often in RISC-V, because it's the
+way it has to load memory addresses into a register. The first instruction,
+`auipc`, adds the high part of an immediate to the program counter and stores
+the result in a register. The `addi` then adds another immediate, in this case
+the low part, to that register. That is, they make an
+`x[reg] = pc + immediate` operation in two steps:
+`x[reg] = pc + hi20(immediate)` followed by `x[reg] = x[reg] + lo12(immediate)`.
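+
+There is one detail hidden in that split: `addi` sign-extends its 12-bit
+immediate, so the high part has to be rounded to compensate when the low part
+comes out negative. A minimal Python sketch of the idea (the function names
+and the sample offsets are invented, this is not taken from any toolchain):
+
+``` python
+def split_pcrel(offset):
+    """Split a PC-relative offset into the auipc and addi immediates."""
+    hi20 = (offset + 0x800) >> 12     # rounded, because lo12 is signed
+    lo12 = offset - (hi20 << 12)      # always lands in [-2048, 2047]
+    return hi20, lo12
+
+
+def rebuild(pc, hi20, lo12):
+    reg = pc + (hi20 << 12)           # what auipc computes
+    return reg + lo12                 # what addi computes
+
+
+for offset in (0x10, 0x7FF, 0x800, 0x12345, -0x2C):
+    hi20, lo12 = split_pcrel(offset)
+    assert -2048 <= lo12 <= 2047
+    assert rebuild(0x1000, hi20, lo12) == 0x1000 + offset
+    print(hex(offset), hi20, lo12)
+```
+
+See how `0x800` does not fit in the signed low part, so it becomes a high part
+of `1` and a low part of `-2048`.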
+
+As we have relocations in both `auipc` and `addi`, their `0` values (the
+immediates) are going to be overwritten with something else at link time, and
+that's where RISC-V has something to say. All the relocations we can see are
+RISC-V specific, and you can read about them in the [RISC-V ABI
+Specification](https://github.com/riscv-non-isa/riscv-elf-psabi-doc).
+
+In our case we have some really simple ones, the easiest to understand (what a
+coincidence, huh?):
+
+> `R_RISCV_PCREL_HI20`: High 20 bits of 32-bit PC-relative reference,
+> `%pcrel_hi(symbol)`. The formula is: `S+A-P` [but only obtains the highest 20
+> bits].
+
+> `R_RISCV_PCREL_LO12_I`: Low 12 bits of a 32-bit PC-relative,
+> `%pcrel_lo(address of %pcrel_hi)`, the addend must be 0. The formula is:
+> `S-P` [but it only obtains the lowest 12 bits].
+
+Both the `HI20` and the `LO12` have a similar formula. This is the meaning of
+the elements of the formula:
+
+- `S`: Address of the symbol
+- `A`: Addend of the relocation
+- `P`: Position of the relocation
+
+If you match the formulas with what we just said about how `auipc` + `addi`
+couples work, you can easily understand the formulas and their meaning. We are
+not going to do it, do something yourself!
+
+The other relocation:
+
+> `R_RISCV_RELAX`: Instruction can be relaxed, paired with a normal relocation
+>  at the same address.
+
+is an optimization hint that our example doesn't take advantage of, but it
+could. `R_RISCV_RELAX` basically means that the linker is allowed to simplify
+the instructions covered by the relocation it is paired with when they are not
+all needed. And when does that happen? Easy: when `global_symbol`'s address can
+be obtained with only one of the two instructions, the other one can be removed
+from the program.
+
+#### Relocation resolution
+
+If we link the file and generate an executable, we can see the final value
+those zeroes get.
+
+``` asdf
+$ riscv64-linux-gnu-gcc b.o -o b.out
+```
+
+We link it through `gcc` because `ld` needs a lot of arguments and we don't
+want to set them all by hand, but you can call `ld` directly if you feel like
+it.
+
+``` asdf
+$ riscv64-linux-gnu-objdump --disassemble b.out
+...
+00000000000005e4 <main>:
+ 5e4:   ff010113        addi    sp,sp,-16
+ 5e8:   00813423        sd      s0,8(sp)
+ 5ec:   01010413        addi    s0,sp,16
+ 5f0:   00002797        auipc   a5,0x2
+ 5f4:   a6878793        addi    a5,a5,-1432 # 2058 <global_symbol>
+ 5f8:   0007b783        ld      a5,0(a5)
+ 5fc:   00f037b3        snez    a5,a5
+ 600:   0ff7f793        andi    a5,a5,255
+ 604:   0007879b        sext.w  a5,a5
+ 608:   00078513        mv      a0,a5
+ 60c:   00813403        ld      s0,8(sp)
+ 610:   01010113        addi    sp,sp,16
+ 614:   00008067        ret
+...
+```
+
+There you can see the linker resolved the relocations (at `0x5f0` and `0x5f4`)
+and filled in the final values. `objdump` is intelligent enough to tell us
+where those instructions are pointing (it says `2058 <global_symbol>`). Just
+to make sure, we can search the symbol table for `global_symbol`:
+
+``` asdf
+$ riscv64-linux-gnu-objdump -t b.out | grep global_symbol
+0000000000002058 g     O .bss   0000000000000008              global_symbol
+```
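+
+We can also double check the patched immediates by hand. The following Python
+sketch takes the two instruction words and the addresses from the `objdump`
+output above, extracts the immediates and rebuilds the address. The only
+assumption is an addend of `0`, and the helper name is made up:
+
+``` python
+def sign_extend(value, bits):
+    mask = 1 << (bits - 1)
+    return (value ^ mask) - mask
+
+
+# Instruction words at 0x5f0 (auipc) and 0x5f4 (addi) in the linked file.
+auipc_word, addi_word = 0x00002797, 0xA6878793
+auipc_pc = 0x5F0
+
+hi20 = sign_extend(auipc_word >> 12, 20)   # U-type immediate, bits 31..12
+lo12 = sign_extend(addi_word >> 20, 12)    # I-type immediate, bits 31..20
+
+target = auipc_pc + (hi20 << 12) + lo12
+print(hi20, lo12, hex(target))             # 2 -1432 0x2058
+
+# The same number, starting from the relocation formula S + A - P.
+S, A, P = 0x2058, 0, 0x5F0
+offset = S + A - P
+assert (hi20 << 12) + lo12 == offset
+print(hex(offset))                         # 0x1a68
+```
+
+The result matches the `-1432` and the `2058 <global_symbol>` that `objdump`
+printed in the disassembly.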
+
+> NOTE: We could try to calculate the address of `global_symbol` as the
+> linker did, but it's a little bit complicated because we also linked the file
+> with the standard library and the startup files, which adds the `crt` files
+> on top of ours. That means we get more code than what we had in the assembly
+> file. If you want to see that, you can read the rest of the output of the
+> command, or even try with `--disassemble-all` and calculate the symbol
+> address by hand. Good luck.
+
+#### More sections
+
+If you want to review some simple things, like a string section, you can use
+`readelf` for that. The `-p` flag (equivalent to `--string-dump=`) displays the
+contents of the section as strings. You can read the `.comment` section that
+way:
+
+``` asdf
+$ riscv64-linux-gnu-readelf -p .comment b.o
+
+String dump of section '.comment':
+  [     1]  GCC: (Debian 10.2.1-6) 10.2.1 20210110
+```
+
+This is what the compiler inserted with `.ident` in the assembly file.
+We have it in the binary too.
+
+In other distros the output is a little bit different. Look at the output we
+have in Guix:
+
+``` asdf
+String dump of section '.comment':
+  [     1]  GCC: (GNU) 11.2.0
+```
+
+### Conclusion
+
+So all of this was just to explain that ELF files are some kind of dual files
+that have two different goals at the same time. The executable one is kind of a
+picture of the memory state that can be used for loading that state in the
+memory, while the linking one just describes how different parts of the
+contents relate to each other and has tons of funny tricks to make the files
+relocatable, position independent and that kind of thing. Cool.
+
+There are still many fields of ELF we didn't talk about but I consider this
+introduction more than enough. Having a simple understanding of how the file
+is organized and what kind of information it has is probably enough for the
+things we are going to need.
+
+The proposed example shows that with the knowledge obtained from this short
+introduction we can dig a little bit into the files that result from a
+compilation and analyze their internals. That's mostly the work I'll need to do
+when I start combining compilers in a pipeline of death and destruction.
+
+If I ever need to dig deeper into something, I will.
+
+Anyway, I'm still unsure if I answered the question we left in the previous
+post[^cliff]:
+
+> Why is learning about ELF interesting if GCC generates assembly?
+
+Did I?
+
+[^cliff]: It was a good cliffhanger, though. 
-- 
cgit v1.2.3