From 57ebc576bf74ca887f9c77cb6b7771b9ff2b1843 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga
Date: Mon, 14 Mar 2022 23:52:34 +0100
Subject: NlNet project posts 0-2

---
 content/bootstrapGcc/02_elf.md | 595 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 595 insertions(+)
 create mode 100644 content/bootstrapGcc/02_elf.md

diff --git a/content/bootstrapGcc/02_elf.md b/content/bootstrapGcc/02_elf.md
new file mode 100644
index 0000000..c684c63
--- /dev/null
+++ b/content/bootstrapGcc/02_elf.md
@@ -0,0 +1,595 @@
+Title: ELF format — why not?
+Date: 2022-03-14
+Category:
+Tags: Bootstrapping GCC in RISC-V
+Slug: bootstrapGcc2
+Lang: en
+Summary:
+    Some introduction to ELF as we'll need to deal with this in the future.
+
+In the [previous post]({filename}01_internals.md) of the
+[series]({tag}Bootstrapping GCC in RISC-V) we introduced GCC and how it
+generates assembly code and we left a question unanswered: *"Why is learning
+about ELF interesting if GCC generates assembly?"*. In this post we are going
+to answer that question (not interesting) and maybe understand the very basics
+of the ELF file format (more interesting).
+
+
+### What's ELF
+
+ELF is a file format with two main goals:
+
+- Represent an executable file
+- Represent a linkable file
+
+Apart from that, ELF can also represent core dumps, but if you think about it,
+all of the possible options have something in common: they represent contents
+in memory. We can simply say ELF is a file format that acts as a picture of
+the state of the memory. In the case of the executables, the state will be
+loaded from the file, but in the case of the core dumps the state is obtained
+from the memory and dumped to a file.
+
+Linkable files are those files that can be combined with others to generate
+executables or shared objects, so they can also fit that definition because
+they are going to end up in memory anyway.
+
+For efficiency reasons, the ELF format has two separate views of the same
+contents:
+
+- The **Linking** view is based on sections and needs a *section header*.
+- The **Executable** view is based on segments and needs a *program header*.
+
+#### ELF header
+
+The ELF header is the only thing that has a fixed position in the file, at the
+beginning. The ELF header has information that defines how to identify the
+file, the machine, the endianness and that sort of thing, but it also says
+where the headers are located and gives the size of their entries and
+their entry count.
+
+It's not that interesting, honestly. The most important thing is that it points
+to the descriptions of both views (the headers) so we can check them.
+
+#### Linking view
+
+Based on sections, the linking view is the most detailed view of the file and
+it defines how the file should be linked with others in order to create an
+executable file.
+
+Sections, the basic unit of the linking view, are consecutive sequences of
+bytes that do not overlap.
+
+There are [different types of sections according to their possible contents and
+meaning](https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA.junk/sections.html),
+the most interesting are:
+
+- `SYMTAB` and `DYNSYM` hold a symbol table. The `DYNSYM` is for dynamic
+  linking symbols, while `SYMTAB` normally is used for static linking but may
+  contain both.
+- `STRTAB` holds a string table.
+- `RELA` contains relocation entries with addends and `REL` contains
+  relocations without addends.
+- `NOTE` contains some information about the file.
+- `HASH` contains a symbol hash table, necessary for dynamic linking.
+- `DYNAMIC` holds dynamic linking information.
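Going back to the ELF header for a second: the way it "says where the headers are" can be sketched in a few lines of C. This is only a minimal sketch, assuming a little-endian ELF64 file (which is what `elf64-littleriscv` is) whose first 64 bytes are already in a buffer. The byte offsets come from the ELF64 header layout; the names `elf64_tables`, `read_le` and `elf64_locate_tables` are made up for this illustration and don't belong to any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the ELF64 header fields that locate the
 * program header and the section header. */
struct elf64_tables {
    uint64_t phoff;      /* e_phoff: file offset of the program header */
    uint64_t shoff;      /* e_shoff: file offset of the section header */
    uint16_t phentsize;  /* e_phentsize: size of one program header entry */
    uint16_t phnum;      /* e_phnum: number of program header entries */
    uint16_t shentsize;  /* e_shentsize: size of one section header entry */
    uint16_t shnum;      /* e_shnum: number of section header entries */
};

/* Read an unsigned little-endian integer of `nbytes` bytes. */
static uint64_t read_le(const unsigned char *p, int nbytes)
{
    uint64_t value = 0;
    for (int i = nbytes - 1; i >= 0; i--)
        value = (value << 8) | p[i];
    return value;
}

/* Fill `out` from the first 64 bytes of a file.
 * Returns 0 on success, -1 if `hdr` is not a little-endian ELF64 header. */
static int elf64_locate_tables(const unsigned char *hdr,
                               struct elf64_tables *out)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    assert(hdr != NULL && out != NULL);
    if (memcmp(hdr, magic, sizeof magic) != 0)
        return -1;
    if (hdr[4] != 2 || hdr[5] != 1)  /* EI_CLASS = 64-bit, EI_DATA = LSB */
        return -1;

    out->phoff     = read_le(hdr + 32, 8);
    out->shoff     = read_le(hdr + 40, 8);
    out->phentsize = (uint16_t)read_le(hdr + 54, 2);
    out->phnum     = (uint16_t)read_le(hdr + 56, 2);
    out->shentsize = (uint16_t)read_le(hdr + 58, 2);
    out->shnum     = (uint16_t)read_le(hdr + 60, 2);
    return 0;
}
```

With those six numbers a program knows where both tables start and how to walk them: entry `i` of the section header would live at `shoff + i * shentsize`.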
+
+Each section also has a `name`, an `address` if it is supposed to appear in the
+memory of a running process, an `offset` that defines where in the file the
+section's contents appear, a `size`, and some extra data fields that all
+together form a section header entry.
+
+The section header entries are all located where the ELF header says, one after
+the other (like a C array of structures), so the programs just need to access
+that position in the file and read all the headers in a row. The contents of
+the sections are located throughout the file, where the section headers point.
+
+##### String section
+
+The string section (`STRTAB`) is one of the simplest. It contains all the
+strings of the file: the section and symbol names. It's simply a set of null
+terminated strings, written one after the other (it also starts with a null
+character but whatever).
+
+Anywhere in the file where we are supposed to get a string, what we get is an
+index that points to the first position in this section to read from. We should
+read from there until we reach a null character. For example in the following
+string section:
+
+```
+  \0 h e l l o \0 n a m e \0
+```
+
+If a name of a section says `1`, the actual name of the section is `hello`
+and if it says `7` it would be `name`. Also, if it says `9` it would be `me`;
+this trick can be used too.
+
+##### Symbol table
+
+The symbol table contains information needed to locate and relocate a program's
+symbolic definitions and references. The symbol table is formed as an array of
+symbol elements that are defined with a `name`, obviously a `value`, their
+`size`, some extra `info`, the index of the section header they relate to
+(`shndx`) and some `other` stuff.
+
+The `info` field manages the symbol's type (`OBJECT` for data, `FUNC` for
+function...) and binding attributes, which define the linking visibility and
+behavior of the symbol (local vs global...).
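Both the string lookup and the `info` field are easy to sketch in C. Take this as an illustration rather than as the way any real tool does it: `strtab_get`, `sym_binding` and `sym_type` are names invented here. The bit layout of `info` (binding in the high four bits, type in the low four) is the one from the ELF specification; on Linux, `<elf.h>` offers the `ELF64_ST_BIND` and `ELF64_ST_TYPE` macros for the same job.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the string that starts at `index` inside a string table.
 * Nothing is copied: strings are null-terminated in place, so a pointer
 * into the table is already a valid C string. */
static const char *strtab_get(const char *table, size_t table_size,
                              size_t index)
{
    assert(index < table_size);
    /* There must be a terminator somewhere between `index` and the end. */
    assert(memchr(table + index, '\0', table_size - index) != NULL);
    return table + index;
}

/* Symbol binding, high 4 bits of `info`: 0 = local, 1 = global, 2 = weak. */
static unsigned sym_binding(unsigned char info)
{
    return info >> 4;
}

/* Symbol type, low 4 bits of `info`: 1 = OBJECT, 2 = FUNC, 3 = SECTION... */
static unsigned sym_type(unsigned char info)
{
    return info & 0x0f;
}
```

With the table from the example above, `strtab_get(table, size, 9)` gives `me`, and an `info` byte of `0x12` describes a global (`1`) function (`2`).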
+
+The `value` can be interpreted in several ways too, depending on the type of
+the symbol you are dealing with. But that's not really relevant for us at the
+moment.
+
+##### Relocation
+
+According to the ELF documentation I got from somewhere I don't really
+remember:
+
+> The relocation is the process of connecting symbolic references with symbolic
+> definitions.
+
+I hope it's more explanatory for you than it is for me, because I don't have a
+clue what that is supposed to mean.
+[Wikipedia](https://en.wikipedia.org/wiki/Relocation_(computing)) does a **much
+better** job in the specifics right here:
+
+> Relocation is the process of assigning load addresses for position-dependent
+> code and data of a program and adjusting the code and data to reflect the
+> assigned addresses.
+
+If this doesn't really help, you have a really good example later, but we can
+basically say that it's a way to adjust the code to point to the correct
+addresses, at linking or loading, or even execution, time.
+
+ELF files have, as we said, sections that let us define relocations. These will
+point to some parts of the file and tell the linker or the loader that those
+positions of the file must be reprocessed.
+
+There are two types of relocation sections and in both of them the relocation
+section is an array of entries where each of them represents one relocation.
+In the simple one (`REL`) each relocation only contains an `offset` and an
+`info` word, which also includes the type of relocation to apply. The more
+complex one (`RELA`) is mostly the same but it includes an `addend`, a
+constant value to use in the calculation of the relocation.
+
+The calculation of the final addresses is specific to the ISA and the relocation
+type, because processors have different instruction formats and different ways
+to pack addresses in instructions.
+RISC-V has no way to pack a full address
+inside of an instruction, while x86 does, so they have to patch the
+instructions in a different way.
+
+
+##### Special sections
+
+Some sections have a special treatment according to their name, normally the
+ones that start with a dot. These you might have found in the past in assembly
+files, defined like `.data` (for data), `.rodata` (for read only data) or
+`.text` (for code).
+
+These are interesting to have in mind because they appear the same way they do
+in assembly, and we are going to disassemble some of them and play around with
+them.
+
+Other special sections like `.got` or `.dynamic` don't appear in assembly but
+they have a strong meaning in the resulting file. We are not going to deal with
+those today because we want to finish this post someday. If you need to deal
+with those I recommend you read ELF's documentation on special sections and
+the loading process.
+
+
+#### Executable view
+
+The executable view is another way to access the same contents, but with a
+different perspective. It's based on *segments* rather than *sections*.
+Segments are also pieces of the file, as sections are, but segments can contain
+one or more sections.
+
+Like in the linking view, the base units (sections for the linking view,
+segments for the executable view) are described in a header. The header of the
+executable view is called program header and it is, like the section header, a
+bunch of structures piled together, each describing one of the segments.
+
+The program header describes the position and size in the file of each of the
+segments but also some important information about them: how they are supposed
+to be loaded in the memory and where (virtual address and physical address),
+the type of the segment, and some more info.
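The idea that a segment is a piece of the file that contains one or more sections can be sketched like this. It is a simplified model made up for this post, not the real `Elf64_Phdr` layout: `struct segment` keeps only a few of the fields just mentioned and `segment_holds` is a hypothetical helper. The numeric values of the segment types and permission flags are the ones from the ELF specification.

```c
#include <assert.h>
#include <stdint.h>

/* Segment types (PT_LOAD, PT_DYNAMIC and PT_INTERP in the ELF spec). */
enum { SEG_LOAD = 1, SEG_DYNAMIC = 2, SEG_INTERP = 3 };

/* Permission flags (PF_X, PF_W and PF_R in the ELF spec). */
enum { SEG_X = 1, SEG_W = 2, SEG_R = 4 };

/* Simplified program header entry: only the fields we talk about here. */
struct segment {
    uint32_t type;    /* one of the SEG_* types */
    uint32_t flags;   /* bitwise OR of the permission flags */
    uint64_t offset;  /* where the segment starts in the file */
    uint64_t vaddr;   /* where it should appear in memory */
    uint64_t filesz;  /* bytes it takes in the file */
    uint64_t memsz;   /* bytes it takes in memory */
};

/* Does the file range of a section fall completely inside the segment?
 * It only looks at file offsets, so sections that take no space in the
 * file (like `.bss`) are out of the scope of this sketch. */
static int segment_holds(const struct segment *seg,
                         uint64_t sec_offset, uint64_t sec_size)
{
    assert(seg != 0);
    return sec_offset >= seg->offset
        && sec_offset + sec_size <= seg->offset + seg->filesz;
}
```

Running `segment_holds` for every section against every segment is more or less how you would rebuild a section to segment mapping by hand.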
+
+The most interesting segment types are the following:
+
+- `LOAD` is used for loadable segments; the other fields of the segment
+  describe the position and the size this segment will have in memory.
+- `DYNAMIC` are segments that have some dynamic linking information. It has to
+  contain the `.dynamic` section.
+- `INTERP` gives the location and size of a null-terminated path name to invoke
+  as an *interpreter*. Interpreter in this context usually means a dynamic
+  linker, which will be called instead of loading this file to memory; the
+  dynamic linker will then be the one that loads the parts of the file it
+  considers necessary.
+
+You can see how segments are interesting for loading the file in the memory,
+that is, they are mostly interesting for executable files or shared objects.
+
+#### Segments vs Sections
+
+If you want to have a clear idea about the difference between segments and
+sections, you can consider a file with multiple sections: `.text`, `.rodata`
+and `.data`.
+
+A file that contains those sections can be understood from a linking
+perspective as a file that has some code (`.text`), read-only data (`.rodata`)
+and read-write data (`.data`). Each of those parts must be managed in a
+different way by the linker, but the reality is that the program loader doesn't
+really care about some of the differences between them.
+
+The code and the read-only data are loaded in the memory in the same way, with
+read and execute permission but no write permission, so the executable view can
+put both sections in the same segment, and make the loader's life easier.
+
+Also, the linker doesn't really care about how the memory is loaded so the
+section header does not hold that information. It does care about the section's
+goals though, as it will need to put them together in order during the linking.
+On the other hand, the loader is not really interested in what the goal of
+the contents of the file is, but only in what to do with those contents, so it
+only has that information.
+
+
+### So, why do we need to learn it?
+
+We don't really need to learn it very deeply, just learn how it works in a
+high-level way and make sure we are able to read it with the tools we have
+available. The good news for you is if the reasons I give you are not good
+enough it doesn't really matter because you already learned it[^gotcha]. Continue
+reading and you'll realize how much you understand now.
+
+[^gotcha]: Ha! Gotcha!
+
+First, let me tell you a personal story. I have previous experience working
+with assembly, but only in small devices that have two memories, one for data
+and another for code (Harvard Architecture). In those small devices you often
+don't really need to think about how the code and the data are mapped to memory
+because your programs are small and the separation is clear. Computers are a
+different thing, and I have had issues understanding this whole assembly thing.
+
+Computers store both code and data in the same memory, the main memory (Von
+Neumann Architecture), and they normally have memory segmentation, paging,
+memory management units and all that kind of stuff, because there are many
+processes running and they want to separate one from the other. That forces us
+to think about how the code and the data are mapped to the memory. Also, modern
+operating systems use dynamic linkers, which are not available in small
+devices, and we need to be able to deal with that amount of complexity.
+
+ELF allows us to do all that, because it was born for that. ELF is a
+distillation of many of the ideas from System V Unix, that include exactly all
+I mentioned. It's a great way to understand how memory, linking and processes
+work in a *modern* operating system. This is why you need to learn it, at least
+a little.
+It makes you a cultivated person, which is always good[^system-v].
+
+[^system-v]: It also makes you understand the complexities of the system so you
+    can criticize it. Changing the world requires learning about it first.
+
+#### The specifics
+
+As I'm sure you are not totally satisfied with the answer of being a cultivated
+person[^some-of-you], let me go for some specifics.
+
+[^some-of-you]: For those who really are: that's the right attitude in life.
+    High five. You can read the whole section still, it has interesting points I
+    think.
+
+So in this project GCC is not the only software we are dealing with: GNU
+Binutils and TinyCC are part of the party too, and I need to make them fit
+together in the best way possible. In those I need to make sure the
+relocations, formats and other things work properly, following the RISC-V ABI
+specification for ELF. That might be a point of failure, so being prepared, at
+least at a high level, is interesting.
+
+Of course, we need to analyze GCC's output too, and in order to do that we need
+to make sure we know what it means. We already saw that some ELF sections are
+directly mentioned in the assembly, so learning ELF is a good way to understand
+what they mean. They are really an OS related thing and ELF only
+reflects it, but learning them from the ELF perspective probably makes the path
+easier.
+
+Relocations are a huge point in all this mess, because they are machine
+specific (instructions are too, but those I expect us to know already), and
+they are something I didn't need to research in all the RISC-V adventures I had
+last year. I have to do it sometime.
+
+In general, there are many sharp edges where we can get hurt, so it's better if
+we wear gloves.
+
+### Tools
+
+For all this process there are a couple of tools that were designed to help.
+GNU Binutils has many of them but we are going to focus on two, as they are
+more than enough for many use cases: `objdump` and `readelf`.
+
+The example below uses both of them to analyze a piece of code and its
+compilation result. As you'll see, the main problem they have is their output:
+it's not always clear, the formatting is a little bit chaotic, it's not
+obvious at all how to get it right and it's really hard to use it
+programmatically.
+
+There is a really cool tool you should investigate though, called GNU Poke,
+that is designed specifically to fight against those issues. I recommend you
+[take a look at it](https://www.gnu.org/software/poke/).
+
+### Example
+
+Starting from a very simple C file we can follow a really interesting process
+and understand some of the ELF internals:
+
+``` c
+long global_symbol;
+
+int main() {
+    return global_symbol != 0;
+}
+```
+
+We compile it to assembly with:
+
+``` asdf
+$ riscv64-linux-gnu-gcc -S b.c -O0
+```
+
+These are the contents of the assembly file:
+
+``` asm
+    .file   "b.c"
+    .option pic
+    .text
+    .globl  global_symbol
+    .bss
+    .align  3
+    .type   global_symbol, @object
+    .size   global_symbol, 8
+global_symbol:
+    .zero   8
+    .text
+    .align  1
+    .globl  main
+    .type   main, @function
+main:
+    addi    sp,sp,-16
+    sd      s0,8(sp)
+    addi    s0,sp,16
+    lla     a5,global_symbol
+    ld      a5,0(a5)
+    snez    a5,a5
+    andi    a5,a5,0xff
+    sext.w  a5,a5
+    mv      a0,a5
+    ld      s0,8(sp)
+    addi    sp,sp,16
+    jr      ra
+    .size   main, .-main
+    .ident  "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
+    .section    .note.GNU-stack,"",@progbits
+```
+
+Assemble the file with `as`:
+
+``` asdf
+$ riscv64-linux-gnu-as b.s -o b.o
+```
+
+And this is what we get in `b.o`. The `.text` section contains the following:
+
+``` asm
+$ riscv64-linux-gnu-objdump --disassemble b.o
+
+b.o:     file format elf64-littleriscv
+
+
+Disassembly of section .text:
+
+0000000000000000 <main>:
+   0:   ff010113    addi    sp,sp,-16
+   4:   00813423    sd      s0,8(sp)
+   8:   01010413    addi    s0,sp,16
+   c:   00000797    auipc   a5,0x0
+  10:   00078793    mv      a5,a5
+  14:   0007b783    ld      a5,0(a5) # c
+  18:   00f037b3    snez    a5,a5
+  1c:   0ff7f793    andi    a5,a5,255
+  20:   0007879b    sext.w  a5,a5
+  24:   00078513    mv      a0,a5
+  28:   00813403    ld      s0,8(sp)
+  2c:   01010113    addi    sp,sp,16
+  30:   00008067    ret
+```
+
+### Relocations
+
+There are some relocations!
+
+``` asdf
+$ riscv64-linux-gnu-objdump b.o -r
+
+b.o:     file format elf64-littleriscv
+
+RELOCATION RECORDS FOR [.text]:
+OFFSET           TYPE                  VALUE
+000000000000000c R_RISCV_PCREL_HI20    global_symbol
+000000000000000c R_RISCV_RELAX         *ABS*
+0000000000000010 R_RISCV_PCREL_LO12_I  .L0
+0000000000000010 R_RISCV_RELAX         *ABS*
+```
+
+But in order to understand those relocations properly we need to check the
+value of the symbols too:
+
+``` asdf
+$ riscv64-linux-gnu-objdump -t b.o
+
+b.o:     file format elf64-littleriscv
+
+SYMBOL TABLE:
+0000000000000000 l    df *ABS*            0000000000000000 b.c
+0000000000000000 l    d  .text            0000000000000000 .text
+0000000000000000 l    d  .data            0000000000000000 .data
+0000000000000000 l    d  .bss             0000000000000000 .bss
+0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
+000000000000000c l       .text            0000000000000000 .L0
+0000000000000000 l    d  .comment         0000000000000000 .comment
+0000000000000000 g     O .bss             0000000000000008 global_symbol
+0000000000000000 g     F .text            0000000000000034 main
+```
+
+If you pay attention to the offsets of those relocations (`0x0c` and `0x10`)
+they exactly match the instructions `auipc a5, 0x0` and `mv a5, a5` and those
+are expanded from the `lla a5, global_symbol` (load local address)
+pseudoinstruction from the assembly.
+
+The `mv` is not really a `mv`. `mv` is a pseudoinstruction too, that should be
+expanded to an `addi a5, a5, 0`. `objdump` is playing with us, making the
+opposite conversion so we can read it better, but in fact it is tricking us.
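If you don't trust me (you shouldn't), you can decode the word at offset `0x10` by hand. The following is a small sketch, assuming the standard RISC-V I-type layout (opcode in bits 0 to 6, `rd` in 7 to 11, `funct3` in 12 to 14, `rs1` in 15 to 19 and a signed 12-bit immediate in 20 to 31). The names `itype`, `decode_itype` and `is_addi` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a RISC-V I-type instruction. */
struct itype {
    unsigned opcode;  /* bits 0-6 */
    unsigned rd;      /* bits 7-11, destination register number */
    unsigned funct3;  /* bits 12-14 */
    unsigned rs1;     /* bits 15-19, source register number */
    int32_t  imm;     /* bits 20-31, sign-extended */
};

static struct itype decode_itype(uint32_t insn)
{
    struct itype d;

    /* Non-compressed 32-bit encodings always end in the bits 11. */
    assert((insn & 0x3) == 0x3);

    d.opcode = insn & 0x7f;
    d.rd     = (insn >> 7) & 0x1f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1    = (insn >> 15) & 0x1f;
    d.imm    = (int32_t)((insn >> 20) & 0xfff);
    if (d.imm & 0x800)      /* bit 11 set: the immediate is negative */
        d.imm -= 0x1000;
    return d;
}

/* `addi` is opcode 0x13 (OP-IMM) with funct3 = 0. */
static int is_addi(struct itype d)
{
    return d.opcode == 0x13 && d.funct3 == 0;
}
```

Feeding it `0x00078793` gives an `addi` with `rd = 15`, `rs1 = 15` and `imm = 0`, and register number 15 is `a5` in the RISC-V calling convention: exactly the `addi a5, a5, 0` that `objdump` disguised as a `mv`.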
+
+The `auipc` + `addi` couple in RISC-V appears pretty often, because it's the
+method it has to load addresses in memory. The first instruction, `auipc`, adds
+a high part of an immediate to the program counter and stores the result in a
+register; the `addi` then adds another, in this case low, immediate to the
+register, i.e. they make a `x[reg] = pc + immediate` operation in two steps:
+`x[reg] = pc + hi20(immediate)` followed by `x[reg] = x[reg] + lo12(immediate)`.
+
+As we have relocations in both `auipc` and `addi` this means their `0` values
+(the immediates) are going to be overwritten with something else at linking
+time, and that's when RISC-V has something to say. All the relocations we can
+see are RISC-V specific, and you can read about them in [RISC-V ABI
+Specification](https://github.com/riscv-non-isa/riscv-elf-psabi-doc).
+
+In our case we have some really simple ones, the easiest to understand (what a
+coincidence, huh?):
+
+> `R_RISCV_PCREL_HI20`: High 20 bits of 32-bit PC-relative reference,
+> `%pcrel_hi(symbol)`. The formula is: `S+A-P` [but only obtains the highest 20
+> bits].
+
+> `R_RISCV_PCREL_LO12_I`: Low 12 bits of a 32-bit PC-relative,
+> `%pcrel_lo(address of %pcrel_hi)`, the addend must be 0. The formula is:
+> `S-P` [but it only obtains the lowest 12 bits].
+
+Both the `HI20` and the `LO12` have a similar formula, this is the meaning of
+the elements in the formula:
+
+- `S`: Address of the symbol
+- `A`: Addend of the relocation
+- `P`: Position of the relocation
+
+If you match their formulas with the description of what we just said about how
+`auipc` + `addi` couples work, you can easily understand the formulas and
+their meaning. We are not going to do it, do something yourself!
+
+The other relocation:
+
+> `R_RISCV_RELAX`: Instruction can be relaxed, paired with a normal relocation
+> at the same address.
+
+Is an addition our example doesn't use but it could.
+The `R_RISCV_RELAX`
+basically means that if the relocation it points at is not needed it can be
+discarded. And when does that happen? Easy, when we can get `global_symbol`'s
+address with only one of them, we can remove the other instruction from the
+program.
+
+#### Relocation resolution
+
+If we link the file and generate an executable, we can see the final value
+those zeroes get.
+
+``` asdf
+$ riscv64-linux-gnu-gcc b.o -o b.out
+```
+We link it like this because `ld` needs a lot of input fields and we don't want
+to set them all by hand, but you can do it with `ld` if you feel like it.
+
+``` asdf
+$ riscv64-linux-gnu-objdump --disassemble b.out
+...
+00000000000005e4 <main>:
+ 5e4:   ff010113    addi    sp,sp,-16
+ 5e8:   00813423    sd      s0,8(sp)
+ 5ec:   01010413    addi    s0,sp,16
+ 5f0:   00002797    auipc   a5,0x2
+ 5f4:   a6878793    addi    a5,a5,-1432 # 2058 <global_symbol>
+ 5f8:   0007b783    ld      a5,0(a5)
+ 5fc:   00f037b3    snez    a5,a5
+ 600:   0ff7f793    andi    a5,a5,255
+ 604:   0007879b    sext.w  a5,a5
+ 608:   00078513    mv      a0,a5
+ 60c:   00813403    ld      s0,8(sp)
+ 610:   01010113    addi    sp,sp,16
+ 614:   00008067    ret
+...
+```
+
+There you see the relocation was resolved (`0x5f0` and `0x5f4`) by the linker
+and the final values have been added. `objdump` is intelligent enough to tell
+us where those instructions are pointing (it says `2058 <global_symbol>`). Just
+to make sure we can search in the symbol table for the `global_symbol`:
+
+``` asdf
+$ riscv64-linux-gnu-objdump -t b.out | grep global_symbol
+0000000000002058 g     O .bss   0000000000000008 global_symbol
+```
+
+> NOTE: We could try to calculate the address of the `global_symbol` as the
+> linker did, but it's a little bit complicated because we also linked the file
+> with the standard library and the startup files, which adds the `crt` files
+> on top of the file. The reality is that we get more code than what we had in
+> the assembly file. If you want to see that, you can see the rest of the output
+> of the command, or even try with `--disassemble-all` and calculate the symbol
+> address by hand. Good luck.
+
+#### More sections
+
+If you want to review some simple things, like a string section, you can use
+`readelf` for that. The `-p` flag (equivalent to `--string-dump=`) displays the
+contents of the section as strings. You can read the `.comment` section that
+way:
+
+``` asdf
+$ riscv64-linux-gnu-readelf -p .comment b.o
+
+String dump of section '.comment':
+  [     1]  GCC: (Debian 10.2.1-6) 10.2.1 20210110
+```
+
+This is what the compiler inserted with `.ident` in the assembly file.
+We have it in the binary too.
+
+In other distros the output is a little bit different.
+Look at the output we have
+in Guix:
+
+``` asdf
+String dump of section '.comment':
+  [     1]  GCC: (GNU) 11.2.0
+```
+
+### Conclusion
+
+So this whole thing was just to explain that ELF files are some kind of dual
+files that have two different goals at the same time. The executable one is kind
+of a picture of the memory state that can be used for loading that state in the
+memory, while the linking one just describes how different parts of the
+contents relate to each other and has tons of funny tricks to make the files
+relocatable, position independent and that kind of thing. Cool.
+
+There are still many fields of ELF we didn't talk about but I consider this
+introduction more than enough. Having a simple understanding about how the
+file is organized and what kind of information it has is probably enough for the
+things we are going to need.
+
+The proposed example shows that with the knowledge obtained by this short
+introduction we can dig a little bit into the files that result from a
+compilation and analyze their internals. That's mostly the work I'll need to do
+when I start combining compilers in a pipeline of death and destruction.
+
+If I ever need to dig deeper into something, I will.
+
+Anyway, I'm still unsure if I answered the question we left in the previous
+post[^cliff]:
+
+> Why is learning about ELF interesting if GCC generates assembly?
+
+Did I?
+
+[^cliff]: It was a good cliffhanger, though.
-- 
cgit v1.2.3