Title: ELF format — why not?
Date: 2022-03-14
Category:
Tags: Bootstrapping GCC in RISC-V
Slug: bootstrapGcc2
Lang: en
Summary: Some introduction to ELF as we'll need to deal with this in the future.

In the [previous post]({filename}01_internals.md) of the
[series]({tag}Bootstrapping GCC in RISC-V) we introduced GCC and how it
generates assembly code and we left a question unanswered: *"Why is learning
about ELF interesting if GCC generates assembly?"*.

In this post we are going to answer that question (not interesting) and maybe
understand the very basics of the ELF file format (more interesting).

### What's ELF

ELF is a file format with two main goals:

- Represent an executable file
- Represent a linkable file

Apart from that, ELF can also represent core dumps. If you think about it, all
of these options have something in common: they represent the contents of
memory.

We can simply say ELF is a file format that acts as a picture of the state of
the memory. In the case of executables, the state will be loaded from the
file, but in the case of core dumps the state is obtained from the memory and
dumped to a file. Linkable files are those files that can be combined with
others to generate executables or shared objects, so they also fit that
definition because they are going to end up in memory anyway.

For efficiency reasons, the ELF format has two separate views of the same
contents:

- The **Linking** view is based on sections and needs a *section header*.
- The **Executable** view is based on segments and needs a *program header*.

#### ELF header

The ELF header is the only thing that has a fixed position in the file, at the
beginning. The ELF header has information that defines how to identify the
file, the machine, the endianness and that sort of thing, but it also says
where the headers are located and gives the size of their entries and their
entry count. It's not that interesting, honestly.
The most important thing is that it points to the descriptions of both views
(the headers) so we can check them.

#### Linking view

Based on sections, the linking view is the most detailed view of the file and
it defines how the file should be linked with others in order to create an
executable file.

Sections, the basic unit of the linking view, are consecutive sequences of
bytes that do not overlap. There are [different types of sections according to
their possible contents and
meaning](https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA.junk/sections.html),
the most interesting are:

- `SYMTAB` and `DYNSYM` hold a symbol table. `DYNSYM` is for dynamic linking
  symbols, while `SYMTAB` is normally used for static linking but may contain
  both.
- `STRTAB` holds a string table.
- `RELA` contains relocation entries with addends and `REL` contains
  relocations without addends.
- `NOTE` contains some information about the file.
- `HASH` contains a symbol hash table, necessary for dynamic linking.
- `DYNAMIC` holds dynamic linking information.

Each section also has a `name`, an `address` if it is supposed to appear in
the memory of a running process, an `offset` that defines where in the file
the section's contents appear, a `size`, and some extra data fields that all
together form a section header entry.

The section header entries are all located where the ELF header says, one
after the other (like a C array of structures), so programs just need to go to
that position in the file and read all the entries in a row.

The contents of the sections are located throughout the file, where the
section header entries point.

##### String section

The string section (`STRTAB`) is one of the simplest. It contains the strings
the file needs: the section and symbol names. It's simply a set of
null-terminated strings, written one after the other (it also starts with a
null character but whatever).
Anywhere in the file where we are supposed to get a string, what we get is an
index that points to the first position in this section to read from. We
should read from there until we reach a null character.

For example in the following string section:

```
\0 h e l l o \0 n a m e \0
```

If the name of a section says `1`, the actual name of the section is `hello`
and if it says `7` it would be `name`. Also, if it says `9` it would be `me`;
this trick can be used too.

##### Symbol table

The symbol table contains information needed to locate and relocate a
program's symbolic definitions and references.

The symbol table is formed as an array of symbol elements that are defined
with a `name`, obviously a `value`, their `size`, some extra `info`, the index
of the section header they relate to (`shndx`) and some `other` stuff.

The `info` field holds the symbol's type (`OBJECT` for data, `FUNC` for
function...) and binding attributes, which define the linking visibility and
behavior of the symbol (local vs global...).

The `value` can be interpreted in several ways too, depending on the type of
the symbol you are dealing with. But that's not really relevant for us at the
moment.

##### Relocation

According to the ELF documentation I got from somewhere I don't really
remember:

> The relocation is the process of connecting symbolic references with symbolic
> definitions.

I hope it's more explanatory for you than it is for me, because I don't have a
clue of what that is supposed to mean.

[Wikipedia](https://en.wikipedia.org/wiki/Relocation_(computing)) does a
**much better** job on the specifics right here:

> Relocation is the process of assigning load addresses for position-dependent
> code and data of a program and adjusting the code and data to reflect the
> assigned addresses.
If this doesn't really help, you have a really good example later, but we can
basically say that it's a way to adjust the code to point to the correct
addresses at linking, loading or even execution time.

ELF files have, as we said, sections that let us define relocations. These
point to some parts of the file and tell the linker or the loader that those
positions of the file must be reprocessed.

There are two types of relocation sections and in both of them the relocation
section is an array of entries where each of them represents one relocation.
In the simple one (`REL`) each relocation only contains an `offset` and an
`info` word, which encodes the symbol the relocation refers to and the type of
relocation to apply. The more complex one (`RELA`) is mostly the same but it
includes an `addend`, which holds a constant value to use in the calculation
of the relocation.

The calculation of the final addresses is specific to the ISA and the
relocation type, because processors have different instruction formats and
different ways to pack addresses in instructions. RISC-V has no way to pack a
full address inside of an instruction, while x86 does, so they have to patch
the instructions in a different way.

##### Special sections

Some sections get a special treatment according to their name, normally the
ones that start with a dot. You might have found these in the past in assembly
files, defined like `.data` (for data), `.rodata` (for read-only data) or
`.text` (for code).

These are interesting to keep in mind because they appear the same way they do
in assembly, and we are going to disassemble some of them and play around with
them.

Other special sections like `.got` or `.dynamic` don't appear in assembly but
they have a strong meaning in the resulting file. We are not going to deal
with those today because we want to finish this post someday. If you need to
deal with those I recommend you read ELF's documentation on special sections
and the loading process.
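The index lookup described in the string section part above is easy to sketch in Python. This is only an illustration, not how any particular tool implements it: `read_name` is a hypothetical helper, and `strtab` is the toy table from the example rather than bytes read from a real file.

```python
def read_name(strtab: bytes, index: int) -> str:
    """Return the null-terminated string that starts at `index` in `strtab`."""
    end = strtab.index(b"\0", index)  # first null byte at or after `index`
    return strtab[index:end].decode("ascii")


# The toy string section from the example: \0 h e l l o \0 n a m e \0
strtab = b"\0hello\0name\0"

for index in (1, 7, 9):
    print(index, read_name(strtab, index))
```

Index `0` lands on the leading null character, so it yields an empty name, which is a handy side effect of starting the section with a null byte.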
#### Executable view

The executable view is another way to access the same contents, but with a
different perspective. It's based on *segments* rather than *sections*.
Segments are also pieces of the file, as sections are, but segments can
contain one or more sections.

As in the linking view, the base units (sections for the linking view,
segments for the executable view) are described in a header. The header of the
executable view is called the program header and it is, like the section
header, a bunch of structures piled together, each describing one of the
segments.

The program header describes the position and size in the file of each of the
segments but also some important information about them: how they are supposed
to be loaded in memory and where (virtual address and physical address), the
type of the segment, and some more info.

The most interesting segment types are the following:

- `LOAD` is used for loadable segments. The other fields of the entry describe
  the position and the size this segment will have in memory.
- `DYNAMIC` segments hold dynamic linking information. They have to contain
  the `.dynamic` section.
- `INTERP` gives the location and size of a null-terminated path name to
  invoke as an *interpreter*. Interpreter in this context usually means a
  dynamic linker, which is called instead of loading this file into memory
  directly; the dynamic linker is then the one that loads the parts of the
  file it considers necessary.

You can see how segments are interesting for loading the file in memory, that
is, they are mostly interesting for executable files or shared objects.

#### Segments vs Sections

If you want to have a clear idea about the difference between segments and
sections, you can consider a file with multiple sections: `.text`, `.rodata`
and `.data`.
A file that contains those sections can be understood from a linking
perspective as a file that has some code (`.text`), read-only data (`.rodata`)
and read-write data (`.data`). Each of those parts must be managed in a
different way by the linker, but the reality is that the program loader
doesn't really care about some of the differences between them. The code and
the read-only data are loaded in memory in the same way, with read and execute
permission but no write permission, so the executable view can put both
sections in the same segment, and make the loader's life easier.

Also, the linker doesn't really care about how the memory is loaded, so the
section header does not hold that information. It does care about the
sections' goals though, as it will need to put them together in order during
linking. On the other hand, the loader is not really interested in what the
goal of the contents of the file is, only in what to do with those contents,
so it only has that information.

### So, why do we need to learn it?

We don't really need to learn it very deeply, just learn how it works in a
high-level way and make sure we are able to read it with the tools we have
available.

The good news for you is that if the reasons I give you are not good enough it
doesn't really matter, because you already learned[^gotcha]. Continue reading
and you'll realize how much you understand now.

[^gotcha]: Ha! Gotcha!

First, let me tell you a personal story. I have previous experience working
with assembly, but only in small devices that have two memories, one for data
and another for code (Harvard architecture). In those small devices you often
don't really need to think about how the code and the data are mapped to
memory because your programs are small and the separation is clear. Computers
are a different thing, and I have had issues understanding this whole assembly
thing.
Computers store both code and data in the same memory, the main memory (Von
Neumann architecture), and they normally have memory segmentation, paging,
memory management units and all that kind of stuff, because there are many
processes running and they want to separate one from the other. That forces us
to think about how the code and the data are mapped to memory.

Also, modern operating systems use dynamic linkers, which are not available in
small devices, and we need to be able to deal with that amount of complexity.

ELF allows us to do all that, because it was born for it. ELF is a
distillation of many of the ideas from System V Unix, which include exactly
everything I mentioned. It's a great way to understand how memory, linking and
processes work in a *modern* operating system.

This is why you need to learn it, at least a little. It makes you a cultivated
person, which is always good[^system-v].

[^system-v]: It also makes you understand the complexities of the system so
    you can criticize it. Changing the world requires learning about it first.

#### The specifics

As I'm sure you are not totally satisfied with the answer of being a
cultivated person[^some-of-you], let me go for some specifics.

[^some-of-you]: For those that really are: that's the right attitude in life.
    High five. You can still read the whole section, it has interesting
    points, I think.

In this project GCC is not the only software we are dealing with: GNU Binutils
and TinyCC are part of the party too, and I need to make them fit together in
the best way possible. In those I need to make sure the relocations, formats
and other things work properly, following the RISC-V ABI specification for
ELF. That might be a point of failure, so being prepared, at least at a high
level, is interesting.

Of course, we need to analyze GCC's output too, and in order to do that we
need to make sure we know what it means.
We already saw that some ELF sections are directly mentioned in the assembly,
so learning ELF is a good way to understand what they mean. They are really an
OS-related thing and ELF only reflects it, but learning them from the ELF
perspective probably makes the path easier.

Relocations are a huge point in all this mess, because they are machine
specific (instructions are too, but those I expect us to know already), and
they are something I didn't need to research in all the RISC-V adventures I
had last year. I have to do it sometime.

In general, there are many sharp edges where we can get hurt, so it's better
if we wear gloves.

### Tools

For all this process there are a couple of tools that were designed to help.
GNU Binutils has many of them but we are going to focus on two, as they are
more than enough for many use cases: `objdump` and `readelf`. The example
below uses both of them to analyze a piece of code and its compilation result.

As you'll see, the main problem they have is their output: it's not always
clear, the formatting is a little bit chaotic, it's not obvious at all how to
read it right and it's really hard to use it procedurally.

There is a really cool tool you should investigate though, called GNU Poke,
that is designed specifically to fight against those issues. I recommend you
[take a look at it](https://www.gnu.org/software/poke/).
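When the output of those tools is too awkward to consume from a program, the fixed position of the ELF header makes it easy to read by hand. The following is a minimal Python sketch, assuming a 64-bit little-endian file (like the `elf64-littleriscv` objects used in this post). `parse_elf64_header` is a made-up helper name, the field names follow the usual `Elf64_Ehdr` structure, and the values packed into `fake` are invented just to exercise the parser.

```python
import struct

# Elf64_Ehdr for a little-endian file: 16 identification bytes followed by
# type, machine, version, entry, phoff, shoff, flags and six 16-bit fields.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
FIELDS = ("e_ident", "e_type", "e_machine", "e_version", "e_entry",
          "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
          "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx")


def parse_elf64_header(data: bytes) -> dict:
    """Decode the ELF header found at the very beginning of `data`."""
    header = dict(zip(FIELDS, ELF64_HEADER.unpack_from(data, 0)))
    ident = header["e_ident"]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:  # ELFCLASS64 and ELFDATA2LSB
        raise ValueError("this sketch only handles 64-bit little-endian files")
    return header


# Synthetic header (invented values, not a real file): a relocatable
# object (e_type = 1) for RISC-V (e_machine = 243) with 11 section headers.
fake = ELF64_HEADER.pack(b"\x7fELF\x02\x01\x01" + bytes(9), 1, 243, 1,
                         0, 0, 0x3A8, 0, 64, 0, 0, 64, 11, 10)
header = parse_elf64_header(fake)
print(header["e_shnum"], "section headers at offset", hex(header["e_shoff"]))
print(header["e_phnum"], "program headers at offset", hex(header["e_phoff"]))
```

To try it on a real file, pass the first 64 bytes of the file to `parse_elf64_header` instead of `fake`. The `e_shoff` and `e_phoff` fields are the pointers to the two views described earlier.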
### Example

Starting from a very simple C file we can follow a really interesting process
and understand some of the ELF internals:

``` c
long global_symbol;

int main() {
    return global_symbol != 0;
}
```

We compile it to assembly with:

``` asdf
$ riscv64-linux-gnu-gcc -S b.c -O0
```

These are the contents of the assembly file:

``` asm
    .file   "b.c"
    .option pic
    .text
    .globl  global_symbol
    .bss
    .align  3
    .type   global_symbol, @object
    .size   global_symbol, 8
global_symbol:
    .zero   8
    .text
    .align  1
    .globl  main
    .type   main, @function
main:
    addi    sp,sp,-16
    sd      s0,8(sp)
    addi    s0,sp,16
    lla     a5,global_symbol
    ld      a5,0(a5)
    snez    a5,a5
    andi    a5,a5,0xff
    sext.w  a5,a5
    mv      a0,a5
    ld      s0,8(sp)
    addi    sp,sp,16
    jr      ra
    .size   main, .-main
    .ident  "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
    .section    .note.GNU-stack,"",@progbits
```

Assemble the file with `as`:

``` asdf
$ riscv64-linux-gnu-as b.s -o b.o
```

And this is what we get in `b.o`. The `.text` section contains the following:

``` asm
$ riscv64-linux-gnu-objdump --disassemble b.o

b.o:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <main>:
   0:   ff010113    addi    sp,sp,-16
   4:   00813423    sd      s0,8(sp)
   8:   01010413    addi    s0,sp,16
   c:   00000797    auipc   a5,0x0
  10:   00078793    mv      a5,a5
  14:   0007b783    ld      a5,0(a5) # c <.L0>
  18:   00f037b3    snez    a5,a5
  1c:   0ff7f793    andi    a5,a5,255
  20:   0007879b    sext.w  a5,a5
  24:   00078513    mv      a0,a5
  28:   00813403    ld      s0,8(sp)
  2c:   01010113    addi    sp,sp,16
  30:   00008067    ret
```

### Relocations

There are some relocations!

``` asdf
$ riscv64-linux-gnu-objdump b.o -r

b.o:     file format elf64-littleriscv

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE                  VALUE
000000000000000c R_RISCV_PCREL_HI20    global_symbol
000000000000000c R_RISCV_RELAX         *ABS*
0000000000000010 R_RISCV_PCREL_LO12_I  .L0
0000000000000010 R_RISCV_RELAX         *ABS*
```

But in order to understand those relocations properly we need to check the
value of the symbols too:

``` asdf
$ riscv64-linux-gnu-objdump -t b.o

b.o:     file format elf64-littleriscv

SYMBOL TABLE:
0000000000000000 l    df *ABS*            0000000000000000 b.c
0000000000000000 l    d  .text            0000000000000000 .text
0000000000000000 l    d  .data            0000000000000000 .data
0000000000000000 l    d  .bss             0000000000000000 .bss
0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
000000000000000c l       .text            0000000000000000 .L0
0000000000000000 l    d  .comment         0000000000000000 .comment
0000000000000000 g     O .bss             0000000000000008 global_symbol
0000000000000000 g     F .text            0000000000000034 main
```

If you pay attention to the offsets of those relocations (`0x0c` and `0x10`),
they exactly match the instructions `auipc a5, 0x0` and `mv a5, a5`, which are
expanded from the `lla a5, global_symbol` (load local address)
pseudoinstruction in the assembly.

The `mv` is not really a `mv`. `mv` is a pseudoinstruction too, which should
be expanded to an `addi a5, a5, 0`. `objdump` is playing with us, making the
opposite conversion so we can read it better, but in fact it's tricking us.

The `auipc` + `addi` couple appears pretty often in RISC-V, because it's the
method it has to load memory addresses into a register.
The first instruction, `auipc`, adds the high part of an immediate to the
program counter and stores the result in a register; the `addi` then adds
another immediate, in this case the low part, to that register. That is, they
perform an `x[reg] = pc + immediate` operation in two steps:
`x[reg] = pc + hi20(immediate)` followed by
`x[reg] = x[reg] + lo12(immediate)`.

As we have relocations in both `auipc` and `addi`, their `0` values (the
immediates) are going to be overwritten with something else at linking time,
and that's where RISC-V has something to say.

All the relocations we can see are RISC-V specific, and you can read about
them in the [RISC-V ABI
Specification](https://github.com/riscv-non-isa/riscv-elf-psabi-doc). In our
case we have some really simple ones, the easiest to understand (what a
coincidence, huh?):

> `R_RISCV_PCREL_HI20`: High 20 bits of 32-bit PC-relative reference,
> `%pcrel_hi(symbol)`. The formula is: `S+A-P` [but only obtains the highest 20
> bits].

> `R_RISCV_PCREL_LO12_I`: Low 12 bits of a 32-bit PC-relative,
> `%pcrel_lo(address of %pcrel_hi)`, the addend must be 0. The formula is:
> `S-P` [but it only obtains the lowest 12 bits].

Both the `HI20` and the `LO12` have a similar formula. This is the meaning of
the elements in the formulas:

- `S`: Address of the symbol
- `A`: Addend of the relocation
- `P`: Position of the relocation

If you match the formulas with what we just said about how `auipc` + `addi`
couples work, you can easily understand the formulas and their meaning. We are
not going to do it, do something yourself!

The other relocation:

> `R_RISCV_RELAX`: Instruction can be relaxed, paired with a normal relocation
> at the same address.

is an addition our example doesn't make use of, but it could. `R_RISCV_RELAX`
basically means that the linker is allowed to relax (simplify or even discard)
the instruction the paired relocation applies to, if it is not needed. And
when does that happen?
Easy: when we can get `global_symbol`'s address with only one of the
instructions, we can remove the other one from the program.

#### Relocation resolution

If we link the file and generate an executable, we can see the final value
those zeroes get.

``` asdf
$ riscv64-linux-gnu-gcc b.o -o b.out
```

We link it like this because `ld` needs a lot of input fields and we don't
want to set them all by hand, but you can do it with `ld` if you feel like it.

``` asdf
$ riscv64-linux-gnu-objdump --disassemble b.out
...
00000000000005e4 <main>:
 5e4:   ff010113    addi    sp,sp,-16
 5e8:   00813423    sd      s0,8(sp)
 5ec:   01010413    addi    s0,sp,16
 5f0:   00002797    auipc   a5,0x2
 5f4:   a6878793    addi    a5,a5,-1432 # 2058 <global_symbol>
 5f8:   0007b783    ld      a5,0(a5)
 5fc:   00f037b3    snez    a5,a5
 600:   0ff7f793    andi    a5,a5,255
 604:   0007879b    sext.w  a5,a5
 608:   00078513    mv      a0,a5
 60c:   00813403    ld      s0,8(sp)
 610:   01010113    addi    sp,sp,16
 614:   00008067    ret
...
```

There you see the relocations were resolved by the linker (at `0x5f0` and
`0x5f4`) and the final values have been filled in. `objdump` is intelligent
enough to tell us where those instructions are pointing (it says
`2058 <global_symbol>`).

Just to make sure, we can search the symbol table for `global_symbol`:

``` asdf
$ riscv64-linux-gnu-objdump -t b.out | grep global_symbol
0000000000002058 g     O .bss   0000000000000008 global_symbol
```

> NOTE: We could try to calculate the address of the `global_symbol` as the
> linker did, but it's a little bit complicated because we also linked the file
> with the standard library and the startup files, which adds the `crt` files
> on top of the file. We really get more code than what we had in the
> assembly file. If you want to see that, you can look at the rest of the
> output of the command, or even try with `--disassemble-all` and calculate
> the symbol address by hand. Good luck.

#### More sections

If you want to review some simple things, like a string section, you can use
`readelf` for that. The `-p` flag (equivalent to `--string-dump=`) displays
the contents of the section as strings. You can read the `.comment` section
that way:

``` asdf
$ riscv64-linux-gnu-readelf -p .comment b.o

String dump of section '.comment':
  [     1]  GCC: (Debian 10.2.1-6) 10.2.1 20210110
```

This is what the compiler inserted with `.ident` in the assembly file. We have
it in the binary too.

In other distros the output is a little bit different.
Look at the output we get in Guix:

``` asdf
String dump of section '.comment':
  [     1]  GCC: (GNU) 11.2.0
```

### Conclusion

So, all of this just to explain that ELF files are some kind of dual files
that have two different goals at the same time. The executable view is kind of
a picture of the memory state that can be used for loading that state into
memory, while the linking view describes how different parts of the contents
relate to each other and has tons of funny tricks to make the files
relocatable, position independent and that kind of thing.

Cool.

There are still many areas of ELF we didn't talk about but I consider this
introduction more than enough. Having a simple understanding of how the file
is organized and what kind of information it has is probably enough for the
things we are going to need.

The proposed example shows that with the knowledge obtained from this short
introduction we can dig a little bit into the files that result from a
compilation and analyze their internals. That's mostly the work I'll need to
do when I start combining compilers in a pipeline of death and destruction.

If I ever need to dig deeper into something, I will.

Anyway, I'm still unsure if I answered the question we left in the previous
post[^cliff]:

> Why is learning about ELF interesting if GCC generates assembly?

Did I?

[^cliff]: It was a good cliffhanger, though.
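As a postscript, the exercise left to the reader in the relocations section can be checked with a few lines of Python. This is a minimal sketch, not the linker's actual code: `pcrel_hi20` and `pcrel_lo12` are hypothetical helper names, the addend is assumed to be 0, and the addresses are the ones from the linked `b.out` in the example (`auipc` at `0x5f0`, `global_symbol` at `0x2058`). The `+ 0x800` is there because `addi` sign-extends its 12-bit immediate, so the high part has to be rounded to compensate.

```python
def pcrel_hi20(symbol: int, pc: int) -> int:
    """High part of symbol - pc, rounded so the signed low part fits in 12 bits."""
    return (symbol - pc + 0x800) >> 12


def pcrel_lo12(symbol: int, pc: int) -> int:
    """Signed 12-bit remainder that `addi` adds to the `auipc` result."""
    offset = symbol - pc
    return offset - (pcrel_hi20(symbol, pc) << 12)


S = 0x2058  # address of global_symbol in b.out
P = 0x5F0   # address of the auipc, which both relocations are relative to

hi = pcrel_hi20(S, P)
lo = pcrel_lo12(S, P)
print(hex(hi), lo)               # immediates for auipc and addi
print(hex(P + (hi << 12) + lo))  # address the pair ends up computing
```

The results match the disassembly: `auipc a5,0x2` and `addi a5,a5,-1432`, which together point at `0x2058`. Note that the low part is computed relative to the `auipc` address too, which is why its relocation points at `.L0` instead of at `global_symbol`.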