diff options
author | Ekaitz Zarraga <ekaitz@elenq.tech> | 2021-06-26 22:54:53 +0200 |
---|---|---|
committer | Ekaitz Zarraga <ekaitz@elenq.tech> | 2021-06-26 22:54:53 +0200 |
commit | 7e7a956916899a2f35a22ec12ae08cd56e12e52f (patch) | |
tree | 9d9e76934779d38c92531f7a9a8ebf7b72a57432 /content | |
parent | 0caec2aaba3446b4b23b91f2f07318c43359c196 (diff) |
Add machine code generation entry
Diffstat (limited to 'content')
-rw-r--r-- | content/machine-code-generation.md | 895 |
1 files changed, 895 insertions, 0 deletions
diff --git a/content/machine-code-generation.md b/content/machine-code-generation.md new file mode 100644 index 0000000..ebf2fe2 --- /dev/null +++ b/content/machine-code-generation.md @@ -0,0 +1,895 @@ +Title: Lessons learned on machine code generation +Date: 2021-06-16 +Category: +Tags: +Slug: machine-code-generation +Lang: en +Summary: + A summary of the lessons I learned about machine code generation during my + work at Lightening, Hex0 and all my recent research on compilers. + + +Machine code generation sounded like a weird magic to me half a year ago, I +swear, but now it doesn't look so disturbingly complicated. *Nothing in +computer science is that complicated, after all*. + + +1. [Basics](#basics) + 1. [Machine code is numbers](#what) + 1. [Demonstration](#demo) + 3. [Calling convention](#talk) + 4. [Memory protections](#protections) + 5. [Just-in-Time Compilation](#jit) + 1. [Example: Lightening: Guile's machine code generation + library](#lightening) +2. [Lessons learned](#problems) + 1. [Problem: Large immediates](#large-imm) + 1. [Solution: multiple instruction expansion](#multi-inst) + 2. [Solution: constant insertion](#constants) + 2. [Problem: Unknown addresses and offsets](#addr-off) + 1. [Solution: relocations](#relocs) + 1. [Example: C compilers](#relocs-c) + 2. [Example: Lightening](#relocs-lightening) + 3. [Problem: Long jumps](#jumps) + 1. [Solution: always insert the largest jump possible](#always-largest) + 1. [Optimization: pointer relaxation](#relaxation) + 2. [Example: relaxed global variable access in C + compilers](#relax-example) + 2. [Solution: Veneers](#veneer) + 1. [Example: Lightening's veneer system](#lightening-veneer) + 4. [Problem: Register Access](#reg-access) + 1. [Solution: use the stack](#stack) + 2. [Solution: controlled register access](#controlled-regs) +3. [Final thoughts](#final) + +### Basics {#basics} + +There are many contexts where you may need to generate machine code. If you are +writing a compiler, an assembler, a jit compiler... In the last months I've +been working on Lightening, a machine code generation library that powers +Guile's JIT Compilation, a RISC-V assembler and interpreter and Hex0, which was +introduced in the [previous post](https://ekaitz.elenq.tech/hex0.html), where I +needed to assemble a file by hand. + +All of those cases result in the same thing, even if they have different +conditions: we are generating machine code. + +In this post I'll try to talk about some issues that are generic and apply to +all the cases and others that are more specific to some of the projects I +mention. + +But first we need to clarify some stuff just in case. + +#### Machine code is numbers {#what} + +> Machine code is what people from the electronics world call "code". + +I know you know it, but let's refresh some things about computing we may have +forgotten thanks to all the efforts that [hide the +complexity](https://ekaitz.elenq.tech/hiding-the-complexity.html) of our +everyday business. + +Machine code instructions are basically blocks of bits your processor is +reading and interpreting. Those bit blocks encode all the information the +processor needs: the identifier of the instruction and its arguments. + +The identifier is normally known as *opcode*. The arguments can have many +different meanings, depending on the instruction so we are not getting into +that. The instructions normally alter the values of registers, so they need to +have identifiers for the source and destination registers, or literal values +that are introduced literally inside of the instruction (they are called +*immediates*). + +Let's put a simple RISC-V example here. Consider this assembly instruction: + + ::asm + addi a0, zero, 56 + +This thing you interpret as some assembly instruction that adds `56` to the +`zero` register and stores the result in the `a0` register, has to be encoded +in a way that the machine is able to understand. Better said, it is encoded in +a way that **you** can understand! The real instruction is a bunch of bits that +represent the same thing. + +RISC-V base ISA has a various instruction formats which depend on the goal of +the instruction. This one is from the `I` format, because it includes an +*immediate*. Read it and compare with the following: + +- First the *opcode*, `addi` for you, has a binary counterpart: `0010011`. 7 + bits for this instruction format. +- Then the destination register, `a0`, has an binary representation: `01010`. + There are 32 registers in RISC-V so each of them are represented by a 5 bit + value. +- There's some extra space for an opcode-like field called `funct3`: `000` +- Then there the source register, `zero`, which is: `00000`. Again 5 bits. +- And the *immediate* you are adding, `56`, which is just the binary + representation of `56`: `000000111000`. It's 12 bits wide for this + instruction format. + +Putting all together: + + ::asm + 000000111000 | 00000 | 000 | 01010 | 0010011 + +So this forms the following binary value: + + ::asm + 00000011100000000000010100010011 + +Or in hex: + + ::asm + 0x3800513 + +> Just for if you didn't realize, you just assembled an instruction by hand. + +That instruction we just created is going to be processed by the machine, +reading each of the fields and activating its circuits as it needs according to +the voltage levels those values represent. + +In this case, it's going to activate the ALU to add the numbers and all +that kind of things, but in other cases it may just change the value of the +program counter or whatever. All this is executed by the circuitry of the +device, right after it loads the instruction. + +That's for the machine, but for us, from the perspective of a programmer, +instructions are just numbers, as we just saw. + + +##### Demonstration {#demo} + +I purposely used *machine* to refer to the device that runs our instructions, +but we have to be more specific about it now. + +I'm going to talk specifically about modern (and common) microprocessors, +because other devices may have peculiarities that can sidetrack us too +hard[^harvard]. + +[^harvard]: One of those peculiarities is the [Harvard + Architecture](https://en.wikipedia.org/wiki/Harvard_architecture) that is not + going to let us make the fantastic trick I'm going to show you now. Harvard + Architecture is popular on microcontrollers. + +In our modern and common microprocessor, [instructions are located in the +memory](https://en.wikipedia.org/wiki/Von_Neumann_architecture). But that's +nothing we didn't know! If we run a binary it's loaded in the memory and +executed from there. We all know that! + +But you can be surprised to certain level if we stretch that a little bit. + +Well, we know from the previous part that instructions are just numbers, and we +know that they loaded from the memory so let's do some C black magic and see +what happens: + + ::clike + #include<stdint.h> + #include<stdio.h> + + typedef int f0(void); + + int main(int argc, char* argv[]){ + uint32_t instructions[2]; + + instructions[0] = 0x03800513; // addi a0, zero, 56 + instructions[1] = 0x00008067; // jalr zero, ra, 0 + + f0 *load_56 = (f0*) instructions; // Reinterpret the array address + // as a function + int a = load_56(); + printf("%d\n", a); + } + +In that example we build an array of two values. The first one corresponds to +the instruction we encoded by had before and the second corresponds to `jalr +zero, ra, 0`, the return instruction, which you can encode yourself. + +After that we convert the address of the array to a function that returns and +integer and... Boom! We execute the array of numbers. + +The code only works on RISC-V, but don't worry, I can tell you that it prints +`56`. + +So it was true that the machine can execute stuff from the memory, but what we +may not know is that for the machine there's no actual distinction between +instructions and data[^lisp]. We just executed an array of numbers! + +[^lisp]: LISPers are always right. + +The machine doesn't care. If it looks like instructions it executes. + +You can try to put random values in the array and try to execute them, too. +An `Illegal instruction` error is going to happen, probably. If you are lucky +you may execute something by accident, who knows. + +But how did this thing work that well? Why did it return the value correctly +and all that? + +#### Calling convention {#talk} + +The code worked because we are following the RISC-V ABI, the same that C is +following in the example. It tells us how do we need to pass arguments to +functions and return and all that. This part of the ABI that defines how to +call and return from functions is called *calling convention*. + +I'm not going to extend a lot talking about this, but I will just say that +RISC-V has some registers to pass arguments on: `a0`, `a1`...`a7`. And those +registers are also used for return values. + +In the example we are not taking any argument so we don't need to read from any +but we return one value, just writing it in `a0`. + +With what you know, you can now create a function that gets an input argument +and adds an *immediate* to it. Why don't you try? + +On the other hand. RISC-V ABI defines there's a register called `ra` that +contains the Return Address, so we need to jump to it if we want to finish our +function execution. + +There are many things more you can read about, but this is enough to begin. + + +#### Memory protections {#protections} + +The C example where we executed an array is correct, it runs and all that, but +the reality is that memory has different kinds of permissions for each part of +it. + +Code in memory is normally read only and executable, and data can be read-only +or not, depending on the goal it has (constant or variable). + +If you think about the example above, once the array is set, we can overwrite +it later, or even write it from the instructions we inserted on it. This could +led to security issues or unexpected results. That's why code is normally read +only and any attempt to write it will raise an exception to the kernel. + +There are several ways to identify a memory block as code: the RISC-V assembly +(and many others) uses the `.text` directive which automatically sets the block +as a read-only block that can be executed; the `mmap` Linux system call needs +some flags to indicate the protections on the memory block (`PROT_EXEC`, +`PROT_READ`, `PROT_WRITE`...); etc. + + +#### Just-in-Time Compilation {#jit} + +Just-in-time (JIT) Compilation is a way to execute programs that involve a +compilation step at runtime. Typically this happens on interpreted programs, +where the interpreter consumes part of the execution time. An interpreter with +a JIT Compilation feature is able to compile parts of the code it's going to +run to machine code and speed up the execution of those parts. + +Clever interpreters are able to predict if the time they need to compile and +execute the JIT Compiled parts is less than the time they need to interpret +them, so they can decide if it's worth the effort. + +Normally, the JIT Compilation is more effective in pieces of code that are +executed many times because the code only needs to be compiled once and the +speed increase is going to be obtained in every execution. But many algorithms +may be defined, and parts of the code may be recompiled looking for different +optimizations while the interpreter collects data about the performance of the +program. + +Explained like this it looks like it's a complex thing to do (and it is) but +with the previously mentioned points we can imagine a simple JIT machine code +generation library. We "just" need to: + +- Know what code to generate (choose a function to compile, this step may need + some code analysis). +- Reserve some space (`malloc`, `mmap`...) +- Fill the space with numbers (the machine code instructions resulting from the + compilation of the function). +- Next time the program wants to call the function we compiled, call the + numbers instead (as we did [in the demonstration](#demo)). + +##### Example: Lightening, Guile's machine code generation library {#lightening} + +The just-in-time compilation process in Guile is simple, but effective[^guile]. +Guile uses a library called Lightening for it. Lightening is a template-like +library that defines a virtual instruction set. That instruction set is +translated by the library to the instruction set of the actual machine. + +Implementing support for another architecture is as simple as implementing all +the translation code for the new architecture. That's [what I've been doing +these days](https://ekaitz.elenq.tech/lightening.html). + +Guile's JIT compiler only needs to call the instructions of the library and +they will generate actual machine code by themselves, packaged in a function +the interpreter will be able to call later. + +Lightening is simple because it doesn't need to compile from source code, or +make code analysis to find which part of the code does it need to compile. It +just exposes an API that looks like an instruction set and that's the thing +that we translate to the machine code. + +The JIT is going to call the API of Lightening, creating more complex +operations by combining Lightening's instructions and Lightening is going to +convert those operations to their machine code by a simple translation, filling +the array of numbers and returning its address as a function pointer we can +call later. + +Of course, it is much more complex than that because it needs to solve many +other problems we are talking about later but that's the idea. And the idea +doesn't sound too difficult, once you have in mind what we talked about +previously. + +[^guile]: You can read more about [how it works here][guile-jit]. + +[guile-jit]: https://www.gnu.org/software/guile/manual/html_node/Just_002dIn_002dTime-Native-Code.html + + +### Lessons learned {#problems} + +There are many problems that a machine code generation library like that can +encounter, but they are not exclusive for those libraries. These kind of +problems can also appear in compilers, assemblers and many other things. + +The lessons I learned come as problems I encountered during these days of +digging and implementing and some possible solutions or thoughts about them. + +#### Problem: Large immediates {#large-imm} + +Large *immediates* are one of the most obvious but boring issues in this world, +and they apply to many cases. + +In the [example above](#what) we encoded an `addi` instruction that added `56`, +an *immediate*, to a register, and we said the *immediate* had a 12 bit space +in the instruction. Registers in RISC-V are 32 bit (in RV32) or 64 bit (in +RV64) wide, so we can work with larger values, but we are limited to use `12` +bit immediates in `addi` and all the other `I`-type instructions. + +Why is that? Well, RISC-V instructions are 32 bit and they need to be able to +pack much more information than the *immediate* they use, so the *immediates* +can't be as large as we want. The fixed instruction size is a design decision +that keeps the processor simple, but other processors have other design +decisions[^x86] around this. + +[^x86]: In x86 all the instructions don't have the same length and some can + encode larger *immediates*. + +##### Solution: multiple instruction expansion {#multi-inst} + +There are several solutions for this, but the most obvious one is to use more +than one instruction to operate in an immediate. + +If we want to load a 64 bit value, we can add, rotate left, add, rotate left, +add... Until we fill a whole register with the value we were looking for. + +This means a simple addition can be expanded to many instructions. In some +cases they are going to be just a few, but as the *immediates* get big we may +need more than eight instructions, very well encoded, to write the immediate to +a register and be able to operate with it. + +##### Solution: constant insertion {#constants} + +This is not a solution we can take everywhere, but we can take in the context +we are right now (code is stored in the memory and all that, remember). +Consider this RV64 code: + + ::clike + auipc t0, 0 // x[t0] = PC + 0 + ld t0, 12(t0) // x[t0] = mem[ x[t0] + 12 ] + jal zero, 3 // PC = PC + 3 + 0xAAAAAAAAAAAAAAAA // This is a 64 bit literal + addi t0, t0, 1 // x[t0] = x[t0] + 1 + // What's the value of t0 here? + +The code has some comments in the right that I'm going to use through the whole +post, so get used to them. The `x` means register access (base registers are +called X registers in RISC-V), and `mem` is memory, `PC` (program counter) is +written in uppercase and not as if it was a register because it's not +accessible by the programmer so we need to treat it as a global variable we can +only set using jumps or get using `auipc`. + +RISC-V instructions are 32 bit long (4 bytes), so you can get what the offset +in the `ld` instruction does, right? + +Basically we are loading a doubleword (`ld`) at the position of the +`0xAAAAAAAAAAAAAAAA` in the `t0` register and adding `1` to it. So the answer +to the question is `0xAAAAAAAAAAAAAAAB`. + +But can you see the trick we are using? + +The `jal` instruction is jumping over the constant so we can't execute it by +accident (which will cause an `Illegal Instruction` error), and using the `ld` +instruction we are able to load a big constant to a register. A constant which +**is mixed with the code**, as any immediate would be, but without being +associated with any instruction. + +If we know the code we are generating is a function, we can always wait until +the return instruction and insert all the constants after it, so they are +perfectly separated and we don't insert jumps to avoid executing the constants +by accident. For that case, we need to change the values of the `auipc` and the +`ld` accordingly, making them point to the correct address, which has some +associated issues we need to talk about now. + + +--- + +<div style=" + text-align: center; + font-size: smaller; + padding-left: 3em; + padding-right: 3em; + padding-top: 1em; + padding-bottom: 1em; + border-top: 1px solid var(--border-color); + border-bottom: 1px solid var(--border-color)"> +Keep in mind you can hire <a href="https://elenq.tech">ElenQ +Technology</a> if you like this kind of material. <br/> +We teach with this mixture of passion and awkward charisma. We also code and +research. +</div> + +--- + + +#### Problem: Unknown addresses and offsets {#addr-off} + +Addresses and offsets are a pain in the ass because you may don't know them +when you expect to. + +Let's consider an unconditional jump like the one of the previous example. The +number we introduce is the amount of instructions to jump from the program +counter: an offset. The *immediate* offset can be positive, for forward jumps, +or negative, for backward jumps. + + ::clike + jal zero, 3 // PC = PC + 3 + +Generating this jump supposes that you know where you need to jump: you want to +jump 3 instructions to *the future*. + +But imagine you are assembling a file, a real assembly file that is not an +oversimplification of the assembly, like what we did in the previous example. A +real assembly file with *labels*: + + ::clike + add a0, a0, t1 // I don't care about this instruction + j end // Unconditional jump to `end` + + // Some code here + + end: // Label `end` + ret // return + +If you are assembling this file line by line, you can actually assemble the +`add` in the first line, because you know *everything* from it, but you are +unable to emit the `j end` because you don't know where `end` is *yet*. + +If this assembly is written in a file you can always preprocess the whole file, +get the labels, associate them with their addresses and then assemble the whole +thing, but you are not always in this situation. + +Lightening, for instance, generates the code as you call the API, so it doesn't +know where does your jump point to until you call the API for the label later. + +Compilers may encounter this issue too, when they are using separate +compilation and linking steps. You must be able to compile one source file by +your own but you may not know where do global variables appear, because they +might be in a different file, and you only know those at link time. + +##### Solution: relocations {#relocs} + +There's one simple way to solve it: introduce a fake offset or address and +patch it later, when we know the position of the symbol. That's what +relocations do. + +###### Example: C compilers {#relocs-c} + +The relocations are a mechanism to pass information between the compiler and +the linker, you can actually see them in the object files generated by your +compiler. Make a simple file with a global variable and compile it. Something +like this: + + ::clike + int global_symbol; + int main(int argc, char* argv[]){ + return global_symbol !=0; + } + +If you compile it with `gcc -c`, you can inspect relocations in the result with +`objdump`, using the `-r` flag alongside with `-d` for disassemble. In RISC-V +you'll find things like `R_RISCV_HI20` or `R_RISCV_LO12` where the relocations +are located. They are ways to encode *immediates* in `U`-type instructions and +`I`-type instructions respectively. In my case I get something like this (it's +not the full result): + + ::clike + 6: 00000797 auipc a5,0x0 + 6: R_RISCV_PCREL_HI20 global_symbol + 6: R_RISCV_RELAX *ABS* + a: 00078793 addi a5,a5,0x0 + a: R_RISCV_PCREL_LO12_I .L0 + a: R_RISCV_RELAX *ABS* + e: 639c ld a5,0(a5) + +There are two types of relocations but we are going to talk about the +`R_RISCV_RELAX` later. You see my relocations have `PCREL` in the middle, but +just to mention they are relative to the program counter. + +If you just inspect the binary with the `-d` you won't see the relocations and +the result will look like nonsense code[^nonsense-code]: + + ::clike + 6: 00000797 auipc a5,0x0 // x[a5] = PC + 0 + a: 00078793 addi a5,a5,0x0 // x[a5] = x[a5] + 0 + e: 639c ld a5,0(a5) // x[a5] = mem[ x[a5] + 0 ] + +This adds `0` to program counter and stores the result in `a5`, then adds `0` +to `a5`, and loads a doubleword to `a5` from the address at `a5`. But the +address at `a5` at the moment of the load is nothing but the program counter at +the `auipc` instruction. Weird. + +The relocation is going to point to the `auipc` and the `addi`, and tell the +linker it has to replace the zeros by other value. Which one? The address of +the global variable. If we replace the zeros by a combination that is able to +load the address of the global variable the code will work. That's what the +relocation does here. + +So, as we don't know where to point, we insert anything (zeros) and we fix the +instructions when we know where do they need to point to. + +[^nonsense-code]: `addi a5,a5, 0x0` is adding 0 to `a5` and storing it in `a5` + so it's just moving it. RISC-V has a pseudoinstruction for that: `mv a5,a5`, + which is expanded to the `addi`. `objdump` is going to write `mv` in its + output, because it tries to be clever, but that is not the goal of the + instruction we have. I changed it to the actual instruction so we can + understand this better. + +###### Example: Lightening {#relocs-lightening} + +Same approach is followed in Lightening, and you can follow in your assembler, +library or anything that has a similar problem. Let's consider some code using +Lightening (obtained from `tests/beqr.c`, comments on my own): + + ::clike + // Make a function that loads two arguments + jit_load_args_2(j, jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R0), + jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R1)); + + jit_reloc_t r = jit_beqr(j, JIT_R0, JIT_R1); // branch if equal registers + jit_leave_jit_abi(j, 0, 0, align); // end ABI context + jit_reti(j, 0); // return 0 + jit_patch_here(j, r); // make the branch jump here + jit_leave_jit_abi(j, 0, 0, align); // end ABI context + jit_reti(j, 1); // return 1 + + // Obtain the function we created + jit_word_t (*f)(jit_word_t, jit_word_t) = jit_end(j, NULL); + + // Test if it works + ASSERT(f(0, 0) == 1); // 0 == 0 so it jumps -> returns 1 + ASSERT(f(0, 1) == 0); // 0 != 1 so it doesn't jump -> returns 0 + +In this example we see how we generate machine code statement by statement, so +there's no way to know where does the `beqr` need to jump until we generated +all the code for it. + +You see the `beqr` function doesn't receive the target address or offset as an +argument, but it returns a `jit_reloc_t`, which other functions like `reti` +don't return. + +That `jit_reloc_t` is what we are patching later with the `jit_patch_here` to +tell where does it need to jump. The `jit_patch_here` function is going to +correct the bits we set to zero because we didn't know the target at that +moment. + +There are different kinds of relocations, as it happened in the previous +example with the compilers, because different instruction formats need to be +patched in different ways. In the case of Lightening, the relocation has a type +associated with it, so we can check and act accordingly. + + +#### Problem: Long jumps {#jumps} + +As we saw, some jumps encode the target as an *immediate*. This has a couple of +implications that we just described in previously: + +- The jump target could be larger than the space we have for the immediate. +- Sometimes we can't know the target until we reach the position where the jump + points to. + +Both issues can be combined together in a killer combo. Consider this code: + + ::clike + j label // jump to label + + // A lot of instructions here + + label: + // this is the target of the jump + +In RISC-V the `j` pseudoinstruction is resolved to `jal`, that has a `21` bit +(signed) space for the jump target. If we had a hella lot of instructions +between the jump and the target we may need more bits for the jump than the +space we actually have. + +Again, in the case were we can preprocess everything there's no problem, but if +we are assembling the instructions as they come we are going to have issues. +We realize we can't jump that far too late, because we already inserted +a 21 bit jump and too many instructions when we reach the label. Patching the +jump is not enough, because we didn't leave enough space to insert the offset +we need. + +##### Solution: always insert the largest jump possible {#always-largest} + +There's an obvious solution: always insert the largest possible jump and patch +the whole jump later. + +In RISC-V `jalr` jumps to the absolute address that is stored on a register +with an optional 12 bit (signed) offset. Combined with the `auipc` (add upper +immediate to program counter) it lets us make 32 bit relative jumps in just 2 +instructions. Let's explain that in code just in case: + + ::clike + auipc t0, offset_hi // x[t0] = PC + (offset_hi<<12) + jalr zero, offset_lo(t0) // PC = x[t0] + offset_lo + +If we consider `offset` as a value we know, we can split it in two blocks: the +highest 20 bits as `offset_hi` and the lowest 12 bits as `offset_low` and use +them to jump to any address in the 32 range from the current position, using +just 2 instructions. + +In 32 bit machines, this jump is the largest jump possible, because the machine +can only address 32 bits, so we will be sure that any relative (or absolute, +using `lui` instead of `auipc`) jump we want to make can fit in place. The only +thing we have to take in account is to patch both instructions when we find the +targets, not only one. + +###### Optimization: pointer relaxation {#relaxation} + +But using the largest possible jumps can led to inefficiencies because we use +two instructions for jumps that can potentially fit in just one. + +We can use something we saw before for that: relocations. More specifically, +in the case of the GCC toolchain, we can use the `R_RISCV_RELAX` that appeared +before. + +The relaxation relocation is going to tell the next step, which can be the +linker or anything else depending on the context we are working on, that the +pointer can be relaxed. In the case of the `auipc` + `jalr`, possibly by +replacing both instructions by a 21 bit jump like `jal`. + +So we start with the longest jump possible, but when we actually know the +target of the jump, we can reduce it to something smaller that needs fewer +instructions. + +###### Example: relaxed global variable access in C compilers {#relax-example} + +Global variables, as we saw before, are some of those points where compilers +need to use relocations and let the linker clean the result. + +Global variables don't necessarily involve jumps but they do involve pointers +for the loads and stores needed to operate with them. In the final executables, +global variables are part of the `.data` segment, because they are known at +compilation time, so we can exploit that fact a little and relax our weird +`auipc` + *load/store* combos. + +RISC-V has many registers, so we can use them for things that may not be the +norm in other platforms where registers are scarce. In this case, we can +exploit the `gp` (global pointer) register on RISC-V to improve the access to +the global variables. We can cache the address of the `.data` segment of the +program in the `gp` register so, as we know most of the global variables are +going to be near (12 bit offset) to the beginning of the `.data` segment, we +are probably going to be able to remove some of the `auipc`s we inserted +before. + +So a simple load of a global 64bit variable to a register: + + ::clike + auipc t0, offset_to_global_hi // x[t0] = PC + offset_to_global_hi << 12 + ld t0, offset_to_global_lo(t0) // x[t0] = mem[ x[t0] + offset_to_global_lo ] + +Is optimized to this: + + ::clike + ld t0, offset_from_data(gp) // x[t0] = mem[ x[gp] + offset_from_data ] + +Of course, the offsets have to be calculated and all that, but it's not that +difficult. + + +##### Solution: Veneers {#veneer} + +There are other solutions that don't involve messing around with the code we +generated earlier in an aggressive way like removing instructions, which can be +pretty bad because you have to shift the array of instructions you generated to +clean the gaps the pointer relaxation leaves. + +Veneers are non destructive, and they involve no instruction reorganization, so +they are interesting for those cases where you need to generate the code as you +go. + +Let's explain them with an example: + + ::clike + beq a0, a1, branch // Jump to `branch` if x[a0] == x[a1] + + // Instructions... + + branch: + // Branch target + +As we saw previously, if we insert too many instructions in between the jump +and the target we screw it. What we didn't mention is that as we go assembling +instructions one by one we can keep a track of the amount of instructions we +are inserting. + +Having that in mind, we can take decisions in time, right before it's too late. +We can combine that knowledge with the [constant insertion](#constants) method +introduced before to insert full-range jumps if needed, right before we exhaust +the possible offset of the original instruction. + +Of course, we need to patch the original instruction to jump to the code we are +just going to insert, and we need to add some protections around the veneer to +make it only accessible to the original jump. + + ::clike + beq a0, a1, veneer // Jump to `veneer` if x[a0] == x[a1] + + // Many instructions, but not too many! + + // Here we realize we are running out of offset range so we insert a helper + // block that lets us jump further. + j avoid // Jump to `avoid` so the normal execution flow + // doesn't fall in the veneer + veneer: + auipc t0,0 // x[t0] = PC + 0 + ld t0,12(t0) // x[t0] = mem[ x[t0] + 12 ] + jalr zero,0(t0) // PC = x[t0] + ADDRESS(branch) // Literal address of `branch` label + avoid: + + // As many instructions as we want + + branch: + // Branch target + +As it happened with constant insertion, there are positions where the veneer +insertion can be optimized a little, like right after a return or an +unconditional jump, so we don't need the protection (`j avoid` in the example). + +The bad thing about veneers is they insert a bunch of instructions in the cases +that are not in range and the jumps are done in two steps, that has a negative +effect on the performance because they drop the pre-processed instructions [in +the pipeline](https://en.wikipedia.org/wiki/Instruction_pipelining). + +Of course, the veneers themselves have to be patched too, because we won't know +the target (`branch` in the example) until we reach it. But, in the case of the +veneer we can be 100% sure that we are going to be able to point to the target. + +###### Example: Lightening's constant pools {#lightening-veneer} + +Lightening uses veneers for the jumps[^veneer-if-needed], but they are part of +Lightening's constant pool mechanism. Constant pools work the same for the +constant insertion than for veneers, because veneers are basically constants. +Remember, code is numbers! + +[^veneer-if-needed]: Only in the architectures that need them. `x86` does not + need constant pools or veneers because the ISA is complex enough to handle + the problematic cases adding levels of complexity ISAs like RISC-V or ARM + didn't want to deal with. RISC vs CISC, y'know... + +Basically anything that might be inserted as a constant, which can be a veneer +or just a number or whatever, is queued to the constant pool. The library is +going to emit instructions and check on each instruction if it needs to emit +any of the constants of the pool. + +The constant pool and each of the entries on the pool have associated +information that tells the emitter if they need to be emitted now or if they +can wait for later so the emitter can decide to insert them. + +The literal pool entries have, of course, an associated relocation that +contains information of the original jump or load instructions we may need to +patch, as we already saw. So, in the case of a veneer emission, we need to +patch the original jump to the veneer and remember the veneer needs to be +patched later, when we find its target. + +The mechanism is not complex, but it's not simple neither. There are several +kinds of relocations, depending on what we want to do with them, different kind +of patches we need to do, address calculations and all those things that +require a level of attention to the detail I'm not prepared to talk about. + +#### Problem: Register access {#reg-access} + +You may have seen a problematic point in some of the solutions we worked with: +we are using registers. + +It's not a problem by itself, but using registers might be really problematic +if we are inserting code between the instructions someone else wrote because we +can't control the register use the original program did and we might be +changing the values inside of the registers in the magic tricks we sneakily +inserted. + +Imagine we use, say, `t0` register in our veneer but the original program uses +that register for something else. That's a problem. We are messing with the +value in the register and potentially (surely) breaking the program. + +##### Solution: use the stack {#stack} + +The most obvious solution you can think of is to use the stack. We can surround +our veneers or code insertions with some protection code that saves the values +of the registers on the stack and restores them when we finished. + +It's a simple solution in your mind, but if you need to deal with the jumps it +can get messy. You may need to restore the register far away in the code and +keeping track of everything. It can be complicated. + +On the other hand, memory access is slow and boring and we don like that kind +of things in our lives. We need more dynamite. + +##### Solution: controlled register access {#controlled-regs} + +The other solution we can provide is to keep control of the registers that are +being accessed and use others for our intervention. + +A simple way to do this is to provide functions to get and release temporary +registers, instead of letting the programmers do whatever they want. This makes +sure that all the register access is controlled and we are not changing the +values of any register in use. + +The main problem we can have comes when the programmer needs all the register +for their things and then we can't really use any for our magic tricks. But we +can always keep at least one register for us and only for us (throwing an error +to the programmer when they use it) or even combine the use of the stack with +this solution. + +If we are directly working with assembly code, where we can't force the +programmer to use the interface we want, we can choose the solutions that don't +involve register access so we don't need to analyze the code to deduce if the +programmer is using the registers or not. Avoiding the problem is sometimes the +best solution. + +In the case of libraries like Lightening register access control is a must +because Lightening can't control how its (virtual-) instructions are translated +to machine code instructions: each machine has its own peculiarities and +details. In many cases they need to make use of temporary registers and, as the +instructions are built incrementally, preventing instructions from peeing on +each other is important. + +--- + +<div style=" + text-align: center; + font-size: smaller; + padding-left: 3em; + padding-right: 3em; + padding-top: 1em; + padding-bottom: 1em; + border-top: 1px solid var(--border-color); + border-bottom: 1px solid var(--border-color)"> +Please, consider supporting me on <a href="https://buymeacoffee.com/ekaitz">Buy +Me a Coffee</a> or on <a href="https://liberapay.com/ekaitz">Liberapay</a> to +encourage my free software work. +</div> + +--- + +### Final thoughts {#final} + +I know these are just a few things, but they are enough to let you make you +first program that involves machine code generation to certain level. + +I'm not a computer scientist but a telecommunication engineer[^engineer], so I +may put the focus on things that are obvious for the average reader of these +kind of posts, but at the same time I may be flying over things that I +consider basic due to my studies but the average reader doesn't. In any case, +feel free to [contact me](https://ekaitz.elenq.tech/pages/about.html) if you +have questions or corrections. + +Some of the tricks and lessons I included here are more important than others, +but the most important thing is to start thinking in these terms. Try to +understand the problems you face when you have separate compilation, assume +the fact that you can't know the future... The mindset is the most important +point of all this and, once you have it, everything comes easier. + +It's also a lot of fun to realize code is just numbers in memory you can mess +around with. I hope you keep it in your brain forever. + +I hope this post throws some light on this dark hole that machine code +generation is and makes you try to make your own findings on this beautiful +area of compilers, machines and extremely long blog entries. + + +[^engineer]: So, for all that software developers that write blog posts like + "Are we really engineers?" or stuff like that: **I am**, thanks for the + interest. LOL |