From 7e7a956916899a2f35a22ec12ae08cd56e12e52f Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Sat, 26 Jun 2021 22:54:53 +0200
Subject: Add machine code generation entry

---
 content/machine-code-generation.md        | 895 ++++++++++++++++++++++++++++++
 pelicanconf.py                            |   2 +
 themes/elenq/static/css/elenq-pelican.css |   5 +
 themes/elenq/static/css/elenq.css         |   6 +-
 4 files changed, 905 insertions(+), 3 deletions(-)
 create mode 100644 content/machine-code-generation.md

diff --git a/content/machine-code-generation.md b/content/machine-code-generation.md
new file mode 100644
index 0000000..ebf2fe2
--- /dev/null
+++ b/content/machine-code-generation.md
@@ -0,0 +1,895 @@
+Title: Lessons learned on machine code generation
+Date: 2021-06-16
+Category:
+Tags:
+Slug: machine-code-generation
+Lang: en
+Summary:
+    A summary of the lessons I learned about machine code generation during my
+    work at Lightening, Hex0 and all my recent research on compilers.
+
+
+Machine code generation sounded like a weird magic to me half a year ago, I
+swear, but now it doesn't look so disturbingly complicated. *Nothing in
+computer science is that complicated, after all*.
+
+
+1. [Basics](#basics)
+    1. [Machine code is numbers](#what)
+        1. [Demonstration](#demo)
+    3. [Calling convention](#talk)
+    4. [Memory protections](#protections)
+    5. [Just-in-Time Compilation](#jit)
+        1. [Example: Lightening: Guile's machine code generation
+           library](#lightening)
+2. [Lessons learned](#problems)
+    1. [Problem: Large immediates](#large-imm)
+        1. [Solution: multiple instruction expansion](#multi-inst)
+        2. [Solution: constant insertion](#constants)
+    2. [Problem: Unknown addresses and offsets](#addr-off)
+        1. [Solution: relocations](#relocs)
+            1. [Example: C compilers](#relocs-c)
+            2. [Example: Lightening](#relocs-lightening)
+    3. [Problem: Long jumps](#jumps)
+        1. [Solution: always insert the largest jump possible](#always-largest)
+            1. [Optimization: pointer relaxation](#relaxation)
+            2. [Example: relaxed global variable access in C
+               compilers](#relax-example)
+        2. [Solution: Veneers](#veneer)
+            1. [Example: Lightening's veneer system](#lightening-veneer)
+    4. [Problem: Register Access](#reg-access)
+        1. [Solution: use the stack](#stack)
+        2. [Solution: controlled register access](#controlled-regs)
+3. [Final thoughts](#final)
+
+### Basics {#basics}
+
+There are many contexts where you may need to generate machine code. If you are
+writing a compiler, an assembler, a jit compiler... In the last months I've
+been working on Lightening, a machine code generation library that powers
+Guile's JIT Compilation, a RISC-V assembler and interpreter and Hex0, which was
+introduced in the [previous post](https://ekaitz.elenq.tech/hex0.html), where I
+needed to assemble a file by hand.
+
+All of those cases result in the same thing, even if they have different
+conditions: we are generating machine code.
+
+In this post I'll try to talk about some issues that are generic and apply to
+all the cases and others that are more specific to some of the projects I
+mention.
+
+But first we need to clarify some stuff just in case.
+
+#### Machine code is numbers {#what}
+
+> Machine code is what people from the electronics world call "code".
+
+I know you know it, but let's refresh some things about computing we may have
+forgotten thanks to all the efforts that [hide the
+complexity](https://ekaitz.elenq.tech/hiding-the-complexity.html) of our
+everyday business.
+
+Machine code instructions are basically blocks of bits your processor is
+reading and interpreting. Those bit blocks encode all the information the
+processor needs: the identifier of the instruction and its arguments.
+
+The identifier is normally known as *opcode*. The arguments can have many
+different meanings, depending on the instruction so we are not getting into
+that. The instructions normally alter the values of registers, so they need to
+have identifiers for the source and destination registers, or literal values
+that are introduced literally inside of the instruction (they are called
+*immediates*).
+
+Let's put a simple RISC-V example here. Consider this assembly instruction:
+
+    ::asm
+    addi a0, zero, 56
+
+This thing you interpret as some assembly instruction that adds `56` to the
+`zero` register and stores the result in the `a0` register, has to be encoded
+in a way that the machine is able to understand. Better said, it is encoded in
+a way that **you** can understand! The real instruction is a bunch of bits that
+represent the same thing.
+
+RISC-V base ISA has a various instruction formats which depend on the goal of
+the instruction. This one is from the `I` format, because it includes an
+*immediate*. Read it and compare with the following:
+
+- First the *opcode*, `addi` for you, has a binary counterpart: `0010011`. 7
+  bits for this instruction format.
+- Then the destination register, `a0`, has an binary representation: `01010`.
+  There are 32 registers in RISC-V so each of them are represented by a 5 bit
+  value.
+- There's some extra space for an opcode-like field called `funct3`: `000`
+- Then there the source register, `zero`, which is: `00000`. Again 5 bits.
+- And the *immediate* you are adding, `56`, which is just the binary
+  representation of `56`: `000000111000`. It's 12 bits wide for this
+  instruction format.
+
+Putting all together:
+
+    ::asm
+    000000111000 | 00000 | 000 | 01010 | 0010011
+
+So this forms the following binary value:
+
+    ::asm
+    00000011100000000000010100010011
+
+Or in hex:
+
+    ::asm
+    0x3800513
+
+> Just for if you didn't realize, you just assembled an instruction by hand.
+
+That instruction we just created is going to be processed by the machine,
+reading each of the fields and activating its circuits as it needs according to
+the voltage levels those values represent.
+
+In this case, it's going to activate the ALU to add the numbers and all
+that kind of things, but in other cases it may just change the value of the
+program counter or whatever.  All this is executed by the circuitry of the
+device, right after it loads the instruction.
+
+That's for the machine, but for us, from the perspective of a programmer,
+instructions are just numbers, as we just saw.
+
+
+##### Demonstration {#demo}
+
+I purposely used *machine* to refer to the device that runs our instructions,
+but we have to be more specific about it now.
+
+I'm going to talk specifically about modern (and common) microprocessors,
+because other devices may have peculiarities that can sidetrack us too
+hard[^harvard].
+
+[^harvard]: One of those peculiarities is the [Harvard
+  Architecture](https://en.wikipedia.org/wiki/Harvard_architecture) that is not
+  going to let us make the fantastic trick I'm going to show you now. Harvard
+  Architecture is popular on microcontrollers.
+
+In our modern and common microprocessor, [instructions are located in the
+memory](https://en.wikipedia.org/wiki/Von_Neumann_architecture). But that's
+nothing we didn't know! If we run a binary it's loaded in the memory and
+executed from there. We all know that!
+
+But you can be surprised to certain level if we stretch that a little bit.
+
+Well, we know from the previous part that instructions are just numbers, and we
+know that they loaded from the memory so let's do some C black magic and see
+what happens:
+
+    ::clike
+    #include<stdint.h>
+    #include<stdio.h>
+
+    typedef int f0(void);
+
+    int main(int argc, char* argv[]){
+        uint32_t instructions[2];
+
+        instructions[0] = 0x03800513; // addi a0, zero, 56
+        instructions[1] = 0x00008067; // jalr zero, ra, 0
+
+        f0 *load_56 = (f0*) instructions; // Reinterpret the array address
+                                          // as a function
+        int a = load_56();
+        printf("%d\n", a);
+    }
+
+In that example we build an array of two values. The first one corresponds to
+the instruction we encoded by had before and the second corresponds to `jalr
+zero, ra, 0`, the return instruction, which you can encode yourself.
+
+After that we convert the address of the array to a function that returns and
+integer and... Boom! We execute the array of numbers.
+
+The code only works on RISC-V, but don't worry, I can tell you that it prints
+`56`.
+
+So it was true that the machine can execute stuff from the memory, but what we
+may not know is that for the machine there's no actual distinction between
+instructions and data[^lisp]. We just executed an array of numbers!
+
+[^lisp]: LISPers are always right.
+
+The machine doesn't care. If it looks like instructions it executes.
+
+You can try to put random values in the array and try to execute them, too.
+An `Illegal instruction` error is going to happen, probably. If you are lucky
+you may execute something by accident, who knows.
+
+But how did this thing work that well? Why did it return the value correctly
+and all that?
+
+#### Calling convention {#talk}
+
+The code worked because we are following the RISC-V ABI, the same that C is
+following in the example. It tells us how do we need to pass arguments to
+functions and return and all that. This part of the ABI that defines how to
+call and return from functions is called *calling convention*.
+
+I'm not going to extend a lot talking about this, but I will just say that
+RISC-V has some registers to pass arguments on: `a0`, `a1`...`a7`. And those
+registers are also used for return values.
+
+In the example we are not taking any argument so we don't need to read from any
+but we return one value, just writing it in `a0`.
+
+With what you know, you can now create a function that gets an input argument
+and adds an *immediate* to it. Why don't you try?
+
+On the other hand. RISC-V ABI defines there's a register called `ra` that
+contains the Return Address, so we need to jump to it if we want to finish our
+function execution.
+
+There are many things more you can read about, but this is enough to begin.
+
+
+#### Memory protections {#protections}
+
+The C example where we executed an array is correct, it runs and all that, but
+the reality is that memory has different kinds of permissions for each part of
+it.
+
+Code in memory is normally read only and executable, and data can be read-only
+or not, depending on the goal it has (constant or variable).
+
+If you think about the example above, once the array is set, we can overwrite
+it later, or even write it from the instructions we inserted on it. This could
+led to security issues or unexpected results. That's why code is normally read
+only and any attempt to write it will raise an exception to the kernel.
+
+There are several ways to identify a memory block as code: the RISC-V assembly
+(and many others) uses the `.text` directive which automatically sets the block
+as a read-only block that can be executed; the `mmap` Linux system call needs
+some flags to indicate the protections on the memory block (`PROT_EXEC`,
+`PROT_READ`, `PROT_WRITE`...); etc.
+
+
+#### Just-in-Time Compilation {#jit}
+
+Just-in-time (JIT) Compilation is a way to execute programs that involve a
+compilation step at runtime. Typically this happens on interpreted programs,
+where the interpreter consumes part of the execution time. An interpreter with
+a JIT Compilation feature is able to compile parts of the code it's going to
+run to machine code and speed up the execution of those parts.
+
+Clever interpreters are able to predict if the time they need to compile and
+execute the JIT Compiled parts is less than the time they need to interpret
+them, so they can decide if it's worth the effort.
+
+Normally, the JIT Compilation is more effective in pieces of code that are
+executed many times because the code only needs to be compiled once and the
+speed increase is going to be obtained in every execution. But many algorithms
+may be defined, and parts of the code may be recompiled looking for different
+optimizations while the interpreter collects data about the performance of the
+program.
+
+Explained like this it looks like it's a complex thing to do (and it is) but
+with the previously mentioned points we can imagine a simple JIT machine code
+generation library. We "just" need to:
+
+- Know what code to generate (choose a function to compile, this step may need
+  some code analysis).
+- Reserve some space (`malloc`, `mmap`...)
+- Fill the space with numbers (the machine code instructions resulting from the
+  compilation of the function).
+- Next time the program wants to call the function we compiled, call the
+  numbers instead (as we did [in the demonstration](#demo)).
+
+##### Example: Lightening, Guile's machine code generation library {#lightening}
+
+The just-in-time compilation process in Guile is simple, but effective[^guile].
+Guile uses a library called Lightening for it. Lightening is a template-like
+library that defines a virtual instruction set. That instruction set is
+translated by the library to the instruction set of the actual machine.
+
+Implementing support for another architecture is as simple as implementing all
+the translation code for the new architecture. That's [what I've been doing
+these days](https://ekaitz.elenq.tech/lightening.html).
+
+Guile's JIT compiler only needs to call the instructions of the library and
+they will generate actual machine code by themselves, packaged in a function
+the interpreter will be able to call later.
+
+Lightening is simple because it doesn't need to compile from source code, or
+make code analysis to find which part of the code does it need to compile. It
+just exposes an API that looks like an instruction set and that's the thing
+that we translate to the machine code.
+
+The JIT is going to call the API of Lightening, creating more complex
+operations by combining Lightening's instructions and Lightening is going to
+convert those operations to their machine code by a simple translation, filling
+the array of numbers and returning its address as a function pointer we can
+call later.
+
+Of course, it is much more complex than that because it needs to solve many
+other problems we are talking about later but that's the idea. And the idea
+doesn't sound too difficult, once you have in mind what we talked about
+previously.
+
+[^guile]: You can read more about [how it works here][guile-jit].
+
+[guile-jit]: https://www.gnu.org/software/guile/manual/html_node/Just_002dIn_002dTime-Native-Code.html
+
+
+### Lessons learned {#problems}
+
+There are many problems that a machine code generation library like that can
+encounter, but they are not exclusive for those libraries. These kind of
+problems can also appear in compilers, assemblers and many other things.
+
+The lessons I learned come as problems I encountered during these days of
+digging and implementing and some possible solutions or thoughts about them.
+
+#### Problem: Large immediates {#large-imm}
+
+Large *immediates* are one of the most obvious but boring issues in this world,
+and they apply to many cases.
+
+In the [example above](#what) we encoded an `addi` instruction that added `56`,
+an *immediate*, to a register, and we said the *immediate* had a 12 bit space
+in the instruction. Registers in RISC-V are 32 bit (in RV32) or 64 bit (in
+RV64) wide, so we can work with larger values, but we are limited to use `12`
+bit immediates in `addi` and all the other `I`-type instructions.
+
+Why is that? Well, RISC-V instructions are 32 bit and they need to be able to
+pack much more information than the *immediate* they use, so the *immediates*
+can't be as large as we want. The fixed instruction size is a design decision
+that keeps the processor simple, but other processors have other design
+decisions[^x86] around this.
+
+[^x86]: In x86 all the instructions don't have the same length and some can
+  encode larger *immediates*.
+
+##### Solution: multiple instruction expansion {#multi-inst}
+
+There are several solutions for this, but the most obvious one is to use more
+than one instruction to operate in an immediate.
+
+If we want to load a 64 bit value, we can add, rotate left, add, rotate left,
+add... Until we fill a whole register with the value we were looking for.
+
+This means a simple addition can be expanded to many instructions. In some
+cases they are going to be just a few, but as the *immediates* get big we may
+need more than eight instructions, very well encoded, to write the immediate to
+a register and be able to operate with it.
+
+##### Solution: constant insertion {#constants}
+
+This is not a solution we can take everywhere, but we can take in the context
+we are right now (code is stored in the memory and all that, remember).
+Consider this RV64 code:
+
+    ::clike
+    auipc t0, 0             // x[t0] = PC + 0
+    ld t0, 12(t0)           // x[t0] = mem[ x[t0] + 12 ]
+    jal zero, 3             // PC    = PC + 3
+    0xAAAAAAAAAAAAAAAA      // This is a 64 bit literal
+    addi t0, t0, 1          // x[t0] = x[t0] + 1
+    // What's the value of t0 here?
+
+The code has some comments in the right that I'm going to use through the whole
+post, so get used to them. The `x` means register access (base registers are
+called X registers in RISC-V), and `mem` is memory, `PC` (program counter) is
+written in uppercase and not as if it was a register because it's not
+accessible by the programmer so we need to treat it as a global variable we can
+only set using jumps or get using `auipc`.
+
+RISC-V instructions are 32 bit long (4 bytes), so you can get what the offset
+in the `ld` instruction does, right?
+
+Basically we are loading a doubleword (`ld`) at the position of the
+`0xAAAAAAAAAAAAAAAA` in the `t0` register and adding `1` to it. So the answer
+to the question is `0xAAAAAAAAAAAAAAAB`.
+
+But can you see the trick we are using?
+
+The `jal` instruction is jumping over the constant so we can't execute it by
+accident (which will cause an `Illegal Instruction` error), and using the `ld`
+instruction we are able to load a big constant to a register. A constant which
+**is mixed with the code**, as any immediate would be, but without being
+associated with any instruction.
+
+If we know the code we are generating is a function, we can always wait until
+the return instruction and insert all the constants after it, so they are
+perfectly separated and we don't insert jumps to avoid executing the constants
+by accident. For that case, we need to change the values of the `auipc` and the
+`ld` accordingly, making them point to the correct address, which has some
+associated issues we need to talk about now.
+
+
+---
+
+<div style="
+    text-align: center;
+    font-size: smaller;
+    padding-left: 3em;
+    padding-right: 3em;
+    padding-top: 1em;
+    padding-bottom: 1em;
+    border-top: 1px solid var(--border-color);
+    border-bottom: 1px solid var(--border-color)">
+Keep in mind you can hire <a href="https://elenq.tech">ElenQ
+Technology</a> if you like this kind of material. <br/>
+We teach with this mixture of passion and awkward charisma. We also code and
+research.
+</div>
+
+---
+
+
+#### Problem: Unknown addresses and offsets {#addr-off}
+
+Addresses and offsets are a pain in the ass because you may don't know them
+when you expect to.
+
+Let's consider an unconditional jump like the one of the previous example. The
+number we introduce is the amount of instructions to jump from the program
+counter: an offset. The *immediate* offset can be positive, for forward jumps,
+or negative, for backward jumps.
+
+    ::clike
+    jal zero, 3             // PC    = PC + 3
+
+Generating this jump supposes that you know where you need to jump: you want to
+jump 3 instructions to *the future*.
+
+But imagine you are assembling a file, a real assembly file that is not an
+oversimplification of the assembly, like what we did in the previous example. A
+real assembly file with *labels*:
+
+    ::clike
+    add a0, a0, t1     // I don't care about this instruction
+    j end              // Unconditional jump to `end`
+
+    // Some code here
+
+    end:               // Label `end`
+        ret            // return
+
+If you are assembling this file line by line, you can actually assemble the
+`add` in the first line, because you know *everything* from it, but you are
+unable to emit the `j end` because you don't know where `end` is *yet*.
+
+If this assembly is written in a file you can always preprocess the whole file,
+get the labels, associate them with their addresses and then assemble the whole
+thing, but you are not always in this situation.
+
+Lightening, for instance, generates the code as you call the API, so it doesn't
+know where does your jump point to until you call the API for the label later.
+
+Compilers may encounter this issue too, when they are using separate
+compilation and linking steps. You must be able to compile one source file by
+your own but you may not know where do global variables appear, because they
+might be in a different file, and you only know those at link time. 
+
+##### Solution: relocations {#relocs}
+
+There's one simple way to solve it: introduce a fake offset or address and
+patch it later, when we know the position of the symbol. That's what
+relocations do.
+
+###### Example: C compilers {#relocs-c}
+
+The relocations are a mechanism to pass information between the compiler and
+the linker, you can actually see them in the object files generated by your
+compiler. Make a simple file with a global variable and compile it. Something
+like this:
+
+    ::clike
+    int global_symbol;
+    int main(int argc, char* argv[]){
+        return global_symbol !=0;
+    }
+
+If you compile it with `gcc -c`, you can inspect relocations in the result with
+`objdump`, using the `-r` flag alongside with `-d` for disassemble.  In RISC-V
+you'll find things like `R_RISCV_HI20` or `R_RISCV_LO12` where the relocations
+are located. They are ways to encode *immediates* in `U`-type instructions and
+`I`-type instructions respectively.  In my case I get something like this (it's
+not the full result):
+
+    ::clike
+       6:       00000797                auipc   a5,0x0
+                6: R_RISCV_PCREL_HI20   global_symbol
+                6: R_RISCV_RELAX        *ABS*
+       a:       00078793                addi    a5,a5,0x0
+                a: R_RISCV_PCREL_LO12_I .L0 
+                a: R_RISCV_RELAX        *ABS*
+       e:       639c                    ld      a5,0(a5)
+
+There are two types of relocations but we are going to talk about the
+`R_RISCV_RELAX` later. You see my relocations have `PCREL` in the middle, but
+just to mention they are relative to the program counter.
+
+If you just inspect the binary with the `-d` you won't see the relocations and
+the result will look like nonsense code[^nonsense-code]:
+
+    ::clike
+        6:       00000797   auipc   a5,0x0         // x[a5] = PC + 0
+        a:       00078793   addi    a5,a5,0x0      // x[a5] = x[a5] + 0
+        e:       639c       ld      a5,0(a5)       // x[a5] = mem[ x[a5] + 0 ]
+
+This adds `0` to program counter and stores the result in `a5`, then adds `0`
+to `a5`, and loads a doubleword to `a5` from the address at `a5`. But the
+address at `a5` at the moment of the load is nothing but the program counter at
+the `auipc` instruction.  Weird.
+
+The relocation is going to point to the `auipc` and the `addi`, and tell the
+linker it has to replace the zeros by other value. Which one? The address of
+the global variable. If we replace the zeros by a combination that is able to
+load the address of the global variable the code will work. That's what the
+relocation does here.
+
+So, as we don't know where to point, we insert anything (zeros) and we fix the
+instructions when we know where do they need to point to.
+
+[^nonsense-code]: `addi a5,a5, 0x0` is adding 0 to `a5` and storing it in `a5`
+  so it's just moving it. RISC-V has a pseudoinstruction for that: `mv a5,a5`,
+  which is expanded to the `addi`. `objdump` is going to write `mv` in its
+  output, because it tries to be clever, but that is not the goal of the
+  instruction we have. I changed it to the actual instruction so we can
+  understand this better.
+
+###### Example: Lightening {#relocs-lightening}
+
+Same approach is followed in Lightening, and you can follow in your assembler,
+library or anything that has a similar problem. Let's consider some code using
+Lightening (obtained from `tests/beqr.c`, comments on my own):
+    
+    ::clike
+    // Make a function that loads two arguments
+    jit_load_args_2(j, jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R0),
+                    jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R1));
+    
+    jit_reloc_t r = jit_beqr(j, JIT_R0, JIT_R1); // branch if equal registers
+    jit_leave_jit_abi(j, 0, 0, align);           // end ABI context
+    jit_reti(j, 0);                              // return 0
+    jit_patch_here(j, r);                        // make the branch jump here
+    jit_leave_jit_abi(j, 0, 0, align);           // end ABI context
+    jit_reti(j, 1);                              // return 1
+
+    // Obtain the function we created
+    jit_word_t (*f)(jit_word_t, jit_word_t) = jit_end(j, NULL);
+
+    // Test if it works
+    ASSERT(f(0, 0) == 1);       // 0 == 0 so it jumps        -> returns 1
+    ASSERT(f(0, 1) == 0);       // 0 != 1 so it doesn't jump -> returns 0
+
+In this example we see how we generate machine code statement by statement, so
+there's no way to know where does the `beqr` need to jump until we generated
+all the code for it.
+
+You see the `beqr` function doesn't receive the target address or offset as an
+argument, but it returns a `jit_reloc_t`, which other functions like `reti`
+don't return.
+
+That `jit_reloc_t` is what we are patching later with the `jit_patch_here` to
+tell where does it need to jump. The `jit_patch_here` function is going to
+correct the bits we set to zero because we didn't know the target at that
+moment.
+
+There are different kinds of relocations, as it happened in the previous
+example with the compilers, because different instruction formats need to be
+patched in different ways. In the case of Lightening, the relocation has a type
+associated with it, so we can check and act accordingly.
+
+
+#### Problem: Long jumps {#jumps}
+
+As we saw, some jumps encode the target as an *immediate*. This has a couple of
+implications that we just described in previously:
+
+- The jump target could be larger than the space we have for the immediate.
+- Sometimes we can't know the target until we reach the position where the jump
+  points to.
+
+Both issues can be combined together in a killer combo. Consider this code:
+
+    ::clike
+    j label     // jump to label
+
+    // A lot of instructions here
+
+    label:
+        // this is the target of the jump
+
+In RISC-V the `j` pseudoinstruction is resolved to `jal`, that has a `21` bit
+(signed) space for the jump target. If we had a hella lot of instructions
+between the jump and the target we may need more bits for the jump than the
+space we actually have.
+
+Again, in the case were we can preprocess everything there's no problem, but if
+we are assembling the instructions as they come we are going to have issues.
+We realize we can't jump that far too late, because we already inserted
+a 21 bit jump and too many instructions when we reach the label. Patching the
+jump is not enough, because we didn't leave enough space to insert the offset
+we need.
+
+##### Solution: always insert the largest jump possible {#always-largest}
+
+There's an obvious solution: always insert the largest possible jump and patch
+the whole jump later.
+
+In RISC-V `jalr` jumps to the absolute address that is stored on a register
+with an optional 12 bit (signed) offset. Combined with the `auipc` (add upper
+immediate to program counter) it lets us make 32 bit relative jumps in just 2
+instructions. Let's explain that in code just in case:
+
+    ::clike
+    auipc t0, offset_hi         // x[t0] = PC + (offset_hi<<12)
+    jalr zero, offset_lo(t0)    // PC    = x[t0] + offset_lo
+
+If we consider `offset` as a value we know, we can split it in two blocks: the
+highest 20 bits as `offset_hi` and the lowest 12 bits as `offset_low` and use
+them to jump to any address in the 32 range from the current position, using
+just 2 instructions.
+
+In 32 bit machines, this jump is the largest jump possible, because the machine
+can only address 32 bits, so we will be sure that any relative (or absolute,
+using `lui` instead of `auipc`) jump we want to make can fit in place. The only
+thing we have to take in account is to patch both instructions when we find the
+targets, not only one.
+
+###### Optimization: pointer relaxation {#relaxation}
+
+But using the largest possible jumps can led to inefficiencies because we use
+two instructions for jumps that can potentially fit in just one.
+
+We can use something we saw before for that: relocations.  More specifically,
+in the case of the GCC toolchain, we can use the `R_RISCV_RELAX` that appeared
+before.
+
+The relaxation relocation is going to tell the next step, which can be the
+linker or anything else depending on the context we are working on, that the
+pointer can be relaxed. In the case of the `auipc` + `jalr`, possibly by
+replacing both instructions by a 21 bit jump like `jal`.
+
+So we start with the longest jump possible, but when we actually know the
+target of the jump, we can reduce it to something smaller that needs fewer
+instructions.
+
+###### Example: relaxed global variable access in C compilers {#relax-example}
+
+Global variables, as we saw before, are some of those points where compilers
+need to use relocations and let the linker clean the result.
+
+Global variables don't necessarily involve jumps but they do involve pointers
+for the loads and stores needed to operate with them. In the final executables,
+global variables are part of the `.data` segment, because they are known at
+compilation time, so we can exploit that fact a little and relax our weird
+`auipc` + *load/store* combos.
+
+RISC-V has many registers, so we can use them for things that may not be the
+norm in other platforms where registers are scarce. In this case, we can
+exploit the `gp` (global pointer) register on RISC-V to improve the access to
+the global variables. We can cache the address of the `.data` segment of the
+program in the `gp` register so, as we know most of the global variables are
+going to be near (12 bit offset) to the beginning of the `.data` segment, we
+are probably going to be able to remove some of the `auipc`s we inserted
+before.
+
+So a simple load of a global 64bit variable to a register:
+
+    ::clike
+    auipc t0, offset_to_global_hi  // x[t0] = PC + offset_to_global_hi << 12
+    ld t0, offset_to_global_lo(t0) // x[t0] = mem[ x[t0] + offset_to_global_lo ]
+
+Is optimized to this:
+
+    ::clike
+    ld t0, offset_from_data(gp)    // x[t0] = mem[ x[gp] + offset_from_data ]
+
+Of course, the offsets have to be calculated and all that, but it's not that
+difficult.
+
+
+##### Solution: Veneers {#veneer}
+
+There are other solutions that don't involve messing around with the code we
+generated earlier in an aggressive way like removing instructions, which can be
+pretty bad because you have to shift the array of instructions you generated to
+clean the gaps the pointer relaxation leaves.
+
+Veneers are non destructive, and they involve no instruction reorganization, so
+they are interesting for those cases where you need to generate the code as you
+go.
+
+Let's explain them with an example:
+
+    ::clike
+    beq a0, a1, branch // Jump to `branch` if x[a0] == x[a1]
+
+    // Instructions...
+
+    branch:
+    // Branch target
+
+As we saw previously, if we insert too many instructions in between the jump
+and the target we screw it. What we didn't mention is that as we go assembling
+instructions one by one we can keep a track of the amount of instructions we
+are inserting.
+
+Having that in mind, we can take decisions in time, right before it's too late.
+We can combine that knowledge with the [constant insertion](#constants) method
+introduced before to insert full-range jumps if needed, right before we exhaust
+the possible offset of the original instruction.
+
+Of course, we need to patch the original instruction to jump to the code we are
+just going to insert, and we need to add some protections around the veneer to
+make it only accessible to the original jump.
+
+    ::clike
+    beq a0, a1, veneer      // Jump to `veneer` if x[a0] == x[a1]
+
+    // Many instructions, but not too many!
+
+    // Here we realize we are running out of offset range so we insert a helper
+    // block that lets us jump further.
+    j avoid                 // Jump to `avoid` so the normal execution flow
+                            // doesn't fall in the veneer
+    veneer:
+        auipc t0,0          // x[t0] = PC + 0
+        ld t0,12(t0)        // x[t0] = mem[ x[t0] + 12 ]
+        jalr zero,0(t0)     // PC    = x[t0]
+        ADDRESS(branch)     // Literal address of `branch` label
+    avoid:
+
+    // As many instructions as we want
+
+    branch:
+    // Branch target
+
+As it happened with constant insertion, there are positions where the veneer
+insertion can be optimized a little, like right after a return or an
+unconditional jump, so we don't need the protection (`j avoid` in the example).
+
+The bad thing about veneers is they insert a bunch of instructions in the cases
+that are not in range and the jumps are done in two steps, that has a negative
+effect on the performance because they drop the pre-processed instructions [in
+the pipeline](https://en.wikipedia.org/wiki/Instruction_pipelining).
+
+Of course, the veneers themselves have to be patched too, because we won't know
+the target (`branch` in the example) until we reach it. But, in the case of the
+veneer we can be 100% sure that we are going to be able to point to the target.
+
+###### Example: Lightening's constant pools {#lightening-veneer}
+
+Lightening uses veneers for the jumps[^veneer-if-needed], but they are part of
+Lightening's constant pool mechanism. Constant pools work the same for the
+constant insertion than for veneers, because veneers are basically constants.
+Remember, code is numbers!
+
+[^veneer-if-needed]: Only in the architectures that need them. `x86` does not
+  need constant pools or veneers because the ISA is complex enough to handle
+  the problematic cases adding levels of complexity ISAs like RISC-V or ARM
+  didn't want to deal with. RISC vs CISC, y'know...
+
+Basically anything that might be inserted as a constant, which can be a veneer
+or just a number or whatever, is queued to the constant pool. The library is
+going to emit instructions and check on each instruction if it needs to emit
+any of the constants of the pool.
+
+The constant pool and each of the entries on the pool have associated
+information that tells the emitter if they need to be emitted now or if they
+can wait for later so the emitter can decide to insert them.
+
+The literal pool entries have, of course, an associated relocation that
+contains information of the original jump or load instructions we may need to
+patch, as we already saw. So, in the case of a veneer emission, we need to
+patch the original jump to the veneer and remember the veneer needs to be
+patched later, when we find its target.
+
+The mechanism is not complex, but it's not simple neither. There are several
+kinds of relocations, depending on what we want to do with them, different kind
+of patches we need to do, address calculations and all those things that
+require a level of attention to the detail I'm not prepared to talk about.
+
+#### Problem: Register access {#reg-access}
+
+You may have seen a problematic point in some of the solutions we worked with:
+we are using registers.
+
+It's not a problem by itself, but using registers might be really problematic
+if we are inserting code between the instructions someone else wrote because we
+can't control the register use the original program did and we might be
+changing the values inside of the registers in the magic tricks we sneakily
+inserted.
+
+Imagine we use, say, `t0` register in our veneer but the original program uses
+that register for something else. That's a problem. We are messing with the
+value in the register and potentially (surely) breaking the program.
+
+##### Solution: use the stack {#stack}
+
+The most obvious solution you can think of is to use the stack. We can surround
+our veneers or code insertions with some protection code that saves the values
+of the registers on the stack and restores them when we finished.
+
+It's a simple solution in your mind, but if you need to deal with the jumps it
+can get messy. You may need to restore the register far away in the code and
+keeping track of everything. It can be complicated.
+
+On the other hand, memory access is slow and boring and we don like that kind
+of things in our lives. We need more dynamite.
+
+##### Solution: controlled register access {#controlled-regs}
+
+The other solution we can provide is to keep control of the registers that are
+being accessed and use others for our intervention.
+
+A simple way to do this is to provide functions to get and release temporary
+registers, instead of letting the programmers do whatever they want. This makes
+sure that all the register access is controlled and we are not changing the
+values of any register in use.
+
+The main problem we can have comes when the programmer needs all the register
+for their things and then we can't really use any for our magic tricks. But we
+can always keep at least one register for us and only for us (throwing an error
+to the programmer when they use it) or even combine the use of the stack with
+this solution.
+
+If we are directly working with assembly code, where we can't force the
+programmer to use the interface we want, we can choose the solutions that don't
+involve register access so we don't need to analyze the code to deduce if the
+programmer is using the registers or not. Avoiding the problem is sometimes the
+best solution.
+
+In the case of libraries like Lightening register access control is a must
+because Lightening can't control how its (virtual-) instructions are translated
+to machine code instructions: each machine has its own peculiarities and
+details. In many cases they need to make use of temporary registers and, as the
+instructions are built incrementally, preventing instructions from peeing on
+each other is important.
+
+---
+
+<div style="
+    text-align: center;
+    font-size: smaller;
+    padding-left: 3em;
+    padding-right: 3em;
+    padding-top: 1em;
+    padding-bottom: 1em;
+    border-top: 1px solid var(--border-color);
+    border-bottom: 1px solid var(--border-color)">
+Please, consider supporting me on <a href="https://buymeacoffee.com/ekaitz">Buy
+Me a Coffee</a> or on <a href="https://liberapay.com/ekaitz">Liberapay</a> to
+encourage my free software work.
+</div>
+
+---
+
+### Final thoughts {#final}
+
+I know these are just a few things, but they are enough to let you make you
+first program that involves machine code generation to certain level.
+
+I'm not a computer scientist but a telecommunication engineer[^engineer], so I
+may put the focus on things that are obvious for the average reader of these
+kind of posts, but at the same time I may be flying over things that I
+consider basic due to my studies but the average reader doesn't. In any case,
+feel free to [contact me](https://ekaitz.elenq.tech/pages/about.html) if you
+have questions or corrections.
+
+Some of the tricks and lessons I included here are more important than others,
+but the most important thing is to start thinking in these terms. Try to
+understand the problems you face when you have separate compilation, assume
+the fact that you can't know the future... The mindset is the most important
+point of all this and, once you have it, everything comes easier.
+
+It's also a lot of fun to realize code is just numbers in memory you can mess
+around with. I hope you keep it in your brain forever.
+
+I hope this post throws some light on this dark hole that machine code
+generation is and makes you try to make your own findings on this beautiful
+area of compilers, machines and extremely long blog entries.
+
+
+[^engineer]: So, for all that software developers that write blog posts like
+  "Are we really engineers?" or stuff like that: **I am**, thanks for the
+  interest. LOL
diff --git a/pelicanconf.py b/pelicanconf.py
index bac800c..e8fcc4f 100644
--- a/pelicanconf.py
+++ b/pelicanconf.py
@@ -60,6 +60,8 @@ MENUITEMS = [("Home" , "/index.html"), ("Series", "/tags.html")]
 # Markdown extras
 MARKDOWN = {
     'extension_configs': {
+        'markdown.extensions.attr_list': {},
+        'markdown.extensions.fenced_code': {},
         'markdown.extensions.footnotes': {},
         'markdown.extensions.tables': {},
         'markdown.extensions.codehilite': {
diff --git a/themes/elenq/static/css/elenq-pelican.css b/themes/elenq/static/css/elenq-pelican.css
index 4ca2b77..17bc9cd 100644
--- a/themes/elenq/static/css/elenq-pelican.css
+++ b/themes/elenq/static/css/elenq-pelican.css
@@ -67,6 +67,10 @@ li{
 h2.entry-title{
   font-size: 3.0rem; line-height: 1.2;
 }
+h3.entry-title{
+  line-height: 1.2;
+  margin-bottom: 1ex;
+}
 
 h1{
   text-align: left;
@@ -88,6 +92,7 @@ h1 span.subtitle{
   font-size: 2.5rem;
 }
 
+
 .footnote{
   margin-top: 4em;
   border-top: 1px solid var(--border-color);
diff --git a/themes/elenq/static/css/elenq.css b/themes/elenq/static/css/elenq.css
index ba04f2f..c51af6f 100644
--- a/themes/elenq/static/css/elenq.css
+++ b/themes/elenq/static/css/elenq.css
@@ -18,10 +18,10 @@ q, blockquote{
 }
 h1, h2, h3, h4, h5, h6 {
   text-decoration: none;
-  margin-top: 2ex;
-  margin-bottom: 0;
+  margin-top: 3ex;
+  margin-bottom: 1.5ex;
   line-height: 2;
-  font-weight: 300; }
+}
 h1 {
   font-size: 5.5rem;
   line-height: 1.2;
-- 
cgit v1.2.3