Title: Lessons learned on machine code generation
Date: 2021-06-16
Category:
Tags:
Slug: machine-code-generation
Lang: en
Summary:
    A summary of the lessons I learned about machine code generation during my
    work at Lightening, Hex0 and all my recent research on compilers.


Machine code generation sounded like weird magic to me half a year ago, I
swear, but now it doesn't look so disturbingly complicated. *Nothing in
computer science is that complicated, after all*.


1. [Basics](#basics)
    1. [Machine code is numbers](#what)
        1. [Demonstration](#demo)
    2. [Calling convention](#talk)
    3. [Memory protections](#protections)
    4. [Just-in-Time Compilation](#jit)
        1. [Example: Lightening: Guile's machine code generation
           library](#lightening)
2. [Lessons learned](#problems)
    1. [Problem: Large immediates](#large-imm)
        1. [Solution: multiple instruction expansion](#multi-inst)
        2. [Solution: constant insertion](#constants)
    2. [Problem: Unknown addresses and offsets](#addr-off)
        1. [Solution: relocations](#relocs)
            1. [Example: C compilers](#relocs-c)
            2. [Example: Lightening](#relocs-lightening)
    3. [Problem: Long jumps](#jumps)
        1. [Solution: always insert the largest jump possible](#always-largest)
            1. [Optimization: pointer relaxation](#relaxation)
            2. [Example: relaxed global variable access in C
               compilers](#relax-example)
        2. [Solution: Veneers](#veneer)
            1. [Example: Lightening's veneer system](#lightening-veneer)
    4. [Problem: Register Access](#reg-access)
        1. [Solution: use the stack](#stack)
        2. [Solution: controlled register access](#controlled-regs)
3. [Final thoughts](#final)

### Basics {#basics}

There are many contexts where you may need to generate machine code: writing a
compiler, an assembler, a JIT compiler... In the last months I've been working
on Lightening (the machine code generation library that powers Guile's JIT
compilation), on a RISC-V assembler and interpreter, and on Hex0, which was
introduced in the [previous post](https://ekaitz.elenq.tech/hex0.html) and
required assembling a file by hand.

All of those cases result in the same thing, even if they have different
conditions: we are generating machine code.

In this post I'll talk about some issues that are generic and apply to all
these cases, and about others that are more specific to some of the projects I
mention.

But first we need to clarify some stuff just in case.

#### Machine code is numbers {#what}

> Machine code is what people from the electronics world call "code".

I know you know it, but let's refresh some things about computing we may have
forgotten thanks to all the efforts that [hide the
complexity](https://ekaitz.elenq.tech/hiding-the-complexity.html) of our
everyday business.

Machine code instructions are basically blocks of bits your processor reads
and interprets. Those bit blocks encode all the information the processor
needs: the identifier of the instruction and its arguments.

The identifier is normally known as the *opcode*. The arguments can have many
different meanings depending on the instruction, so we won't get into that.
Instructions normally alter the values of registers, so they need identifiers
for the source and destination registers, or literal values encoded directly
inside the instruction (these are called *immediates*).

Let's put a simple RISC-V example here. Consider this assembly instruction:

    ::asm
    addi a0, zero, 56

This thing you interpret as an assembly instruction that adds `56` to the
`zero` register and stores the result in the `a0` register has to be encoded
in a way the machine is able to understand. Better said, it is encoded in a
way that **you** can understand! The real instruction is a bunch of bits that
represent the same thing.

The RISC-V base ISA has various instruction formats, which depend on the goal
of the instruction. This one uses the `I` format, because it includes an
*immediate*. Read it and compare with the following:

- First, the *opcode*: `addi`, for you, has a binary counterpart: `0010011`.
  That's 7 bits in this instruction format.
- Then the destination register, `a0`, has a binary representation: `01010`.
  There are 32 registers in RISC-V, so each of them is represented by a 5 bit
  value.
- There's some extra space for an opcode-like field called `funct3`: `000`.
- Then there's the source register, `zero`, which is: `00000`. Again, 5 bits.
- And the *immediate* you are adding, `56`, which is just the binary
  representation of `56`: `000000111000`. It's 12 bits wide in this
  instruction format.

Putting it all together:

    ::asm
    000000111000 | 00000 | 000 | 01010 | 0010011

So this forms the following binary value:

    ::asm
    00000011100000000000010100010011

Or in hex:

    ::asm
    0x03800513

> Just in case you didn't realize: you just assembled an instruction by hand.

That instruction we just created is going to be processed by the machine,
which reads each of the fields and activates its circuits as needed, according
to the voltage levels those values represent.

In this case, it's going to activate the ALU to add the numbers and all that
kind of thing, but in other cases it may just change the value of the program
counter, or whatever. All of this is executed by the circuitry of the device,
right after it loads the instruction.

That's for the machine, but for us, from the perspective of a programmer,
instructions are just numbers, as we just saw.


##### Demonstration {#demo}

I purposely used *machine* to refer to the device that runs our instructions,
but we have to be more specific about it now.

I'm going to talk specifically about modern (and common) microprocessors,
because other devices may have peculiarities that can sidetrack us too
hard[^harvard].

[^harvard]: One of those peculiarities is the [Harvard
  Architecture](https://en.wikipedia.org/wiki/Harvard_architecture) that is not
  going to let us make the fantastic trick I'm going to show you now. Harvard
  Architecture is popular on microcontrollers.

In our modern and common microprocessor, [instructions are located in the
memory](https://en.wikipedia.org/wiki/Von_Neumann_architecture). But that's
nothing we didn't know! If we run a binary it's loaded in the memory and
executed from there. We all know that!

But you may be surprised, to a certain degree, if we stretch that a little
bit.

Well, we know from the previous part that instructions are just numbers, and
we know they are loaded from memory, so let's do some C black magic and see
what happens:

    ::clike
    #include<stdint.h>
    #include<stdio.h>

    typedef int f0(void);

    int main(int argc, char* argv[]){
        uint32_t instructions[2];

        instructions[0] = 0x03800513; // addi a0, zero, 56
        instructions[1] = 0x00008067; // jalr zero, ra, 0

        f0 *load_56 = (f0*) instructions; // Reinterpret the array address
                                          // as a function
        int a = load_56();
        printf("%d\n", a);
    }

In that example we build an array of two values. The first one corresponds to
the instruction we encoded by hand before and the second corresponds to `jalr
zero, ra, 0`, the return instruction, which you can encode yourself.

After that we convert the address of the array to a function that returns an
integer and... Boom! We execute the array of numbers.

The code only works on RISC-V, but don't worry, I can tell you that it prints
`56`.

So it was true that the machine can execute stuff from the memory, but what we
may not know is that for the machine there's no actual distinction between
instructions and data[^lisp]. We just executed an array of numbers!

[^lisp]: LISPers are always right.

The machine doesn't care. If it looks like instructions, it executes it.

You can also try to put random values in the array and execute them. An
`Illegal instruction` error is probably going to happen, but if you are lucky
you may execute something by accident. Who knows.

But how did this thing work so well? Why did it return the value correctly and
all that?

#### Calling convention {#talk}

The code worked because we are following the RISC-V ABI, the same one C
follows in the example. It tells us how we need to pass arguments to
functions, how to return values, and all that. The part of the ABI that
defines how to call and return from functions is called the *calling
convention*.

I'm not going to talk about this at length; I'll just say that RISC-V has some
registers for passing arguments: `a0`, `a1`... `a7`. Those registers are also
used for return values.

In the example we don't take any arguments, so we don't need to read from any
of them, but we return one value by simply writing it in `a0`.

With what you know, you can now create a function that gets an input argument
and adds an *immediate* to it. Why don't you try?

On the other hand, the RISC-V ABI defines a register called `ra` that contains
the return address, so we need to jump to it when we want to finish our
function's execution.

There are many more things you can read about, but this is enough to begin.


#### Memory protections {#protections}

The C example where we executed an array is correct, and it runs, but in
reality memory has different kinds of permissions for each of its parts.

Code in memory is normally read-only and executable, and data can be read-only
or not, depending on the goal it has (constant or variable).

If you think about the example above: once the array is set, we can overwrite
it later, or even write to it from the instructions we inserted in it. This
could lead to security issues or unexpected results. That's why code is
normally read-only, and any attempt to write to it will raise an exception to
the kernel.

There are several ways to identify a memory block as code: RISC-V assembly
(like many others) uses the `.text` directive, which automatically marks the
block as read-only and executable; the `mmap` Linux system call takes flags
that indicate the protections of the memory block (`PROT_EXEC`, `PROT_READ`,
`PROT_WRITE`...); etc.


#### Just-in-Time Compilation {#jit}

Just-in-time (JIT) compilation is a way to execute programs that involves a
compilation step at runtime. Typically this happens in interpreted programs,
where the interpreter consumes part of the execution time. An interpreter with
a JIT compilation feature is able to compile parts of the code it's going to
run to machine code, speeding up the execution of those parts.

Clever interpreters are able to predict if the time they need to compile and
execute the JIT Compiled parts is less than the time they need to interpret
them, so they can decide if it's worth the effort.

Normally, JIT compilation is more effective on pieces of code that are
executed many times, because the code only needs to be compiled once and the
speedup is obtained on every execution. But many strategies are possible, and
parts of the code may be recompiled looking for different optimizations while
the interpreter collects data about the performance of the program.

Explained like this it looks like a complex thing to do (and it is), but with
the previously mentioned points we can imagine a simple JIT machine code
generation library. We "just" need to:

- Know what code to generate (choose a function to compile, this step may need
  some code analysis).
- Reserve some space (`malloc`, `mmap`...)
- Fill the space with numbers (the machine code instructions resulting from the
  compilation of the function).
- Next time the program wants to call the function we compiled, call the
  numbers instead (as we did [in the demonstration](#demo)).

##### Example: Lightening, Guile's machine code generation library {#lightening}

The just-in-time compilation process in Guile is simple, but effective[^guile].
Guile uses a library called Lightening for it. Lightening is a template-like
library that defines a virtual instruction set. That instruction set is
translated by the library to the instruction set of the actual machine.

Implementing support for another architecture is as simple as implementing all
the translation code for the new architecture. That's [what I've been doing
these days](https://ekaitz.elenq.tech/lightening.html).

Guile's JIT compiler only needs to call the library's instructions, which
generate the actual machine code by themselves, packaged in a function the
interpreter will be able to call later.

Lightening is simple because it doesn't need to compile from source code, or
do code analysis to find which parts of the code it needs to compile. It just
exposes an API that looks like an instruction set, and that's what gets
translated to machine code.

The JIT calls Lightening's API, creating more complex operations by combining
Lightening's instructions, and Lightening converts those operations to machine
code through a simple translation, filling the array of numbers and returning
its address as a function pointer we can call later.

Of course, it is more complex than that, because it needs to solve many of the
other problems we'll talk about later, but that's the idea. And the idea
doesn't sound too difficult once you have in mind what we talked about
previously.

[^guile]: You can read more about [how it works here][guile-jit].

[guile-jit]: https://www.gnu.org/software/guile/manual/html_node/Just_002dIn_002dTime-Native-Code.html


### Lessons learned {#problems}

There are many problems that a machine code generation library like that can
encounter, but they are not exclusive to those libraries. These kinds of
problems can also appear in compilers, assemblers and many other things.

The lessons I learned come as problems I encountered during these days of
digging and implementing, along with some possible solutions and thoughts
about them.

#### Problem: Large immediates {#large-imm}

Large *immediates* are one of the most obvious but boring issues in this world,
and they apply to many cases.

In the [example above](#what) we encoded an `addi` instruction that added
`56`, an *immediate*, to a register, and we said the *immediate* has a 12 bit
space in the instruction. Registers in RISC-V are 32 bits (in RV32) or 64 bits
(in RV64) wide, so we can work with larger values, but we are limited to 12
bit immediates in `addi` and all the other `I`-type instructions.

Why is that? Well, RISC-V instructions are 32 bits and they need to pack much
more information than just the *immediate*, so *immediates* can't be as large
as we want. The fixed instruction size is a design decision that keeps the
processor simple; other processors made different design decisions[^x86]
around this.

[^x86]: In x86, instructions don't all have the same length, and some can
  encode larger *immediates*.

##### Solution: multiple instruction expansion {#multi-inst}

There are several solutions for this, but the most obvious one is to use more
than one instruction to build the immediate.

If we want to load a 64 bit value, we can add, shift left, add, shift left,
add... until we fill a whole register with the value we were looking for.

This means a simple addition can be expanded to many instructions. In some
cases just a few, but as the *immediates* get big we may need more than eight
carefully encoded instructions to write the immediate to a register and be
able to operate with it.

##### Solution: constant insertion {#constants}

This is not a solution we can use everywhere, but we can use it in the context
we are in right now (code is stored in memory and all that, remember).
Consider this RV64 code:

    ::clike
    auipc t0, 0             // x[t0] = PC + 0
    ld t0, 12(t0)           // x[t0] = mem[ x[t0] + 12 ]
    jal zero, 12            // PC    = PC + 12 (skip the 8 byte literal)
    0xAAAAAAAAAAAAAAAA      // This is a 64 bit literal
    addi t0, t0, 1          // x[t0] = x[t0] + 1
    // What's the value of t0 here?

The code has comments on the right that I'm going to use through the whole
post, so get used to them. `x` means register access (the base registers are
called X registers in RISC-V) and `mem` is memory. `PC` (the program counter)
is written in uppercase, and not as if it were a register, because it's not
directly accessible by the programmer: we have to treat it as a global
variable we can only set using jumps, or get using `auipc`.

RISC-V instructions are 32 bits long (4 bytes), so you can work out what the
offset in the `ld` instruction does, right?

Basically, we are loading the doubleword (`ld`) at the position of the
`0xAAAAAAAAAAAAAAAA` literal into the `t0` register and adding `1` to it. So
the answer to the question is `0xAAAAAAAAAAAAAAAB`.

But can you see the trick we are using?

The `jal` instruction jumps over the constant so we can't execute it by
accident (which would cause an `Illegal instruction` error), and using the
`ld` instruction we are able to load a big constant into a register. A
constant which **is mixed with the code**, as any immediate would be, but
without being associated with any instruction.

If we know the code we are generating is a function, we can always wait until
the return instruction and insert all the constants after it, so they are
perfectly separated and we don't need to insert jumps to avoid executing the
constants by accident. In that case, we need to change the values of the
`auipc` and the `ld` accordingly, making them point to the correct address,
which has some associated issues we need to talk about now.


---

<div style="
    text-align: center;
    font-size: smaller;
    padding-left: 3em;
    padding-right: 3em;
    padding-top: 1em;
    padding-bottom: 1em;
    border-top: 1px solid var(--border-color);
    border-bottom: 1px solid var(--border-color)">
Keep in mind you can hire <a href="https://elenq.tech">ElenQ
Technology</a> if you like this kind of material. <br/>
We teach with this mixture of passion and awkward charisma. We also code and
research.
</div>

---


#### Problem: Unknown addresses and offsets {#addr-off}

Addresses and offsets are a pain in the ass because you may not know them when
you need them.

Let's consider an unconditional jump like the one in the previous example. The
number we introduce is the distance to jump from the program counter: an
offset. The *immediate* offset can be positive, for forward jumps, or
negative, for backward jumps.

    ::clike
    jal zero, 12            // PC    = PC + 12

Generating this jump supposes that you know where you need to jump: you want
to jump 12 bytes (3 instructions) into *the future*.

But imagine you are assembling a file, a real assembly file that is not an
oversimplification like the one in the previous example. A real assembly file
with *labels*:

    ::clike
    add a0, a0, t1     // I don't care about this instruction
    j end              // Unconditional jump to `end`

    // Some code here

    end:               // Label `end`
        ret            // return

If you are assembling this file line by line, you can actually assemble the
`add` in the first line, because you know *everything* about it, but you are
unable to emit the `j end` because you don't know where `end` is *yet*.

If this assembly is written in a file, you can always preprocess the whole
file, collect the labels, associate them with their addresses and then
assemble the whole thing, but you are not always in this situation.

Lightening, for instance, generates the code as you call the API, so it doesn't
know where your jump points to until you call the API for the label later.

Compilers may encounter this issue too, when they use separate compilation and
linking steps. You must be able to compile one source file on its own, but you
may not know where the global variables are, because they might be defined in
a different file, and you only know that at link time.

##### Solution: relocations {#relocs}

There's one simple way to solve it: introduce a fake offset or address and
patch it later, when we know the position of the symbol. That's what
relocations do.

###### Example: C compilers {#relocs-c}

Relocations are a mechanism to pass information from the compiler to the
linker, and you can actually see them in the object files generated by your
compiler. Make a simple file with a global variable and compile it. Something
like this:

    ::clike
    int global_symbol;
    int main(int argc, char* argv[]){
        return global_symbol !=0;
    }

If you compile it with `gcc -c`, you can inspect the relocations in the result
with `objdump`, using the `-r` flag alongside `-d` to disassemble. In RISC-V
you'll find things like `R_RISCV_HI20` or `R_RISCV_LO12_I` where the
relocations are located. They are ways to encode *immediates* in `U`-type and
`I`-type instructions respectively. In my case I get something like this (it's
not the full result):

    ::clike
       6:       00000797                auipc   a5,0x0
                6: R_RISCV_PCREL_HI20   global_symbol
                6: R_RISCV_RELAX        *ABS*
       a:       00078793                addi    a5,a5,0x0
                a: R_RISCV_PCREL_LO12_I .L0 
                a: R_RISCV_RELAX        *ABS*
       e:       639c                    ld      a5,0(a5)

There are two types of relocations here, but we'll talk about `R_RISCV_RELAX`
later. You can also see my relocations have `PCREL` in the middle: that means
they are relative to the program counter.

If you just inspect the binary with `-d`, you won't see the relocations and
the result will look like nonsense code[^nonsense-code]:

    ::clike
        6:       00000797   auipc   a5,0x0         // x[a5] = PC + 0
        a:       00078793   addi    a5,a5,0x0      // x[a5] = x[a5] + 0
        e:       639c       ld      a5,0(a5)       // x[a5] = mem[ x[a5] + 0 ]

This adds `0` to the program counter and stores the result in `a5`, then adds
`0` to `a5`, and loads a doubleword into `a5` from the address in `a5`. But
the address in `a5` at the moment of the load is nothing but the program
counter at the `auipc` instruction. Weird.

The relocations point to the `auipc` and the `addi`, and tell the linker it
has to replace the zeros with another value. Which one? The address of the
global variable. If we replace the zeros with a combination that loads the
address of the global variable, the code will work. That's what the relocation
does here.

So, as we don't know where to point, we insert anything (zeros) and we fix the
instructions once we know where they need to point.

[^nonsense-code]: `addi a5,a5,0x0` adds 0 to `a5` and stores it in `a5`, so
  it just moves it. RISC-V has a pseudoinstruction for that: `mv a5,a5`, which
  expands to the `addi`. `objdump` writes `mv` in its output, because it tries
  to be clever, but that hides the goal of the instruction we have here. I
  changed it to the actual instruction so we can understand this better.

###### Example: Lightening {#relocs-lightening}

The same approach is followed in Lightening, and you can follow it in your
assembler, library or anything that has a similar problem. Let's consider some
code using Lightening (obtained from `tests/beqr.c`, comments added by me):
    
    ::clike
    // Make a function that loads two arguments
    jit_load_args_2(j, jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R0),
                    jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R1));
    
    jit_reloc_t r = jit_beqr(j, JIT_R0, JIT_R1); // branch if equal registers
    jit_leave_jit_abi(j, 0, 0, align);           // end ABI context
    jit_reti(j, 0);                              // return 0
    jit_patch_here(j, r);                        // make the branch jump here
    jit_leave_jit_abi(j, 0, 0, align);           // end ABI context
    jit_reti(j, 1);                              // return 1

    // Obtain the function we created
    jit_word_t (*f)(jit_word_t, jit_word_t) = jit_end(j, NULL);

    // Test if it works
    ASSERT(f(0, 0) == 1);       // 0 == 0 so it jumps        -> returns 1
    ASSERT(f(0, 1) == 0);       // 0 != 1 so it doesn't jump -> returns 0

In this example we see how we generate machine code statement by statement, so
there's no way to know where the `beqr` needs to jump until we have generated
all the code up to its target.

You can see the `beqr` function doesn't receive the target address or offset
as an argument; instead it returns a `jit_reloc_t`, which other functions like
`reti` don't.

That `jit_reloc_t` is what we patch later with `jit_patch_here`, indicating
where it needs to jump. The `jit_patch_here` function corrects the bits we set
to zero because we didn't know the target at that moment.

There are different kinds of relocations, as in the previous example with the
compilers, because different instruction formats need to be patched in
different ways. In the case of Lightening, each relocation has a type
associated with it, so we can check it and act accordingly.


#### Problem: Long jumps {#jumps}

As we saw, some jumps encode the target as an *immediate*. This has a couple of
implications that we described previously:

- The jump target could be larger than the space we have for the immediate.
- Sometimes we can't know the target until we reach the position where the jump
  points to.

Both issues can be combined together in a killer combo. Consider this code:

    ::clike
    j label     // jump to label

    // A lot of instructions here

    label:
        // this is the target of the jump

In RISC-V the `j` pseudoinstruction resolves to `jal`, which has a 21 bit
(signed) space for the jump offset. If there is a huge number of instructions
between the jump and the target, we may need more bits for the jump than the
space we actually have.

Again, in the case where we can preprocess everything there's no problem, but
if we are assembling the instructions as they come we are going to have
issues. We realize we can't jump that far too late: by the time we reach the
label, we have already inserted a 21 bit jump and too many instructions.
Patching the jump is not enough, because we didn't leave enough space to
insert the offset we need.

##### Solution: always insert the largest jump possible {#always-largest}

There's an obvious solution: always insert the largest possible jump and patch
the whole jump later.

In RISC-V, `jalr` jumps to the absolute address stored in a register, plus an
optional 12 bit (signed) offset. Combined with `auipc` (add upper immediate to
program counter), it lets us make 32 bit relative jumps in just 2
instructions. Let's explain that in code, just in case:

    ::clike
    auipc t0, offset_hi         // x[t0] = PC + (offset_hi<<12)
    jalr zero, offset_lo(t0)    // PC    = x[t0] + offset_lo

If we consider `offset` a value we know, we can split it in two blocks: the
highest 20 bits as `offset_hi` and the lowest 12 bits as `offset_lo`, and use
them to jump to any address in the 32 bit range around the current position,
using just 2 instructions.

On 32 bit machines, this is the largest jump possible, because the machine can
only address 32 bits, so we can be sure that any relative (or absolute, using
`lui` instead of `auipc`) jump we want to make will fit in place. The only
thing we have to take into account is to patch both instructions when we find
the target, not just one.

###### Optimization: pointer relaxation {#relaxation}

But always using the largest possible jump can lead to inefficiencies, because
we use two instructions for jumps that could potentially fit in just one.

We can use something we saw before for that: relocations. More specifically,
in the case of the GCC toolchain, the `R_RISCV_RELAX` relocation that appeared
before.

The relaxation relocation tells the next step (which can be the linker, or
anything else depending on the context we are working in) that the pointer can
be relaxed: in the case of the `auipc` + `jalr` pair, possibly by replacing
both instructions with a single 21 bit jump like `jal`.

So we start with the longest jump possible, but when we actually know the
target of the jump, we can reduce it to something smaller that needs fewer
instructions.
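The decision itself is simple range arithmetic. This toy function (not the
real linker logic, just a sketch) decides whether the pair can relax, using
`jal`'s 21-bit signed, 2-byte-aligned offset range:

```python
def relax_jump(offset):
    """Decide whether an `auipc` + `jalr` pair can relax to a single `jal`.

    `jal` encodes a 21-bit signed, 2-byte-aligned offset, so any even
    offset in [-2**20, 2**20 - 2] fits in one instruction.
    """
    if offset % 2 == 0 and -2**20 <= offset <= 2**20 - 2:
        return ["jal"]               # one instruction: relaxed
    return ["auipc", "jalr"]         # keep the full-range pair

assert relax_jump(4096) == ["jal"]
assert relax_jump(2**21) == ["auipc", "jalr"]
```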

###### Example: relaxed global variable access in C compilers {#relax-example}

Global variables, as we saw before, are some of those points where compilers
need to use relocations and let the linker clean the result.

Global variables don't necessarily involve jumps, but they do involve pointers
for the loads and stores needed to operate with them. In the final executable,
global variables live in the `.data` segment, whose location is known at link
time, so we can exploit that fact a little and relax our weird `auipc` +
*load/store* combos.

RISC-V has many registers, so we can use them for things that may not be the
norm on platforms where registers are scarce. In this case, we can exploit
RISC-V's `gp` (global pointer) register to speed up access to global
variables. We cache the address of the program's `.data` segment in `gp`;
since most global variables are going to be within a 12-bit offset of the
beginning of `.data`, we can probably remove some of the `auipc`s we inserted
before.

So a simple load of a global 64-bit variable into a register:

    ::clike
    auipc t0, offset_to_global_hi  // x[t0] = PC + offset_to_global_hi << 12
    ld t0, offset_to_global_lo(t0) // x[t0] = mem[ x[t0] + offset_to_global_lo ]

Can be optimized to this:

    ::clike
    ld t0, offset_from_data(gp)    // x[t0] = mem[ x[gp] + offset_from_data ]

Of course, the offsets have to be calculated and all that, but that's not too
difficult.
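The check the linker has to do boils down to one subtraction and a range test.
A small sketch (the `gp` value here is a made-up address, just for
illustration):

```python
GP = 0x80000800   # hypothetical address cached in the gp register

def global_load_insns(addr, gp=GP):
    """Return how many instructions loading the global at `addr` needs."""
    off = addr - gp
    if -2048 <= off <= 2047:   # fits in the 12-bit signed immediate
        return 1               # ld t0, off(gp)
    return 2                   # auipc + ld, the unrelaxed form

assert global_load_insns(GP + 1024) == 1      # near .data: one instruction
assert global_load_insns(GP + 0x10000) == 2   # too far: keep the pair
```

Note that `gp` is usually pointed a bit *inside* `.data` (not at its very
start) so the signed offset reaches variables on both sides of it.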


##### Solution: Veneers {#veneer}

There are other solutions that don't involve aggressively messing with the
code we generated earlier. Removing instructions can be pretty bad: you have
to shift the array of instructions you generated to close the gaps that
pointer relaxation leaves.

Veneers are non-destructive and involve no instruction reorganization, so they
are interesting for those cases where you need to generate the code as you go.

Let's explain them with an example:

    ::clike
    beq a0, a1, branch // Jump to `branch` if x[a0] == x[a1]

    // Instructions...

    branch:
    // Branch target

As we saw previously, if we insert too many instructions between the jump and
the target, we screw it up. What we didn't mention is that, as we assemble
instructions one by one, we can keep track of the number of instructions we
are inserting.

With that in mind, we can make decisions in time, right before it's too late.
We can combine that knowledge with the [constant insertion](#constants) method
introduced before to insert full-range jumps when needed, right before we
exhaust the offset range of the original instruction.

Of course, we need to patch the original instruction to jump to the code we
are about to insert, and we need to add some protection around the veneer so
it is only reachable from the original jump.

    ::clike
    beq a0, a1, veneer      // Jump to `veneer` if x[a0] == x[a1]

    // Many instructions, but not too many!

    // Here we realize we are running out of offset range so we insert a helper
    // block that lets us jump further.
    j avoid                 // Jump to `avoid` so the normal execution flow
                            // doesn't fall in the veneer
    veneer:
        auipc t0,0          // x[t0] = PC + 0
        ld t0,12(t0)        // x[t0] = mem[ x[t0] + 12 ]
        jalr zero,0(t0)     // PC    = x[t0]
        ADDRESS(branch)     // Literal address of `branch` label
    avoid:

    // As many instructions as we want

    branch:
    // Branch target

As with constant insertion, there are positions where the veneer insertion can
be optimized a little, like right after a return or an unconditional jump,
where we don't need the protection (`j avoid` in the example).

The bad thing about veneers is that they insert a bunch of instructions for
the out-of-range cases, and the jumps happen in two steps, which has a
negative effect on performance because each extra jump drops the pre-processed
instructions [in the
pipeline](https://en.wikipedia.org/wiki/Instruction_pipelining).

Of course, the veneers themselves have to be patched too, because we won't
know the target (`branch` in the example) until we reach it. But, in the case
of the veneer, we can be 100% sure we'll be able to point to the target.
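The "right before it's too late" test is just bookkeeping on byte positions.
This is a rough, conservative sketch (the veneer size assumes the 5-slot
layout from the example above, with an 8-byte address literal; a real emitter
would track each pending branch separately):

```python
BEQ_MAX_FORWARD = 4094   # `beq` encodes a 13-bit signed, 2-byte-aligned offset
VENEER_SIZE = 4 * 4 + 8  # j + auipc + ld + jalr, plus the 64-bit address slot

def must_emit_veneer(branch_pos, current_pos):
    """Check, before emitting one more 4-byte instruction, whether a pending
    branch at `branch_pos` would be pushed out of range if we waited: the
    branch must still be able to reach a veneer placed after that instruction.
    """
    return current_pos + 4 + VENEER_SIZE - branch_pos > BEQ_MAX_FORWARD

assert must_emit_veneer(0, 100) is False    # plenty of range left
assert must_emit_veneer(0, 4068) is True    # emit the veneer now!
```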

###### Example: Lightening's constant pools {#lightening-veneer}

Lightening uses veneers for jumps[^veneer-if-needed], but they are part of
Lightening's constant pool mechanism. Constant pools work the same for
constant insertion as for veneers, because veneers are basically constants.
Remember, code is numbers!

[^veneer-if-needed]: Only on the architectures that need them. `x86` doesn't
  need constant pools or veneers because the ISA is complex enough to handle
  the problematic cases, adding levels of complexity that ISAs like RISC-V or
  ARM didn't want to deal with. RISC vs CISC, y'know...

Basically, anything that might be inserted as a constant (a veneer, a plain
number, whatever) is queued in the constant pool. The library emits
instructions and checks, on each instruction, whether it needs to emit any of
the constants in the pool.

The constant pool, and each entry in it, has associated information that tells
the emitter whether it needs to be emitted now or can wait for later, so the
emitter can decide when to insert it.

The literal pool entries have, of course, an associated relocation that
contains information about the original jump or load instruction we may need
to patch, as we already saw. So, in the case of a veneer emission, we need to
patch the original jump to point to the veneer, and remember that the veneer
itself needs to be patched later, when we find its target.
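A toy model of the idea (this is *not* Lightening's actual API, just the
shape of the mechanism): each queued entry remembers where its user
instruction lives, so on every emitted instruction we can ask which entries
must be flushed before they go out of their user's reach.

```python
LOAD_REACH = 4092   # assumed reach of the pc-relative load, in bytes

class ConstantPool:
    """Queue of pending constants (literals, veneers...) with deadlines."""

    def __init__(self):
        self.entries = []   # (user_pos, value) pairs waiting to be emitted

    def add(self, user_pos, value):
        self.entries.append((user_pos, value))

    def flush_due(self, current_pos):
        """Pop and return the entries that must be emitted right now,
        because waiting one more instruction (plus the entry itself)
        would put them out of their user's reach."""
        due = [e for e in self.entries
               if current_pos + 4 - e[0] > LOAD_REACH - 8]
        self.entries = [e for e in self.entries if e not in due]
        return due

pool = ConstantPool()
pool.add(0, 0xDEAD)
assert pool.flush_due(100) == []             # still in range, keep waiting
assert pool.flush_due(4090) == [(0, 0xDEAD)] # deadline reached: emit it
```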

The mechanism is not complex, but it's not simple either. There are several
kinds of relocations, depending on what we want to do with them, different
kinds of patches we need to apply, address calculations, and all those things
that require a level of attention to detail I'm not prepared to talk about.

#### Problem: Register access {#reg-access}

You may have seen a problematic point in some of the solutions we worked with:
we are using registers.

It's not a problem by itself, but using registers might be really problematic
if we are inserting code between instructions someone else wrote: we can't
control how the original program uses its registers, and the magic tricks we
sneakily inserted might be changing the values inside them.

Imagine we use, say, the `t0` register in our veneer, but the original program
uses that register for something else. That's a problem: we are messing with
the value in the register and potentially (surely) breaking the program.

##### Solution: use the stack {#stack}

The most obvious solution you can think of is to use the stack. We can
surround our veneers or code insertions with some protection code that saves
the values of the registers on the stack and restores them when we finish.

It's a simple solution in your mind, but if you need to deal with jumps it can
get messy: you may need to restore the register far away in the code, and
keeping track of everything can be complicated.

On the other hand, memory access is slow and boring, and we don't like that
kind of thing in our lives. We need more dynamite.

##### Solution: controlled register access {#controlled-regs}

The other solution is to keep control of the registers that are being accessed
and use free ones for our intervention.

A simple way to do this is to provide functions to get and release temporary
registers, instead of letting programmers do whatever they want. This makes
sure all register access is controlled and we are not changing the value of
any register in use.

The main problem comes when the programmer needs all the registers for their
own things, and then we can't use any for our magic tricks. But we can always
keep at least one register for us and only for us (throwing an error when the
programmer tries to use it), or even combine this solution with the use of the
stack.
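A hypothetical get/release interface (the register names and the reserved
register are just illustrative choices, not any real library's convention):

```python
class TempRegs:
    """Hand out free temporary registers; keep one reserved for the emitter."""

    def __init__(self, regs=("t0", "t1", "t2"), reserved="t6"):
        self.free = list(regs)     # registers the programmer may borrow
        self.reserved = reserved   # ours and only ours

    def get(self):
        if not self.free:
            raise RuntimeError("no temporary registers left")
        return self.free.pop()

    def release(self, reg):
        if reg == self.reserved:
            raise ValueError(f"{reg} is reserved for the code generator")
        self.free.append(reg)

regs = TempRegs()
r = regs.get()
# ... emit instructions that use `r`, knowing nobody else will touch it ...
regs.release(r)
```

Because every temporary goes through `get`/`release`, the emitter always
knows which registers are live and can safely pick a free one (or its
reserved one) for veneers and other inserted tricks.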

If we are working directly with assembly code, where we can't force the
programmer to use the interface we want, we can choose the solutions that
don't involve register access, so we don't need to analyze the code to deduce
whether the programmer is using the registers or not. Avoiding the problem is
sometimes the best solution.

In the case of libraries like Lightening, register access control is a must,
because Lightening can't control how its (virtual) instructions are translated
to machine code: each machine has its own peculiarities and details. In many
cases they need temporary registers and, as the instructions are built
incrementally, preventing instructions from peeing on each other is important.

---

<div style="
    text-align: center;
    font-size: smaller;
    padding-left: 3em;
    padding-right: 3em;
    padding-top: 1em;
    padding-bottom: 1em;
    border-top: 1px solid var(--border-color);
    border-bottom: 1px solid var(--border-color)">
Please, consider supporting me on <a href="https://buymeacoffee.com/ekaitz">Buy
Me a Coffee</a> or on <a href="https://liberapay.com/ekaitz">Liberapay</a> to
encourage my free software work.
</div>

---

### Final thoughts {#final}

I know these are just a few things, but they are enough to let you make your
first program involving machine code generation, to a certain level.

I'm not a computer scientist but a telecommunications engineer[^engineer], so
I may focus on things that are obvious to the average reader of this kind of
post while, at the same time, flying over things I consider basic due to my
studies but the average reader doesn't. In any case, feel free to [contact
me](https://ekaitz.elenq.tech/pages/about.html) if you have questions or
corrections.

Some of the tricks and lessons I included here are more important than others,
but the most important thing is to start thinking in these terms. Try to
understand the problems you face when you have separate compilation, assume
the fact that you can't know the future... The mindset is the most important
point of all this and, once you have it, everything comes easier.

It's also a lot of fun to realize code is just numbers in memory you can mess
around with. I hope you keep that in your brain forever.

I hope this post sheds some light on the dark hole that machine code
generation is, and encourages you to make your own findings in this beautiful
area of compilers, machines and extremely long blog entries.


[^engineer]: So, for all those software developers who write blog posts like
  "Are we really engineers?" or stuff like that: **I am**, thanks for the
  interest. LOL