summaryrefslogtreecommitdiff
path: root/content/hex0.md
blob: 4e5ffe417729345022c61dfc9e00e64e26be2aec (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
Title: RISC-V Adventures II: hex0
Date: 2021-06-08
Category:
Tags:
Slug: hex0
Lang: en
Summary:
    A love story about trust, machine code, hexadecimal notation and weird
    instruction formats, with an epic unexpected solution coming back from the
    afterlife.

[Stage0](https://github.com/oriansj/stage0) is a crazy project that is pretty
well aligned with our vision of trust, bootstrappable software and whatnot.

During the last two weeks we have been working on the port of Stage0 to
RISC-V, providing the very first step of the process, so we came here to talk
about it, including a fantastic software necromancy moment you are going to
enjoy.

### The origin of the times

Once upon a time, software was written in machine code. Directly expressing the
machine instructions by the hands of the *programmers*. That was long, long,
time ago.

One day some programmer decided to write a translator that mapped that machine
code to something more human readable and created what we call *assembly
language* today. It gained popularity and programmers decided to add more and
more functionalities to the *assembly language* until the point that what they
created was not a one-to-one mapping with machine code anymore.

That's how the first *programming languages* were born.

Their power was so immense that programmers decided to rewrite all their tools
using the new *programming languages*, they even wrote newer *programming
languages* with them.

But power corrupts the mind of the fool. Blinded by the power of *programming
languages*, most of the programmers forgot the origin of the times, and
forgetting the history is always a mistake.

*Epic music starts...*


### The problem

> Warning: I oversimplified during the beginning of the post, but now... Oh
> boy! I'm going to flatten this shit.

Well, we need auditable software. I don't think anyone can deny that fact.

But what does "auditable software" mean? Isn't free software enough?

It is true that the best way to audit stuff is to read the code of the
programs. It's the classic way we had to know if a program is doing what we
want it to. But, how can you be really sure the code you are reading is the one
that is shipping with your program? 

You can't! In general it's impossible to know.  There are many reasons, but I
will oversimplify and give you just some thoughts and let the people from
[bootstrappable](http://bootstrappable.org/) do the dirty job.

1. The compilation process is not reproducible, so the same source can result
   in different binaries. You can't just compare different binaries to make
   sure your compiler is compiled correctly.

2. We have no way to solve the chicken-egg problem.  The recipe to build the
   compiler version X is to get the sources of the compiler version X and
   compile them with the compiler version X-1. But how do you get the compiler
   version X-1? Rinse and repeat.  
   Also... Where's the first version of your compiler? Does it run in modern
   machines with modern operating systems?

3. As there's no real way to get your compilers compiled by yourself, there's
   no real way to be sure that the compilers are emitting the code they are
   supposed to.  You have to [**trust
   them**](https://www.cs.umass.edu/~emery/classes/cmpsci691st/readings/Sec/Reflections-on-Trusting-Trust.pdf),
   and you can't audit what you need to trust.

The first point is where projects like Nix and Guix have sense: they try to
create reproducible stuff. Not only for compilation processes but also for
scientific studies (that have to be reproduced by other people, because...
that's how science works, isn't it?) and other things. Being able to create
identical environments where you can ensure that their (compilation) output is
going to be identical is extremely important, but I'll leave that for now.

The second and third point are two different problems but they result in the
same: software distributions ship huge binary blobs (hundreds of megabytes) as
an starting point (bash, gcc...) so the users have no chance to check if those
binaries are corrupt.

[GNU Mes](https://www.gnu.org/software/mes/) is a Scheme interpreter and C
compiler that was designed to reduce the size of the binaries you need to ship
with your distro. Mes has successfully reduced the size of the binaries that
need to be shipped with distros like Guix, but the project is more ambitious
that that.

### Full-source bootstrap

[Stage-0](https://savannah.nongnu.org/projects/stage0/) is a project that is
tackling the same "trusting trust" problem but from the opposite perspective,
starting in the low-level, rather than the high-level approach that GNU Mes
uses.

Both projects work together to provide a greater goal: the full-source
bootstrap. The whole bootstrap is started from source, with no binaries
involved, so the distros don't need to ship binaries anymore.

But how is that?

### Hex0

Stage0 starts in the low level, the lowest possible, and builds more complex
programs from there, step by step.

The first step, Hex0, is a self-hosting "assembler". I quote the word assembler
there because I think it's a very strong word for this: it's an ELF file
written in hexadecimal, with extra comments.

Hex0 is able to compile itself to a binary ELF file, converting the Hexadecimal
values to the binary values and stripping the comments.

We still have to compile the first Hex0 with something but that's not as
difficult as compiling the first GCC because basically Hex0 can be compiled by
very simple programs or even by hand, because it contains literally what is
going to be written in the final ELF.

The comments on the Hex0 files describe the instructions on each of the lines
of the ELF file so the resulting files can be audited, instruction by
instruction, with the manual of the ISA as a reference.

This starting point is more than enough to build on top of. We just need to add
more functionalities to the next steps: labels, constants... Until we are able
to compile a simple C compiler, Mes or anything.

It's a clever solution for a crazy problem.

### Hex0 in RISC-V: my experience

So this is my blog and I come here to talk about myself![^1]

Some weeks ago I had the chance to make that first step, Hex0, for RISC-V (64
bit). You can take a look to [the code here](https://github.com/oriansj/bootstrap-seeds/pull/2/files).

There you can see I added three files: the assembly file, the hex0 file and the
binary of the compiled hex0. They are basically the same thing, but they are
included for readability.

#### The assembly

The first step for me was to write the assembly file. It's easy once you know
how to make system calls in POSIX.

**POSIX system calls in RISC-V** are pretty easy:

- Load the arguments for the call in registers `a0`, `a1`...
- Load the *syscall number* in `a7`
- Run `ecall`

The result of the system call comes in `a0`.

**Input arguments** are also important, because we need to be able to tell to
Hex0 which is the file we want to compile, and where to put its output.

That's pretty easy, input arguments are inserted in the stack, so we can load
them by *pop*-ing them. As in any C program, the first element we get is the
amount of arguments and the rest of them are the arguments themselves.

Putting all together, if you take a look to the first block of the program:

    ::asm
    _start:
    ld a0, 0(sp)         # Get number of the args
    ld a1, 8(sp)         # Get program name
    ld a2, 16(sp)        # Input file name

    ...

    # Open input file and store FD in s2
    li a7, 56             # sys_openat
    li a0, -100           # AT_FDCWD
    mv a1, a2             # input file
    li a2, 0              # read only
    ecall
    mv s2, a0             # Save fd in for later

In the first chunk we are reading the stack, element by element, and in the
second we are opening the input file, using the filename we just obtained from
the stack.

Simple.

As a note, I'd like to remind you to finish your program, because if you don't
it will continue to execute the memory after it and it'll explode in your face:

    ::asm
    terminate:
    # Terminate program with 0 return code
    li a7, 93             # sys_exit
    li a0, 0              # Return code 0
    ecall

This tells the OS to finish the execution.

The internals of the assembly file are simple, I won't explain them in detail.
It basically iterates character by character, removing the comments and
converting the hex value couples to a byte.

Read it and tell me if you need help understanding it![^contact]

#### The conversion to the Hex0

Hex0, as I said, is an ELF file, written in hexadecimal, so we need to compile
our assembly file to binary and represent each of the instructions in
hexadecimal. And we need to resolve all the labels to final addresses.

There's no easy way to do it. I started doing it by hand, reading the RISC-V
spec and converting the instructions one by one. But I tackled several
difficulties doing that.

Pseudoinstructions are expanded to more than one instruction so we need to be
careful in the comments and explain that correctly. Also, we need to resolve
the addresses accordingly. For example:

    ::asm
    la a1, buffer

This is a pseudoinstruction, and we need to resolve it to:

    ::asm
    auipc a1, 0
    addi a1, a1, $OFFSET

Where `$OFFSET` is the offset from that instruction to the label `buffer`.

These kind of expansions change our perception of the amount of instructions we
have and we have to be extremely careful. I didn't even mention the case where
the offset is very large! That's another story (thankfully we don't have to
deal with yet).

Once the pseudoinstructions are expanded we need to convert them to the hex
value, and I swear it's the most boring task I ever made in my life. Basically
because the RISC-V instructions are not easy to map to their binary (for
reasons related with the hardware implementation).

##### The eagles are coming!

But I had a trick, a deus ex machina that would safe my life. During the last
months I've been randomly working on a Scheme compiler for RISC-V assembly and
that made me start making a [RISC-V assembly interpreter and compiler in
python](http://git.elenq.tech/pysc-v/).  It's still an early WIP, and was
almost abandoned, but it has the basic machinery that lets me compile simple
instructions to hex.

With this dirty glue code I was able to compile the instructions one by one:


    ::python
    from registers.RV32I import *
    from InstructionSets.RV64I import *

    Regs = RegistersRV32I()

    def x(registerName):
        return Regs.getPos(registerName)

    def compile(instruction):
        hexrepr = hex(instruction.compile().value)
        hexval = hexrepr[2:]
        if len(hexval) < 8:
            hexval = "0" * (8 - len(hexval)) + hexval

        final = ""
        for i in range(0,8,2):
            final += hexval[i:i+2] + " "
        final = final.rstrip().upper()
        return " ".join(reversed(final.split(" ")))


I just needed to open a python shell and write something like:

    ::python
    compile( addi(x("a0"), x("a1"), 12) )

And that would compile that instruction for me, giving me the output in a
beautiful hexadecimal format.

    ::hex
    13 85 C5 00

Not the best UX but usable enough for a small file like this.

##### The addresses

The addresses are still something to solve.

I'm an idiot so I counted the instructions by hand and then realized I had to
expand some pseudoinstructions I forgot, so all the branch instructions were
broken. *Yes, I'm like that*.

Try to be smarter than I am. Use this trick:

Leave all the instructions that use addresses set to a wrong address, like 0 or
something, until you converted the whole file. Once you have that, resolve the
addresses. That way you'll make sure every pseudoinstruction is expanded and
you'll be able to use tools that will help you to choose addresses correctly.

The trick I used was to add the ELF header, compile the file and then inspect
the resulting binary.

For the compilation there are two choices: we can use the assembly file we
wrote previously simply assembling it to a binary and use it as the hex0
assembler; or we can use the high level C prototype that Stage0 provides. In
any case, they have to give the same result.

I still don't know why `objdump` is unable to process the binaries of the hex0
files but GDB is able to do it so... Launch a GDB as [I explained in the
previous post](https://ekaitz.elenq.tech/lightening.html) and disassemble the
whole file[^where].

It'll look like this:

    ::asm
    0x0000000000600078:	ld	a0,0(sp)
    0x000000000060007c:	ld	a1,8(sp)
    0x0000000000600080:	ld	a2,16(sp)
    0x0000000000600084:	li	s4,0
    0x0000000000600088:	li	s5,0
    0x000000000060008c:	li	a7,56
    0x0000000000600090:	li	a0,-100
    0x0000000000600094:	mv	a1,a2
    0x0000000000600098:	li	a2,0
    0x000000000060009c:	ecall
    ...

With that dissasembled result, we can literally make some math with the
addresses and fix all the instructions. We just need to substract the target
address from the current address in the branches, so we get the offset.

> NOTE: Be careful with the `la` pseudoinstruction (`auipc + addi`). The base
> address here is the one of the `auipc`, not the one of the `addi`.

##### The ELF header

If we don't know how to make the ELF header, we can't make the previous step,
so better if we mention something about it.

Other files of the project are hex0 files too, so they also have a heavily
commented ELF header we can use as a reference.  Also, [wikipedia has a great
explanation of it](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format#File_header).

The main point we need to change is the `e_machine` field. We need to set it to
`0xF3`, indicating RISC-V. Also we need to make sure the flag of 64 bit system
has to be set for RV64 and remember to check the endianness.

> NOTE: Big Endian it's the most natural way to write the file by hand. If you
> want to go for Little Endian this might get weird to write. The python script
> above uses Little Endian, watch the `reversed` call on it.


##### Debugging

Once you have everything ready you need to make sure it's doing what it's
supposed to.

My first working program was failing to use an output file.  Someone in the
`#bootstrappable` IRC channel (I'm sorry, I can't remember who was) told me to
`strace` the program to see what was going on, and with that and some debugging
with GDB's `layout asm`, I was able to figure out one instruction was using a
wrong register.

These tools are important because the all the process is done by hand so there
are many chances to screw up somewhere.

`strace` is extremely handy in this specific program because most of the
functionality it has is based on system calls. If you clean the output
correctly you can see everything the program does accurately.

---

<div style="
    text-align: center;
    font-size: smaller;
    padding-left: 3em;
    padding-right: 3em;
    padding-top: 1em;
    padding-bottom: 1em;
    border-top: 1px solid var(--border-color);
    border-bottom: 1px solid var(--border-color)">
Remember you can hire <a href="https://elenq.tech">ElenQ
Technology</a> to help you with your research, development or training. <br/>
If you want to encourage my free software work you can support me on <a
href="https://liberapay.com/ekaitz">Liberapay</a>.
</div>

---

### Final thoughts

This contribution have been a lot of fun. It let me understand a little bit
more about the ecosystem around the full-source bootstrap, which is kinda
complex and includes some other stuff I didn't even mention.

I learned a lot from this, now I have a deeper understanding of the instruction
formats on RISC-V and I learned some cool GDB tricks that are always useful.

Stage0 has a really interesting approach for auditability that's worth thinking
about. They build everything from a commented binary file (that's basically
what hex0 is) that acts like a seed, so we can audit everything, including the
very first step. The solution of having the contents of the ELF file directly
written in hexadecimal is enough to ensure we can certify the contents are what
we expect, and having every instruction commented with its assembly counterpart
gives us the chance to go to the ISA and check if that's actually what it is
supposed to be. Perfectly auditable.

Using this first step as a building block for anything else ensures that we
never need to rely on a binary file we can't know where does it come from.

Really interesting stuff.

Now we have this very first step ported to an open instruction set, what also
opens the door to the auditability from the hardware perspective. Now we can
start thinking about having an auditable software stack in a device we designed
ourselves, so we can audit it too. This is huge.

Now, we need to keep pushing in this direction, porting all the rest of the
steps of Stage0, Mes and many other projects, if we want to reach a full RISC-V
support. This is just one small step in that direction.

Hey! I almost forgot! And thanks to this I had the chance to work a little bit
more on my assembly interpreter, and recover it from the darkness. That's also
great. Isn't it?

> Well, so we learned some things today, but the most important is that all the
> stupid things we do, all the random projects we work on, all the experiences
> we have in life are not just a *waste of time*. They may appear to be useful
> in the future, but you don't know when...  
> What is sure here is if you don't stay creative and active, you'll never
> have any experience to learn from.


[^1]: Not really, in fact, I use my experience as a vehicle to introduce you to
  great projects and interesting pieces of knowledge. But ssssh, don't tell
  anyone.

[^contact]: Contact me, seriously. I have my [contact info here](https://ekaitz.elenq.tech/pages/about.html)

[^where]: If you want to know where to start to disassemble, you can just ask
  `where` to GDB.