summaryrefslogtreecommitdiff
path: root/content
diff options
context:
space:
mode:
Diffstat (limited to 'content')
-rw-r--r--content/hex0.md450
1 files changed, 450 insertions, 0 deletions
diff --git a/content/hex0.md b/content/hex0.md
new file mode 100644
index 0000000..1930885
--- /dev/null
+++ b/content/hex0.md
@@ -0,0 +1,450 @@
+Title: RISC-V Adventures II: hex0
+Date: 2021-06-08
+Category:
+Tags:
+Slug: hex0
+Lang: en
+Summary:
+ A love story about trust, machine code, hexadecimal notation and weird
+ instruction formats, with an epic unexpected solution coming back from the
+ afterlife.
+
+[Stage0](https://github.com/oriansj/stage0) is a crazy project that is pretty
+well aligned with our vision of trust, bootstrappable software and whatnot.
+
+During the last two weeks we have been working on the port of Stage0 to
+RISC-V, providing the very first step of the process, so we came here to talk
+about it, including a fantastic software necromancy moment you are going to
+enjoy.
+
+### The origin of the times
+
+Once upon a time, software was written in machine code. Directly expressing the
+machine instructions by the hands of the *programmers*. That was long, long,
+time ago.
+
+One day some programmer decided to write a translator that mapped that machine
+code to something more human readable and created what we call *assembly
+language* today. It gained popularity and programmers decided to add more and
+more functionalities to the *assembly language* until the point that what they
+created was not a one-to-one mapping with machine code anymore.
+
+That's how the first *programming languages* were born.
+
+Their power was so immense that programmers decided to rewrite all their tools
+using the new *programming languages*, they even wrote newer *programming
+languages* with them.
+
+But power corrupts the mind of the fool. Blinded by the power of *programming
+languages*, most of the programmers forgot the origin of the times, and
+forgetting the history is always a mistake.
+
+*Epic music starts...*
+
+
+### The problem
+
+> Warning: I oversimplified during the beginning of the post, but now... Oh
+> boy! I'm going to flatten this shit.
+
+Well, we need auditable software. I don't think anyone can deny that fact.
+
+But what does "auditable software" mean? Isn't free software enough?
+
+It is true that the best way to audit stuff is to read the code of the
+programs. It's the classic way we had to know if a program is doing what we
+want it to. But, how can you be really sure the code you are reading is the one
+that is shipping with your program?
+
+You can't! In general it's impossible to know. There are many reasons, but I
+will oversimplify and give you just some thoughts and let the people from
+[bootstrappable](http://bootstrappable.org/) do the dirty job.
+
+1. The compilation process is not reproducible, so the same source can result
+ in different binaries. You can't just compare different binaries to make
+ sure your compiler is compiled correctly.
+
+2. We have no way to solve the chicken-egg problem. The recipe to build the
+ compiler version X is to get the sources of the compiler version X and
+ compile them with the compiler version X-1. But how do you get the compiler
+ version X-1? Rinse and repeat.
+ Also... Where's the first version of your compiler? Does it run in modern
+ machines with modern operating systems?
+
+3. As there's no real way to get your compilers compiled by yourself, there's
+ no real way to be sure that the compilers are emitting the code they are
+ supposed to. You have to [**trust
+ them**](https://www.cs.umass.edu/~emery/classes/cmpsci691st/readings/Sec/Reflections-on-Trusting-Trust.pdf),
+ and you can't audit what you need to trust.
+
+The first point is where projects like Nix and Guix have sense: they try to
+create reproducible stuff. Not only for compilation processes but also for
+scientific studies (that have to be reproduced by other people, because...
+that's how science works, isn't it?) and other things. Being able to create
+identical environments where you can ensure that their (compilation) output is
+going to be identical is extremely important, but I'll leave that for now.
+
+The second and third point are two different problems but they result in the
+same: software distributions ship huge binary blobs (hundreds of megabytes) as
+an starting point (bash, gcc...) so the users have no chance to check if those
+binaries are corrupt.
+
+[GNU Mes](https://www.gnu.org/software/mes/) is a Scheme interpreter and C
+compiler that was designed to reduce the size of the binaries you need to ship
+with your distro. Mes has successfully reduced the size of the binaries that
+need to be shipped with distros like Guix, but the project is more ambitious
+that that.
+
+### Full-source bootstrap
+
+[Stage-0](https://savannah.nongnu.org/projects/stage0/) is a project that is
+tackling the same "trusting trust" problem but from the opposite perspective,
+starting in the low-level, rather than the high-level approach that GNU Mes
+uses.
+
+Both projects work together to provide a greater goal: the full-source
+bootstrap. The whole bootstrap is started from source, with no binaries
+involved, so the distros don't need to ship binaries anymore.
+
+But how is that?
+
+### Hex0
+
+Stage0 starts in the low level, the lowest possible, and builds more complex
+programs from there, step by step.
+
+The first step, Hex0, is a self-hosting "assembler". I quote the word assembler
+there because I think it's a very strong word for this: it's an ELF file
+written in hexadecimal, with extra comments.
+
+Hex0 is able to compile itself to a binary ELF file, converting the Hexadecimal
+values to the binary values and stripping the comments.
+
+We still have to compile the first Hex0 with something but that's not as
+difficult as compiling the first GCC because basically Hex0 can be compiled by
+very simple programs or even by hand, because it contains literally what is
+going to be written in the final ELF.
+
+The comments on the Hex0 files describe the instructions on each of the lines
+of the ELF file so the resulting files can be audited, instruction by
+instruction, with the manual of the ISA as a reference.
+
+This starting point is more than enough to build on top of. We just need to add
+more functionalities to the next steps: labels, constants... Until we are able
+to compile a simple C compiler, Mes or anything.
+
+It's a clever solution for a crazy problem.
+
+### Hex0 in RISC-V: my experience
+
+So this is my blog and I come here to talk about myself![^1]
+
+Some weeks ago I had the chance to make that first step, Hex0, for RISC-V (64
+bit). You can take a look to [the code here](https://github.com/oriansj/bootstrap-seeds/pull/2/files).
+
+There you can see I added three files: the assembly file, the hex0 file and the
+binary of the compiled hex0. They are basically the same thing, but they are
+included for readability.
+
+#### The assembly
+
+The first step for me was to write the assembly file. It's easy once you know
+how to make system calls in POSIX.
+
+**POSIX system calls in RISC-V** are pretty easy:
+
+- Load the arguments for the call in registers `a0`, `a1`...
+- Load the *syscall number* in `a7`
+- Run `ecall`
+
+The result of the system call comes in `a0`.
+
+**Input arguments** are also important, because we need to be able to tell to
+Hex0 which is the file we want to compile, and where to put its output.
+
+That's pretty easy, input arguments are inserted in the stack, so we can load
+them by *pop*-ing them. As in any C program, the first element we get is the
+amount of arguments and the rest of them are the arguments themselves.
+
+Putting all together, if you take a look to the first block of the program:
+
+ ::asm
+ _start:
+ ld a0, 0(sp) # Get number of the args
+ ld a1, 8(sp) # Get program name
+ ld a2, 16(sp) # Input file name
+
+ ...
+
+ # Open input file and store FD in s2
+ li a7, 56 # sys_openat
+ li a0, -100 # AT_FDCWD
+ mv a1, a2 # input file
+ li a2, 0 # read only
+ ecall
+ mv s2, a0 # Save fd in for later
+
+In the first chunk we are reading the stack, element by element, and in the
+second we are opening the input file, using the filename we just obtained from
+the stack.
+
+Simple.
+
+As a note, I'd like to remind you to finish your program, because if you don't
+it will continue to execute the memory after it and it'll explode in your face:
+
+ ::asm
+ terminate:
+ # Terminate program with 0 return code
+ li a7, 93 # sys_exit
+ li a0, 0 # Return code 0
+ ecall
+
+This tells the OS to finish the execution.
+
+The internals of the assembly file are simple, I won't explain them in detail.
+It basically iterates character by character, removing the comments and
+converting the hex value couples to a byte.
+
+Read it and tell me if you need help understanding it![^contact]
+
+#### The conversion to the Hex0
+
+Hex0, as I said, is an ELF file, written in hexadecimal, so we need to compile
+our assembly file to binary and represent each of the instructions in
+hexadecimal. And we need to resolve all the labels to final addresses.
+
+There's no easy way to do it. I started doing it by hand, reading the RISC-V
+spec and converting the instructions one by one. But I tackled several
+difficulties doing that.
+
+Pseudoinstructions are expanded to more than one instruction so we need to be
+careful in the comments and explain that correctly. Also, we need to resolve
+the addresses accordingly. For example:
+
+ ::asm
+ la a1, buffer
+
+This is a pseudoinstruction, and we need to resolve it to:
+
+ ::asm
+ auipc a1, 0
+ addi a1, a1, $OFFSET
+
+Where `$OFFSET` is the offset from that instruction to the label `buffer`.
+
+These kind of expansions change our perception of the amount of instructions we
+have and we have to be extremely careful. I didn't even mention the case where
+the offset is very large! That's another story (thankfully we don't have to
+deal with yet).
+
+Once the pseudoinstructions are expanded we need to convert them to the hex
+value, and I swear it's the most boring task I ever made in my life. Basically
+because the RISC-V instructions are not easy to map to their binary (for
+reasons related with the hardware implementation).
+
+##### The eagles are coming!
+
+But I had a trick, a deus ex machina that would safe my life. During the last
+months I've been randomly working on a Scheme compiler for RISC-V assembly and
+that made me start making a [RISC-V assembly interpreter and compiler in
+python](http://git.elenq.tech/pysc-v/). It's still an early WIP, and was
+almost abandoned, but it has the basic machinery that lets me compile simple
+instructions to hex.
+
+With this dirty glue code I was able to compile the instructions one by one:
+
+
+ ::python
+ from registers.RV32I import *
+ from InstructionSets.RV64I import *
+
+ Regs = RegistersRV32I()
+
+ def x(registerName):
+ return Regs.getPos(registerName)
+
+ def compile(instruction):
+ hexrepr = hex(instruction.compile().value)
+ hexval = hexrepr[2:]
+ if len(hexval) < 8:
+ hexval = "0" * (8 - len(hexval)) + hexval
+
+ final = ""
+ for i in range(0,8,2):
+ final += hexval[i:i+2] + " "
+ final = final.rstrip().upper()
+ return " ".join(reversed(final.split(" ")))
+
+
+I just needed to open a python shell and write something like:
+
+ ::python
+ compile( addi(x("a0"), x("a1"), 12) )
+
+And that would compile that instruction for me, giving me the output in a
+beautiful hexadecimal format.
+
+ ::hex
+ 13 85 C5 00
+
+Not the best UX but usable enough for a small file like this.
+
+##### The addresses
+
+The addresses are still something to solve.
+
+I'm an idiot so I counted the instructions by hand and then realized I had to
+expand some pseudoinstructions I forgot, so all the branch instructions were
+broken. *Yes, I'm like that*.
+
+Try to be smarter than I am. Use this trick:
+
+Leave all the instructions that use addresses set to a wrong address, like 0 or
+something, until you converted the whole file. Once you have that, resolve the
+addresses. That way you'll make sure every pseudoinstruction is expanded and
+you'll be able to use tools that will help you to choose addresses correctly.
+
+The trick I used was to add the ELF header, compile the file and then inspect
+the resulting binary.
+
+For the compilation there are two choices: we can use the assembly file we
+wrote previously simply assembling it to a binary and use it as the hex0
+assembler; or we can use the high level C prototype that Stage0 provides. In
+any case, they have to give the same result.
+
+I still don't know why `objdump` is unable to process the binaries of the hex0
+files but GDB is able to do it so... Launch a GDB as [I explained in the
+previous post](https://ekaitz.elenq.tech/lightening.html) and disassemble the
+whole file[^where].
+
+It'll look like this:
+
+ ::asm
+ 0x0000000000600078: ld a0,0(sp)
+ 0x000000000060007c: ld a1,8(sp)
+ 0x0000000000600080: ld a2,16(sp)
+ 0x0000000000600084: li s4,0
+ 0x0000000000600088: li s5,0
+ 0x000000000060008c: li a7,56
+ 0x0000000000600090: li a0,-100
+ 0x0000000000600094: mv a1,a2
+ 0x0000000000600098: li a2,0
+ 0x000000000060009c: ecall
+ ...
+
+With that dissasembled result, we can literally make some math with the
+addresses and fix all the instructions. We just need to substract the target
+address from the current address in the branches, so we get the offset.
+
+> NOTE: Be careful with the `la` pseudoinstruction (`auipc + addi`). The base
+> address here is the one of the `auipc`, not the one of the `addi`.
+
+##### The ELF header
+
+If we don't know how to make the ELF header, we can't make the previous step,
+so better if we mention something about it.
+
+Other files of the project are hex0 files too, so they also have a heavily
+commented ELF header we can use as a reference. Also, [wikipedia has a great
+explanation of it](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format#File_header).
+
+The main point we need to change is the `e_machine` field. We need to set it to
+`0xF3`, indicating RISC-V. Also we need to make sure the flag of 64 bit system
+has to be set for RV64 and remember to check the endianness.
+
+> NOTE: Big Endian it's the most natural way to write the file by hand. If you
+> want to go for Little Endian this might get weird to write. The python script
+> above uses Little Endian, watch the `reversed` call on it.
+
+
+##### Debugging
+
+Once you have everything ready you need to make sure it's doing what it's
+supposed to.
+
+My first working program was failing to use an output file. Someone in the
+`#bootstrappable` IRC channel (I'm sorry, I can't remember who was) told me to
+`strace` the program to see what was going on, and with that and some debugging
+with GDB's `layout asm`, I was able to figure out one instruction was using a
+wrong register.
+
+These tools are important because the all the process is done by hand so there
+are many chances to screw up somewhere.
+
+`strace` is extremely handy in this specific program because most of the
+functionality it has is based on system calls. If you clean the output
+correctly you can see everything the program does accurately.
+
+---
+
+<div style="
+ text-align: center;
+ font-size: smaller;
+ padding-left: 3em;
+ padding-right: 3em;
+ padding-top: 1em;
+ padding-bottom: 1em;
+ border-top: 1px solid var(--border-color);
+ border-bottom: 1px solid var(--border-color)">
+Remember you can hire <a href="https://elenq.tech">ElenQ
+Technology</a> to help you with your research, development or training. <br/>
+If you want to encourage my free software work you can support me on <a
+href="https://buymeacoffee.com/ekaitz">Buy Me a Coffee</a> or on <a
+href="https://liberapay.com/ekaitz">Liberapay</a>.
+</div>
+
+---
+
+### Final thoughts
+
+This contribution have been a lot of fun. It let me understand a little bit
+more about the ecosystem around the full-source bootstrap, which is kinda
+complex and includes some other stuff I didn't even mention.
+
+I learned a lot from this, now I have a deeper understanding of the instruction
+formats on RISC-V and I learned some cool GDB tricks that are always useful.
+
+Stage0 has a really interesting approach for auditability that's worth thinking
+about. They build everything from a commented binary file (that's basically
+what hex0 is) that acts like a seed, so we can audit everything, including the
+very first step. The solution of having the contents of the ELF file directly
+written in hexadecimal is enough to ensure we can certify the contents are what
+we expect, and having every instruction commented with its assembly counterpart
+gives us the chance to go to the ISA and check if that's actually what it is
+supposed to be. Perfectly auditable.
+
+Using this first step as a building block for anything else ensures that we
+never need to rely on a binary file we can't know where does it come from.
+
+Really interesting stuff.
+
+Now we have this very first step ported to an open instruction set, what also
+opens the door to the auditability from the hardware perspective. Now we can
+start thinking about having an auditable software stack in a device we designed
+ourselves, so we can audit it too. This is huge.
+
+Now, we need to keep pushing in this direction, porting all the rest of the
+steps of Stage0, Mes and many other projects, if we want to reach a full RISC-V
+support. This is just one small step in that direction.
+
+Hey! I almost forgot! And thanks to this I had the chance to work a little bit
+more on my assembly interpreter, and recover it from the darkness. That's also
+great. Isn't it?
+
+> Well, so we learned some things today, but the most important is that all the
+> stupid things we do, all the random projects we work on, all the experiences
+> we have in life are not just a *waste of time*. They may appear to be useful
+> in the future, but you don't know when...
+> What is sure here is if you don't stay creative and active, you'll never
+> have any experience to learn from.
+
+
+[^1]: Not really, in fact, I use my experience as a vehicle to introduce you to
+ great projects and interesting pieces of knowledge. But ssssh, don't tell
+ anyone.
+
+[^contact]: Contact me, seriously. I have my [contact info here](https://ekaitz.elenq.tech/pages/about.html)
+
+[^where]: If you want to know where to start to disassemble, you can just ask
+ `where` to GDB.