Title: RISC-V Adventures II: hex0 Date: 2021-06-08 Category: Tags: Slug: hex0 Lang: en Summary: A love story about trust, machine code, hexadecimal notation and weird instruction formats, with an epic unexpected solution coming back from the afterlife. [Stage0](https://github.com/oriansj/stage0) is a crazy project that is pretty well aligned with our vision of trust, bootstrappable software and whatnot. During the last two weeks we have been working on the port of Stage0 to RISC-V, providing the very first step of the process, so we came here to talk about it, including a fantastic software necromancy moment you are going to enjoy. ### The origin of the times Once upon a time, software was written in machine code. Directly expressing the machine instructions by the hands of the *programmers*. That was long, long, time ago. One day some programmer decided to write a translator that mapped that machine code to something more human readable and created what we call *assembly language* today. It gained popularity and programmers decided to add more and more functionalities to the *assembly language* until the point that what they created was not a one-to-one mapping with machine code anymore. That's how the first *programming languages* were born. Their power was so immense that programmers decided to rewrite all their tools using the new *programming languages*, they even wrote newer *programming languages* with them. But power corrupts the mind of the fool. Blinded by the power of *programming languages*, most of the programmers forgot the origin of the times, and forgetting the history is always a mistake. *Epic music starts...* ### The problem > Warning: I oversimplified during the beginning of the post, but now... Oh > boy! I'm going to flatten this shit. Well, we need auditable software. I don't think anyone can deny that fact. But what does "auditable software" mean? Isn't free software enough? It is true that the best way to audit stuff is to read the code of the programs. It's the classic way we had to know if a program is doing what we want it to. But, how can you be really sure the code you are reading is the one that is shipping with your program? You can't! In general it's impossible to know. There are many reasons, but I will oversimplify and give you just some thoughts and let the people from [bootstrappable](http://bootstrappable.org/) do the dirty job. 1. The compilation process is not reproducible, so the same source can result in different binaries. You can't just compare different binaries to make sure your compiler is compiled correctly. 2. We have no way to solve the chicken-egg problem. The recipe to build the compiler version X is to get the sources of the compiler version X and compile them with the compiler version X-1. But how do you get the compiler version X-1? Rinse and repeat. Also... Where's the first version of your compiler? Does it run in modern machines with modern operating systems? 3. As there's no real way to get your compilers compiled by yourself, there's no real way to be sure that the compilers are emitting the code they are supposed to. You have to [**trust them**](https://www.cs.umass.edu/~emery/classes/cmpsci691st/readings/Sec/Reflections-on-Trusting-Trust.pdf), and you can't audit what you need to trust. The first point is where projects like Nix and Guix have sense: they try to create reproducible stuff. Not only for compilation processes but also for scientific studies (that have to be reproduced by other people, because... that's how science works, isn't it?) and other things. Being able to create identical environments where you can ensure that their (compilation) output is going to be identical is extremely important, but I'll leave that for now. The second and third point are two different problems but they result in the same: software distributions ship huge binary blobs (hundreds of megabytes) as an starting point (bash, gcc...) so the users have no chance to check if those binaries are corrupt. [GNU Mes](https://www.gnu.org/software/mes/) is a Scheme interpreter and C compiler that was designed to reduce the size of the binaries you need to ship with your distro. Mes has successfully reduced the size of the binaries that need to be shipped with distros like Guix, but the project is more ambitious that that. ### Full-source bootstrap [Stage-0](https://savannah.nongnu.org/projects/stage0/) is a project that is tackling the same "trusting trust" problem but from the opposite perspective, starting in the low-level, rather than the high-level approach that GNU Mes uses. Both projects work together to provide a greater goal: the full-source bootstrap. The whole bootstrap is started from source, with no binaries involved, so the distros don't need to ship binaries anymore. But how is that? ### Hex0 Stage0 starts in the low level, the lowest possible, and builds more complex programs from there, step by step. The first step, Hex0, is a self-hosting "assembler". I quote the word assembler there because I think it's a very strong word for this: it's an ELF file written in hexadecimal, with extra comments. Hex0 is able to compile itself to a binary ELF file, converting the Hexadecimal values to the binary values and stripping the comments. We still have to compile the first Hex0 with something but that's not as difficult as compiling the first GCC because basically Hex0 can be compiled by very simple programs or even by hand, because it contains literally what is going to be written in the final ELF. The comments on the Hex0 files describe the instructions on each of the lines of the ELF file so the resulting files can be audited, instruction by instruction, with the manual of the ISA as a reference. This starting point is more than enough to build on top of. We just need to add more functionalities to the next steps: labels, constants... Until we are able to compile a simple C compiler, Mes or anything. It's a clever solution for a crazy problem. ### Hex0 in RISC-V: my experience So this is my blog and I come here to talk about myself![^1] Some weeks ago I had the chance to make that first step, Hex0, for RISC-V (64 bit). You can take a look to [the code here](https://github.com/oriansj/bootstrap-seeds/pull/2/files). There you can see I added three files: the assembly file, the hex0 file and the binary of the compiled hex0. They are basically the same thing, but they are included for readability. #### The assembly The first step for me was to write the assembly file. It's easy once you know how to make system calls in POSIX. **POSIX system calls in RISC-V** are pretty easy: - Load the arguments for the call in registers `a0`, `a1`... - Load the *syscall number* in `a7` - Run `ecall` The result of the system call comes in `a0`. **Input arguments** are also important, because we need to be able to tell to Hex0 which is the file we want to compile, and where to put its output. That's pretty easy, input arguments are inserted in the stack, so we can load them by *pop*-ing them. As in any C program, the first element we get is the amount of arguments and the rest of them are the arguments themselves. Putting all together, if you take a look to the first block of the program: ::asm _start: ld a0, 0(sp) # Get number of the args ld a1, 8(sp) # Get program name ld a2, 16(sp) # Input file name ... # Open input file and store FD in s2 li a7, 56 # sys_openat li a0, -100 # AT_FDCWD mv a1, a2 # input file li a2, 0 # read only ecall mv s2, a0 # Save fd in for later In the first chunk we are reading the stack, element by element, and in the second we are opening the input file, using the filename we just obtained from the stack. Simple. As a note, I'd like to remind you to finish your program, because if you don't it will continue to execute the memory after it and it'll explode in your face: ::asm terminate: # Terminate program with 0 return code li a7, 93 # sys_exit li a0, 0 # Return code 0 ecall This tells the OS to finish the execution. The internals of the assembly file are simple, I won't explain them in detail. It basically iterates character by character, removing the comments and converting the hex value couples to a byte. Read it and tell me if you need help understanding it![^contact] #### The conversion to the Hex0 Hex0, as I said, is an ELF file, written in hexadecimal, so we need to compile our assembly file to binary and represent each of the instructions in hexadecimal. And we need to resolve all the labels to final addresses. There's no easy way to do it. I started doing it by hand, reading the RISC-V spec and converting the instructions one by one. But I tackled several difficulties doing that. Pseudoinstructions are expanded to more than one instruction so we need to be careful in the comments and explain that correctly. Also, we need to resolve the addresses accordingly. For example: ::asm la a1, buffer This is a pseudoinstruction, and we need to resolve it to: ::asm auipc a1, 0 addi a1, a1, $OFFSET Where `$OFFSET` is the offset from that instruction to the label `buffer`. These kind of expansions change our perception of the amount of instructions we have and we have to be extremely careful. I didn't even mention the case where the offset is very large! That's another story (thankfully we don't have to deal with yet). Once the pseudoinstructions are expanded we need to convert them to the hex value, and I swear it's the most boring task I ever made in my life. Basically because the RISC-V instructions are not easy to map to their binary (for reasons related with the hardware implementation). ##### The eagles are coming! But I had a trick, a deus ex machina that would safe my life. During the last months I've been randomly working on a Scheme compiler for RISC-V assembly and that made me start making a [RISC-V assembly interpreter and compiler in python](http://git.elenq.tech/pysc-v/). It's still an early WIP, and was almost abandoned, but it has the basic machinery that lets me compile simple instructions to hex. With this dirty glue code I was able to compile the instructions one by one: ::python from registers.RV32I import * from InstructionSets.RV64I import * Regs = RegistersRV32I() def x(registerName): return Regs.getPos(registerName) def compile(instruction): hexrepr = hex(instruction.compile().value) hexval = hexrepr[2:] if len(hexval) < 8: hexval = "0" * (8 - len(hexval)) + hexval final = "" for i in range(0,8,2): final += hexval[i:i+2] + " " final = final.rstrip().upper() return " ".join(reversed(final.split(" "))) I just needed to open a python shell and write something like: ::python compile( addi(x("a0"), x("a1"), 12) ) And that would compile that instruction for me, giving me the output in a beautiful hexadecimal format. ::hex 13 85 C5 00 Not the best UX but usable enough for a small file like this. ##### The addresses The addresses are still something to solve. I'm an idiot so I counted the instructions by hand and then realized I had to expand some pseudoinstructions I forgot, so all the branch instructions were broken. *Yes, I'm like that*. Try to be smarter than I am. Use this trick: Leave all the instructions that use addresses set to a wrong address, like 0 or something, until you converted the whole file. Once you have that, resolve the addresses. That way you'll make sure every pseudoinstruction is expanded and you'll be able to use tools that will help you to choose addresses correctly. The trick I used was to add the ELF header, compile the file and then inspect the resulting binary. For the compilation there are two choices: we can use the assembly file we wrote previously simply assembling it to a binary and use it as the hex0 assembler; or we can use the high level C prototype that Stage0 provides. In any case, they have to give the same result. I still don't know why `objdump` is unable to process the binaries of the hex0 files but GDB is able to do it so... Launch a GDB as [I explained in the previous post](https://ekaitz.elenq.tech/lightening.html) and disassemble the whole file[^where]. It'll look like this: ::asm 0x0000000000600078: ld a0,0(sp) 0x000000000060007c: ld a1,8(sp) 0x0000000000600080: ld a2,16(sp) 0x0000000000600084: li s4,0 0x0000000000600088: li s5,0 0x000000000060008c: li a7,56 0x0000000000600090: li a0,-100 0x0000000000600094: mv a1,a2 0x0000000000600098: li a2,0 0x000000000060009c: ecall ... With that dissasembled result, we can literally make some math with the addresses and fix all the instructions. We just need to substract the target address from the current address in the branches, so we get the offset. > NOTE: Be careful with the `la` pseudoinstruction (`auipc + addi`). The base > address here is the one of the `auipc`, not the one of the `addi`. ##### The ELF header If we don't know how to make the ELF header, we can't make the previous step, so better if we mention something about it. Other files of the project are hex0 files too, so they also have a heavily commented ELF header we can use as a reference. Also, [wikipedia has a great explanation of it](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format#File_header). The main point we need to change is the `e_machine` field. We need to set it to `0xF3`, indicating RISC-V. Also we need to make sure the flag of 64 bit system has to be set for RV64 and remember to check the endianness. > NOTE: Big Endian it's the most natural way to write the file by hand. If you > want to go for Little Endian this might get weird to write. The python script > above uses Little Endian, watch the `reversed` call on it. ##### Debugging Once you have everything ready you need to make sure it's doing what it's supposed to. My first working program was failing to use an output file. Someone in the `#bootstrappable` IRC channel (I'm sorry, I can't remember who was) told me to `strace` the program to see what was going on, and with that and some debugging with GDB's `layout asm`, I was able to figure out one instruction was using a wrong register. These tools are important because the all the process is done by hand so there are many chances to screw up somewhere. `strace` is extremely handy in this specific program because most of the functionality it has is based on system calls. If you clean the output correctly you can see everything the program does accurately. ---