From 0caec2aaba3446b4b23b91f2f07318c43359c196 Mon Sep 17 00:00:00 2001 From: Ekaitz Zarraga Date: Wed, 9 Jun 2021 16:55:03 +0200 Subject: Add post of hex0 to RISCV --- content/hex0.md | 450 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 450 insertions(+) create mode 100644 content/hex0.md diff --git a/content/hex0.md b/content/hex0.md new file mode 100644 index 0000000..1930885 --- /dev/null +++ b/content/hex0.md @@ -0,0 +1,450 @@ +Title: RISC-V Adventures II: hex0 +Date: 2021-06-08 +Category: +Tags: +Slug: hex0 +Lang: en +Summary: + A love story about trust, machine code, hexadecimal notation and weird + instruction formats, with an epic unexpected solution coming back from the + afterlife. + +[Stage0](https://github.com/oriansj/stage0) is a crazy project that is pretty +well aligned with our vision of trust, bootstrappable software and whatnot. + +During the last two weeks we have been working on the port of Stage0 to +RISC-V, providing the very first step of the process, so we came here to talk +about it, including a fantastic software necromancy moment you are going to +enjoy. + +### The origin of the times + +Once upon a time, software was written in machine code. Directly expressing the +machine instructions by the hands of the *programmers*. That was long, long, +time ago. + +One day some programmer decided to write a translator that mapped that machine +code to something more human readable and created what we call *assembly +language* today. It gained popularity and programmers decided to add more and +more functionalities to the *assembly language* until the point that what they +created was not a one-to-one mapping with machine code anymore. + +That's how the first *programming languages* were born. + +Their power was so immense that programmers decided to rewrite all their tools +using the new *programming languages*, they even wrote newer *programming +languages* with them. + +But power corrupts the mind of the fool. Blinded by the power of *programming +languages*, most of the programmers forgot the origin of the times, and +forgetting the history is always a mistake. + +*Epic music starts...* + + +### The problem + +> Warning: I oversimplified during the beginning of the post, but now... Oh +> boy! I'm going to flatten this shit. + +Well, we need auditable software. I don't think anyone can deny that fact. + +But what does "auditable software" mean? Isn't free software enough? + +It is true that the best way to audit stuff is to read the code of the +programs. It's the classic way we had to know if a program is doing what we +want it to. But, how can you be really sure the code you are reading is the one +that is shipping with your program? + +You can't! In general it's impossible to know. There are many reasons, but I +will oversimplify and give you just some thoughts and let the people from +[bootstrappable](http://bootstrappable.org/) do the dirty job. + +1. The compilation process is not reproducible, so the same source can result + in different binaries. You can't just compare different binaries to make + sure your compiler is compiled correctly. + +2. We have no way to solve the chicken-egg problem. The recipe to build the + compiler version X is to get the sources of the compiler version X and + compile them with the compiler version X-1. But how do you get the compiler + version X-1? Rinse and repeat. + Also... Where's the first version of your compiler? Does it run in modern + machines with modern operating systems? + +3. As there's no real way to get your compilers compiled by yourself, there's + no real way to be sure that the compilers are emitting the code they are + supposed to. You have to [**trust + them**](https://www.cs.umass.edu/~emery/classes/cmpsci691st/readings/Sec/Reflections-on-Trusting-Trust.pdf), + and you can't audit what you need to trust. + +The first point is where projects like Nix and Guix have sense: they try to +create reproducible stuff. Not only for compilation processes but also for +scientific studies (that have to be reproduced by other people, because... +that's how science works, isn't it?) and other things. Being able to create +identical environments where you can ensure that their (compilation) output is +going to be identical is extremely important, but I'll leave that for now. + +The second and third point are two different problems but they result in the +same: software distributions ship huge binary blobs (hundreds of megabytes) as +an starting point (bash, gcc...) so the users have no chance to check if those +binaries are corrupt. + +[GNU Mes](https://www.gnu.org/software/mes/) is a Scheme interpreter and C +compiler that was designed to reduce the size of the binaries you need to ship +with your distro. Mes has successfully reduced the size of the binaries that +need to be shipped with distros like Guix, but the project is more ambitious +that that. + +### Full-source bootstrap + +[Stage-0](https://savannah.nongnu.org/projects/stage0/) is a project that is +tackling the same "trusting trust" problem but from the opposite perspective, +starting in the low-level, rather than the high-level approach that GNU Mes +uses. + +Both projects work together to provide a greater goal: the full-source +bootstrap. The whole bootstrap is started from source, with no binaries +involved, so the distros don't need to ship binaries anymore. + +But how is that? + +### Hex0 + +Stage0 starts in the low level, the lowest possible, and builds more complex +programs from there, step by step. + +The first step, Hex0, is a self-hosting "assembler". I quote the word assembler +there because I think it's a very strong word for this: it's an ELF file +written in hexadecimal, with extra comments. + +Hex0 is able to compile itself to a binary ELF file, converting the Hexadecimal +values to the binary values and stripping the comments. + +We still have to compile the first Hex0 with something but that's not as +difficult as compiling the first GCC because basically Hex0 can be compiled by +very simple programs or even by hand, because it contains literally what is +going to be written in the final ELF. + +The comments on the Hex0 files describe the instructions on each of the lines +of the ELF file so the resulting files can be audited, instruction by +instruction, with the manual of the ISA as a reference. + +This starting point is more than enough to build on top of. We just need to add +more functionalities to the next steps: labels, constants... Until we are able +to compile a simple C compiler, Mes or anything. + +It's a clever solution for a crazy problem. + +### Hex0 in RISC-V: my experience + +So this is my blog and I come here to talk about myself![^1] + +Some weeks ago I had the chance to make that first step, Hex0, for RISC-V (64 +bit). You can take a look to [the code here](https://github.com/oriansj/bootstrap-seeds/pull/2/files). + +There you can see I added three files: the assembly file, the hex0 file and the +binary of the compiled hex0. They are basically the same thing, but they are +included for readability. + +#### The assembly + +The first step for me was to write the assembly file. It's easy once you know +how to make system calls in POSIX. + +**POSIX system calls in RISC-V** are pretty easy: + +- Load the arguments for the call in registers `a0`, `a1`... +- Load the *syscall number* in `a7` +- Run `ecall` + +The result of the system call comes in `a0`. + +**Input arguments** are also important, because we need to be able to tell to +Hex0 which is the file we want to compile, and where to put its output. + +That's pretty easy, input arguments are inserted in the stack, so we can load +them by *pop*-ing them. As in any C program, the first element we get is the +amount of arguments and the rest of them are the arguments themselves. + +Putting all together, if you take a look to the first block of the program: + + ::asm + _start: + ld a0, 0(sp) # Get number of the args + ld a1, 8(sp) # Get program name + ld a2, 16(sp) # Input file name + + ... + + # Open input file and store FD in s2 + li a7, 56 # sys_openat + li a0, -100 # AT_FDCWD + mv a1, a2 # input file + li a2, 0 # read only + ecall + mv s2, a0 # Save fd in for later + +In the first chunk we are reading the stack, element by element, and in the +second we are opening the input file, using the filename we just obtained from +the stack. + +Simple. + +As a note, I'd like to remind you to finish your program, because if you don't +it will continue to execute the memory after it and it'll explode in your face: + + ::asm + terminate: + # Terminate program with 0 return code + li a7, 93 # sys_exit + li a0, 0 # Return code 0 + ecall + +This tells the OS to finish the execution. + +The internals of the assembly file are simple, I won't explain them in detail. +It basically iterates character by character, removing the comments and +converting the hex value couples to a byte. + +Read it and tell me if you need help understanding it![^contact] + +#### The conversion to the Hex0 + +Hex0, as I said, is an ELF file, written in hexadecimal, so we need to compile +our assembly file to binary and represent each of the instructions in +hexadecimal. And we need to resolve all the labels to final addresses. + +There's no easy way to do it. I started doing it by hand, reading the RISC-V +spec and converting the instructions one by one. But I tackled several +difficulties doing that. + +Pseudoinstructions are expanded to more than one instruction so we need to be +careful in the comments and explain that correctly. Also, we need to resolve +the addresses accordingly. For example: + + ::asm + la a1, buffer + +This is a pseudoinstruction, and we need to resolve it to: + + ::asm + auipc a1, 0 + addi a1, a1, $OFFSET + +Where `$OFFSET` is the offset from that instruction to the label `buffer`. + +These kind of expansions change our perception of the amount of instructions we +have and we have to be extremely careful. I didn't even mention the case where +the offset is very large! That's another story (thankfully we don't have to +deal with yet). + +Once the pseudoinstructions are expanded we need to convert them to the hex +value, and I swear it's the most boring task I ever made in my life. Basically +because the RISC-V instructions are not easy to map to their binary (for +reasons related with the hardware implementation). + +##### The eagles are coming! + +But I had a trick, a deus ex machina that would safe my life. During the last +months I've been randomly working on a Scheme compiler for RISC-V assembly and +that made me start making a [RISC-V assembly interpreter and compiler in +python](http://git.elenq.tech/pysc-v/). It's still an early WIP, and was +almost abandoned, but it has the basic machinery that lets me compile simple +instructions to hex. + +With this dirty glue code I was able to compile the instructions one by one: + + + ::python + from registers.RV32I import * + from InstructionSets.RV64I import * + + Regs = RegistersRV32I() + + def x(registerName): + return Regs.getPos(registerName) + + def compile(instruction): + hexrepr = hex(instruction.compile().value) + hexval = hexrepr[2:] + if len(hexval) < 8: + hexval = "0" * (8 - len(hexval)) + hexval + + final = "" + for i in range(0,8,2): + final += hexval[i:i+2] + " " + final = final.rstrip().upper() + return " ".join(reversed(final.split(" "))) + + +I just needed to open a python shell and write something like: + + ::python + compile( addi(x("a0"), x("a1"), 12) ) + +And that would compile that instruction for me, giving me the output in a +beautiful hexadecimal format. + + ::hex + 13 85 C5 00 + +Not the best UX but usable enough for a small file like this. + +##### The addresses + +The addresses are still something to solve. + +I'm an idiot so I counted the instructions by hand and then realized I had to +expand some pseudoinstructions I forgot, so all the branch instructions were +broken. *Yes, I'm like that*. + +Try to be smarter than I am. Use this trick: + +Leave all the instructions that use addresses set to a wrong address, like 0 or +something, until you converted the whole file. Once you have that, resolve the +addresses. That way you'll make sure every pseudoinstruction is expanded and +you'll be able to use tools that will help you to choose addresses correctly. + +The trick I used was to add the ELF header, compile the file and then inspect +the resulting binary. + +For the compilation there are two choices: we can use the assembly file we +wrote previously simply assembling it to a binary and use it as the hex0 +assembler; or we can use the high level C prototype that Stage0 provides. In +any case, they have to give the same result. + +I still don't know why `objdump` is unable to process the binaries of the hex0 +files but GDB is able to do it so... Launch a GDB as [I explained in the +previous post](https://ekaitz.elenq.tech/lightening.html) and disassemble the +whole file[^where]. + +It'll look like this: + + ::asm + 0x0000000000600078: ld a0,0(sp) + 0x000000000060007c: ld a1,8(sp) + 0x0000000000600080: ld a2,16(sp) + 0x0000000000600084: li s4,0 + 0x0000000000600088: li s5,0 + 0x000000000060008c: li a7,56 + 0x0000000000600090: li a0,-100 + 0x0000000000600094: mv a1,a2 + 0x0000000000600098: li a2,0 + 0x000000000060009c: ecall + ... + +With that dissasembled result, we can literally make some math with the +addresses and fix all the instructions. We just need to substract the target +address from the current address in the branches, so we get the offset. + +> NOTE: Be careful with the `la` pseudoinstruction (`auipc + addi`). The base +> address here is the one of the `auipc`, not the one of the `addi`. + +##### The ELF header + +If we don't know how to make the ELF header, we can't make the previous step, +so better if we mention something about it. + +Other files of the project are hex0 files too, so they also have a heavily +commented ELF header we can use as a reference. Also, [wikipedia has a great +explanation of it](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format#File_header). + +The main point we need to change is the `e_machine` field. We need to set it to +`0xF3`, indicating RISC-V. Also we need to make sure the flag of 64 bit system +has to be set for RV64 and remember to check the endianness. + +> NOTE: Big Endian it's the most natural way to write the file by hand. If you +> want to go for Little Endian this might get weird to write. The python script +> above uses Little Endian, watch the `reversed` call on it. + + +##### Debugging + +Once you have everything ready you need to make sure it's doing what it's +supposed to. + +My first working program was failing to use an output file. Someone in the +`#bootstrappable` IRC channel (I'm sorry, I can't remember who was) told me to +`strace` the program to see what was going on, and with that and some debugging +with GDB's `layout asm`, I was able to figure out one instruction was using a +wrong register. + +These tools are important because the all the process is done by hand so there +are many chances to screw up somewhere. + +`strace` is extremely handy in this specific program because most of the +functionality it has is based on system calls. If you clean the output +correctly you can see everything the program does accurately. + +--- + +
+Remember you can hire ElenQ +Technology to help you with your research, development or training.
+If you want to encourage my free software work you can support me on Buy Me a Coffee or on Liberapay. +
+ +--- + +### Final thoughts + +This contribution have been a lot of fun. It let me understand a little bit +more about the ecosystem around the full-source bootstrap, which is kinda +complex and includes some other stuff I didn't even mention. + +I learned a lot from this, now I have a deeper understanding of the instruction +formats on RISC-V and I learned some cool GDB tricks that are always useful. + +Stage0 has a really interesting approach for auditability that's worth thinking +about. They build everything from a commented binary file (that's basically +what hex0 is) that acts like a seed, so we can audit everything, including the +very first step. The solution of having the contents of the ELF file directly +written in hexadecimal is enough to ensure we can certify the contents are what +we expect, and having every instruction commented with its assembly counterpart +gives us the chance to go to the ISA and check if that's actually what it is +supposed to be. Perfectly auditable. + +Using this first step as a building block for anything else ensures that we +never need to rely on a binary file we can't know where does it come from. + +Really interesting stuff. + +Now we have this very first step ported to an open instruction set, what also +opens the door to the auditability from the hardware perspective. Now we can +start thinking about having an auditable software stack in a device we designed +ourselves, so we can audit it too. This is huge. + +Now, we need to keep pushing in this direction, porting all the rest of the +steps of Stage0, Mes and many other projects, if we want to reach a full RISC-V +support. This is just one small step in that direction. + +Hey! I almost forgot! And thanks to this I had the chance to work a little bit +more on my assembly interpreter, and recover it from the darkness. That's also +great. Isn't it? + +> Well, so we learned some things today, but the most important is that all the +> stupid things we do, all the random projects we work on, all the experiences +> we have in life are not just a *waste of time*. They may appear to be useful +> in the future, but you don't know when... +> What is sure here is if you don't stay creative and active, you'll never +> have any experience to learn from. + + +[^1]: Not really, in fact, I use my experience as a vehicle to introduce you to + great projects and interesting pieces of knowledge. But ssssh, don't tell + anyone. + +[^contact]: Contact me, seriously. I have my [contact info here](https://ekaitz.elenq.tech/pages/about.html) + +[^where]: If you want to know where to start to disassemble, you can just ask + `where` to GDB. -- cgit v1.2.3