Title: Understanding GNU Mes' performance
Date: 2025-06-02
Category:
Tags: GNU Mes interpreter speedup
Slug: fasterMes1
Lang: en
Summary: How fast is GNU Mes' scheme interpreter really?

Before[^heh] starting our work on [the project][project] we should, first, make sure that our hypothesis is true and GNU Mes could be faster and, second, try to understand where we could improve the execution speed.

[^heh]: Hypothetically speaking, because the work has already started. But ssssh...

[project]: /tag/gnu-mes-interpreter-speedup.html

### The benchmarking tool

In order to check how fast GNU Mes is, I created a very simple benchmarking tool. The tool just runs GNU Mes under `time` (the command, not the shell built-in) and writes a report to a bunch of text files. Of course, `time` is not a great way to benchmark, but my hypothesis was that `mes` (remember, the Scheme interpreter in GNU Mes) could be way faster than it is, so, if that's true, `time` should be more than enough for the comparison.

This is done in the `simple.make` makefile, which has been revived to serve this development period. The `simple.make` makefile is important because normally GNU Mes is built from scratch, including `meslibc` (remember, the C standard library in GNU Mes), which takes a long time. The `simple.make` file builds just `mes`, the interpreter, with GCC or M2-Planet, so the iteration time during development is acceptable: it's almost instantaneous.

So, in summary, this:

``` bash
make -f simple.make benchmark
```

would run the benchmark for the default `mes`, the one built by GCC (which is, of course, built first if needed). There is also a `benchmark-m2` target for the `mes` built by M2-Planet.

At this very moment the benchmark runs three cases:

- `fib`: A simple Scheme program that finds the 35th element of the Fibonacci sequence.
- `mescc-hello`: A `mescc` call that compiles a simple hello-world program written in C.
- `mescc-mes`: A `mescc` call that compiles `mes`'s own source code.

These three cases cover very basic operations but tell us whether our changes are relevant to `mes`'s main job: compiling some C code. They also give us some interesting feedback about evaluation speed via the `fib` case, especially in a very recursive program. The `mescc-hello` case is a minimal test that shows us the startup speed of the `mescc` compiler.

However, that alone doesn't give us a lot of information, because we don't have anything to compare against. The Scheme interpreter in GNU Mes is compatible with Guile, meaning it could be replaced with Guile at any moment (maybe not the other way around). That makes Guile our ideal candidate for comparison.

#### The data

The following table shows the wall-time results of all the benchmarks, written in `Average ±StdDev` form. Each benchmark is run 10 times on my laptop (*i7-10510U*).

|               |      mes-gcc |       mes-m2 |       guile |      guile† |     guile†‡ |
|:--------------|-------------:|-------------:|------------:|------------:|------------:|
| Fib           |  21.39 ±0.66 |  49.13 ±0.11 |  0.27 ±0.00 |  3.57 ±0.27 |  8.67 ±0.38 |
| Mescc-hello   |   1.38 ±0.04 |   2.69 ±0.01 |  0.41 ±0.01 |  0.45 ±0.03 |  1.03 ±0.04 |
| Mescc-mes     | 103.39 ±3.03 | 214.88 ±0.34 | 17.03 ±0.30 | 18.70 ±0.95 | 27.34 ±1.09 |

†: With GUILE_AUTO_COMPILE=0

‡: With GUILE_JIT_THRESHOLD=-1
From left to right: _mes-gcc_ means `mes` built with default GCC (including optimizations), _mes-m2_ means `mes` built with M2-Planet, and _guile_ means running `guile` (`v3.0.9`) instead of `mes`.

The first conclusion is pretty obvious:

> Guile is way faster than Mes when compiling C code, even with all the
> optimizations disabled.

##### Unfair!

But that conclusion may look unfair to most of you, especially because Guile is a pretty large and powerful Scheme implementation, while Mes is a simple Scheme interpreter designed to be as minimal as possible. Chibi-Scheme is a very interesting small Scheme interpreter that happens to be somewhat slow compared with other major Scheme implementations. In terms of lines of code, `chibi-scheme` is around twice the size of `mes`, which is still a manageable size. Chibi-Scheme is not compatible with Mes the way Guile is, so we can only run the `fib` benchmark with it. The following table compares the best Mes scenario against Chibi-Scheme.

|                  |         Mes | Chibi-Scheme |
|:-----------------|------------:|-------------:|
| Lines of Code†   |        5838 |        13135 |
| Fib (s)‡         | 21.39 ±0.66 |   1.77 ±0.03 |

†: Only counts the lines in .c files

‡: Both built with GCC
That's quite an improvement for the size!

#### Conclusion from the benchmark

Before drawing any conclusion, we should make sure our assumption is true and the differences are large enough to be captured by the `time` command. Looking at the orders of magnitude we are dealing with here, I would say **the data we obtained is a valid reference point**, as the differences are large enough to be properly captured by `time` and our goal is a noticeable performance improvement.

From there, we understand the obvious: Guile is way faster than Mes, even with compilation disabled, and by a huge margin. Chibi-Scheme is also way faster, but not as fast as Guile with all optimizations enabled.

When it comes to the compiler used to build Mes: compiled with M2-Planet, `mes` is twice as slow as with GCC. This is a very relevant detail to keep in mind during development. GCC's optimizations account for a huge performance improvement in `mes`, and we cannot rely on them during the bootstrapping process.

### Profiling

Still, we are far from justifying a full rewrite of the evaluator, and even further from rewriting `mes` as a bytecode interpreter, as proposed in the project presentation. That's why I decided to run a profiler (again, the simplest one possible) to get some insights from the execution of the program.

I ran `gprof` on `mes` for both the `fib` and `mescc-mes` cases[^benchmark-useful] and I summarize the output of the flat profile below, right after a little reminder of what all the columns in the table mean. To clarify: `gprof` requires the program to be built by GCC with the `-pg` flag, and we don't have profilers for the program built with M2-Planet, but I think the insights obtained here can be extrapolated to `mes-m2` in most cases. You'll see why in a minute.

[^benchmark-useful]: See how the benchmark is useful?

###### Legend
% time
: the percentage of the total running time of the program used by this function.

cumulative seconds
: a running sum of the number of seconds accounted for by this function and those listed above it.

self seconds
: the number of seconds accounted for by this function alone. This is the major sort for this listing.

calls
: the number of times this function was invoked, if this function is profiled, else blank.

self ms/call
: the average number of milliseconds spent in this function per call, if this function is profiled, else blank.

total ms/call
: the average number of milliseconds spent in this function and its descendents per call, if this function is profiled, else blank.

name
: the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed.
###### Gprof flat-profile for `fib`

| %time | cumulative seconds | self seconds  | calls     | self ms/call | total ms/call | name                 |
|------:|-------------------:|--------------:|----------:|-------------:|--------------:|:---------------------|
| 20.06 | 4.01               | 4.01          | 37        | 108.38       | 416.96        | `eval_apply`         |
| 11.26 | 6.26               | 2.25          | 600202232 | 0.00         | 0.00          | `gc_push_frame`      |
| 7.10  | 7.68               | 1.42          | 600202232 | 0.00         | 0.00          | `gc_peek_frame`      |
| 6.83  | 9.05               | 1.37          | 481504722 | 0.00         | 0.00          | `make_cell`          |
| 6.03  | 10.25              | 1.21          | 600200793 | 0.00         | 0.00          | `push_cc`            |
| 5.48  | 11.35              | 1.10          | 300124024 | 0.00         | 0.00          | `struct_ref_`        |
| 4.70  | 12.29              | 0.94          |           |              |               | `_init`              |
| 3.70  | 13.03              | 0.74          | 269868352 | 0.00         | 0.00          | `make_value_cell`    |
| 2.35  | 13.50              | 0.47          | 600202232 | 0.00         | 0.00          | `gc_pop_frame`       |
| 2.15  | 13.93              | 0.43          | 406099172 | 0.00         | 0.00          | `cons`               |
| 2.15  | 14.36              | 0.43          | 74983231  | 0.00         | 0.00          | `apply_builtin`      |
| 2.00  | 14.76              | 0.40          | 300279513 | 0.00         | 0.00          | `cell_ref`           |
| 1.95  | 15.15              | 0.39          | 300124024 | 0.00         | 0.00          | `assert_struct`      |
| 1.85  | 15.52              | 0.37          | 135158343 | 0.00         | 0.00          | `length__`           |
| 1.60  | 15.84              | 0.32          | 29862855  | 0.00         | 0.00          | `minus`              |
| 1.38  | 16.11              | 0.28          | 300124024 | 0.00         | 0.00          | `assert_range`       |
| 1.38  | 16.39              | 0.28          | 75315259  | 0.00         | 0.00          | `assq`               |
| 1.18  | 16.62              | 0.24          | 75280752  | 0.00         | 0.00          | `lookup_binding`     |
| 1.08  | 16.84              | 0.22          | 75280689  | 0.00         | 0.00          | `make_binding_`      |
| 1.05  | 17.05              | 0.21          | 269866060 | 0.00         | 0.00          | `make_number`        |
| 1.05  | 17.26              | 0.21          | 173       | 1.21         | 1.21          | `macro_set_x`        |
| 1.00  | 17.46              | 0.20          | 135178741 | 0.00         | 0.00          | `gc_check`           |
| 0.90  | 17.64              | 0.18          | 30093052  | 0.00         | 0.00          | `pairlis`            |
| 0.85  | 17.81              | 0.17          | 105069896 | 0.00         | 0.00          | `check_formals`      |
| 0.83  | 17.97              | 0.17          | 29861090  | 0.00         | 0.00          | `less_p`             |
| ...   | ...                | ...           | ...       | ...          | ...           | ...                  |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 0.00          | `open_boot`          |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 3.09          | `read_boot`          |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 0.00          | `reader_read_octal`  |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 0.00          | `try_open_boot`      |

###### Gprof flat-profile for `mescc-mes`

| %time | cumulative seconds | self seconds  | calls      | self s/call | total s/call | name                 |
|------:|-------------------:|--------------:|-----------:|------------:|-------------:|:---------------------|
| 31.64 | 58.01              | 58.01         | 81         | 0.72        | 1.91         | `eval_apply`         |
| 7.53  | 71.82              | 13.81         | 428598051  | 0.00        | 0.00         | `assq`               |
| 6.11  | 83.03              | 11.21         | 2789709236 | 0.00        | 0.00         | `make_cell`          |
| 5.96  | 93.95              | 10.92         | 2109168010 | 0.00        | 0.00         | `gc_push_frame`      |
| 4.88  | 102.89             | 8.94          | 607787117  | 0.00        | 0.00         | `length__`           |
| 4.11  | 110.43             | 7.53          | 2109168010 | 0.00        | 0.00         | `gc_peek_frame`      |
| 3.61  | 117.04             | 6.61          | 891765473  | 0.00        | 0.00         | `struct_ref_`        |
| 3.39  | 123.25             | 6.22          | 2109164356 | 0.00        | 0.00         | `push_cc`            |
| 3.39  | 129.46             | 6.21          |            |             |              | `_init`              |
| 2.27  | 133.63             | 4.17          | 1742421    | 0.00        | 0.00         | `assoc_string`       |
| 1.96  | 137.22             | 3.59          | 2235192336 | 0.00        | 0.00         | `cons`               |
| 1.79  | 140.50             | 3.28          | 215232553  | 0.00        | 0.00         | `apply_builtin`      |
| 1.54  | 143.33             | 2.83          | 679409558  | 0.00        | 0.00         | `make_value_cell`    |
| 1.49  | 146.06             | 2.73          | 200300883  | 0.00        | 0.00         | `pairlis`            |
| 1.18  | 148.22             | 2.16          | 2109168010 | 0.00        | 0.00         | `gc_pop_frame`       |
| 1.13  | 150.28             | 2.07          | 925236148  | 0.00        | 0.00         | `cell_ref`           |
| 1.10  | 152.30             | 2.02          | 196474232  | 0.00        | 0.00         | `gc_copy`            |
| 1.07  | 154.26             | 1.96          | 419658520  | 0.00        | 0.00         | `lookup_value`       |
| 1.04  | 156.17             | 1.91          | 411189650  | 0.00        | 0.00         | `check_formals`      |
| 1.02  | 158.03             | 1.87          | 423486766  | 0.00        | 0.00         | `lookup_binding`     |
| 0.98  | 159.82             | 1.79          | 183        | 0.01        | 0.01         | `gc_cellcpy`         |
| 0.91  | 161.50             | 1.68          | 891765473  | 0.00        | 0.00         | `assert_struct`      |
| 0.86  | 163.08             | 1.58          | 423486608  | 0.00        | 0.00         | `make_binding_`      |
| 0.78  | 164.51             | 1.43          | 183        | 0.01        | 0.03         | `gc_loop`            |
| 0.78  | 165.94             | 1.43          | 635403488  | 0.00        | 0.00         | `gc_check`           |
| ...   | ...                | ...           | ...        | ...         | ...          | ...                  |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `open_boot`          |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `open_output_file`   |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `read_boot`          |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `reader_read_octal`  |

#### The profiling data in context

The functions at the top are mostly the same in both cases: `eval_apply`, `gc_push_frame`, `push_cc`, `gc_peek_frame` and `make_cell`. They represent around 50% of the execution time (51.21% in `fib` and 51.28% in `mescc-mes`) and they are closely related. Before trying to explain why, we could spend a minute checking how they eat that amount of time. `eval_apply` is executed only a few times but takes a long time to run, while the rest are very fast functions (0.00 ms!) that are executed millions of times per second. The reason is that `eval_apply` is a long-running loop that is calling them. It's calling them **all the time**.

##### Deep in `eval_apply`

As I already explained in earlier episodes, `mes` is a tree-walk interpreter, but that doesn't say much about how things work. The current design of `mes` is very interesting. I didn't ask the author about it, but it feels like it was written in Scheme first and translated to C later. I say this because the `eval_apply` function, the core of the program, which expands the macros, traverses the tree, and does all the `eval`ing and the `apply`ing, is written as it would be written in Scheme: abusing the tail-call optimization[^tco] with mutually recursive functions.

[^tco]: Way before all the Scheme, compilers, interpreters and Guix, [I wrote a post about it]((unknown)/call-me-maybe.md).

That's pretty understandable: if you are writing a Scheme interpreter you'd like to write Scheme, so you Schemeify your C as much as you can.
Also, it happens that Scheme can be very elegantly described in Scheme, as the [GNU Mes manual mentions][mes-manual], and most of the literature about Scheme interpreters uses Scheme. This tradition goes back to SICP, and even further, to the invention (discovery?) of LISP itself.

[mes-manual]: https://www.gnu.org/software/mes/manual/mes.html#LISP-as-Maxwell_0027s-Equations-of-Software

But C is not very compatible with that programming style: the C stack would overflow due to the lack of tail-call optimization. GNU Mes overcomes that limitation by manually manipulating a stack, allocated by its memory manager, and emulating function calls with `goto`. Arguments are passed to these "functions" through some global variables that are understood as "registers" in this "virtual machine". Consider this piece of code:

``` clike
evlis:
  if (R1 == cell_nil)
    goto vm_return;
  if (R1->type != TPAIR)
    goto eval;
  push_cc (R1->car, R1, R0, cell_vm_evlis2);
  goto eval;
evlis2:
  push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
  goto evlis;
evlis3:
  R1 = cons (R2, R1);
  goto vm_return;
```

The "registers" are those `R1`, `R0` and so on, and we don't really need to know what they mean to understand how this whole thing works. Imagine we enter `evlis` with an `R1` that is a list: it's not `cell_nil` and its type is `TPAIR`, so we skip both `if` statements and reach the `push_cc`. This call pushes our current position in the code, `cell_vm_evlis2`, to the stack and then jumps to the `eval` "function". `eval` eventually ends by doing a `goto vm_return`, which pops the current frame and jumps to the top of the `eval_apply` function.

``` clike
vm_return:
  x = R1; // This saves the return value of the "call"
  gc_pop_frame ();
  R1 = x; // and puts it back, after it has been overwritten by the pop
  goto eval_apply;
```

And what do we find there? A huge `if` cascade that brings us back to the label where we left off:

``` clike
eval_apply:
  if (R3 == cell_vm_evlis2)
    goto evlis2; // Bingo!
  else if (R3 == cell_vm_evlis3)
    goto evlis3;
  else if (R3 == cell_vm_eval_check_func)
    goto eval_check_func;
  else if (R3 == cell_vm_eval2)
    goto eval2;
  // There are quite a few more below.
```

Didn't we just emulate a "function call"? I'm sure you can continue the execution from `evlis2` by yourselves, just with the code I pasted here, and understand how a list is evaluated in `mes`.

It might feel crazy, but that's because the goal of this approach was left aside for a minute during the explanation. I didn't mention that the `eval` label we jumped to might jump to many other places, including `apply`, which would jump to `eval` again, which, as we mentioned before, would blow up our stack if we were doing this with C functions. Doing it this (weird) way lets you control the stack and manually apply the tail-call optimization, i.e. just use `goto` without the `push_cc`.

This whole thing is looping all the time, doing `goto eval_apply` over and over again, and that's why the actual `eval_apply` function takes that long to run: it's the main loop of the interpreter. Now we are in a great position to understand why those other functions were called that many times. `push_cc` is the easiest one, as it is what prepares a call to one of those internal "functions" that require pushing to the stack. `gc_push_frame` is called by `push_cc` and is the function that actually does the pushing. `gc_peek_frame` is also called many times, but we don't see why in the listings I shared: `gc_peek_frame` is called by `gc_pop_frame`, which is exactly what the `vm_return` block does. The `make_cell` case is slightly different but also related. It's called mostly from `cons`, and its sibling `make_value_cell` is also called many times. These are the functions that allocate Scheme values.
Of course, we cannot simply eliminate the calls to these functions, because a Scheme interpreter always needs to create values, but my bet is that these functions, especially `cons`, are called more often than strictly needed because of the evaluation strategy I just described.

### Conclusions

There are many conclusions to draw from all this data and the design of the evaluator. The first and most obvious one is that the project proposal makes sense and **we could make `mes` way faster**, as Guile is, **even keeping the interpreter small**, as Chibi-Scheme shows. Of course, I don't expect to obtain a 10x performance improvement in the first iteration, but I believe we can do a lot by *"starting with the simplest thing that could possibly work"*[^janneke] if we keep a non-pessimization philosophy.

[^janneke]: I'm quoting Janneke, the author of GNU Mes, who has a very pragmatic approach to programming.

The question now is: what is *that* thing? I proposed to turn `mes` into a **bytecode interpreter**. This alone might sound like overkill, but of all the other options I considered it sounded like the simplest, and there is evidence that it could possibly work: Guile and Chibi-Scheme are bytecode compiler-interpreters, and, as the benchmarks showed, they can be very fast compared to GNU Mes' Scheme interpreter. Other possibilities, like the tree-traversal memoization applied in older versions of Guile, didn't feel like they would have such a profound impact on the evaluation process, and I felt they were going to be harder for me to implement.

But still, there's a lot more we can obtain from the data. Comparing `mes-gcc` and `mes-m2`, we could see that small tricks will not be enough if we want to improve the speed of the bootstrapping process.
It's not that hard to write some code in a different way so GCC does its magic and optimizes for a 10% improvement, but when it comes to M2-Planet we are alone: there's no compiler optimization to help us, and **we must be smart and use optimizations that go deeper, those that happen in the CPU**. Bytecode could give us some of the *data locality* we need.

We must remember the profiler was run on `mes-gcc`, but the information we obtained from it can be extrapolated to the `mes-m2` case: **the current implementation of `mes` is inefficient by design, because it was designed to be simple, not fast**. All the `push`ing and `pop`ing of Scheme values on that internal managed stack happens too often. An evaluator that runs an already processed program is, potentially, way faster than one that has to traverse the tree over and over again while it churns through all this `goto` soup, which is also pretty hard to read.

More specifically, a bytecode interpreter would not only reduce the execution time of the functions at the top of the profiler's list, but also of the many related ones in the very long tail of function calls that I didn't bother to paste in this post. **This cascade effect could add up to a huge performance improvement**, beyond the 50% of execution time we found in those 5 functions at the top. Everything that can be processed ahead of time (macro expansion, lambda hoisting, argument checking, compilation...) should be done first, and only once if possible, so the evaluator can focus on its main goal: running the code, as fast as possible.

What I didn't mention, though, are all the functions that don't look relevant in the profiling: they are just good.
The garbage collector and allocator in `mes` simply work and go unnoticed in the profiling, with around 1% of the execution time spent in the `gc_copy` function and a very fast allocation that only becomes noticeable when a huge number (hundreds of millions) of values are allocated.

When it comes to readability and auditability, one of the goals of the project, I personally consider the `eval_apply` function very hard to follow. There are still many tricks that I don't fully understand, and I have already spent several weeks reading and hacking on the interpreter, which isn't that large after all. Rewriting it is a chance to make it easier to read.

### Work done

Understanding the interpreter internals isn't an immediate task, so in order to get used to it I decided to go for some low-hanging fruit first.

First, I moved some operations to builtins and even created some new ones. The performance gain I saw was simply **zero**, probably because the benchmark is not designed to detect small improvements and because the main performance bottleneck is in the `eval_apply` function.

The same thing happened when I updated M2-Planet and added `switch` statements to replace chained `if`s, which were there because previous versions of M2-Planet didn't support `switch` statements. The change affected `mes-gcc`'s performance slightly, but not enough to consider it a big win, while `mes-m2` remained mostly the same, as M2-Planet doesn't apply any kind of optimization to `switch` statements. I still consider this change important for readability, so much so that it helped me detect a bug in the string reader (the newline escaping was broken), which I also fixed.

The `eval_apply` function has been a puzzle to me for a long time, to the point that I started to look for any possibility to make it shorter, simpler, or anything that could help me understand it. In that process I found some very specific cases in that `goto` soup that mentioned `pmatch`.
These were the result of a hack that avoided a hygiene issue in the `pmatch` macro. I changed the variable names in the macro and removed those special cases, slightly reducing the complexity of the `eval_apply` function and drawing attention to the lack of hygiene in our macro expander, which, by the way, is also intertwined in the `eval_apply` function.

Well, and I wrote the benchmark tool, which happens to be very useful as a test for all these small changes, too.

You can find all the changes in the `faster` branch of the following repo[^regret]:

[^regret]: What you cannot see are all the changes I tried and regretted later.

### Next steps

None of the work done really tackles the root issue, but hacking around is important to familiarize yourself with the code, and I'm the kind of person who requires quite a bit of context before getting into action. I think now I'm ready for the real deal: **the evaluator swallows most of the time, and we need to change it**. We just need a plan for it.

First, I have to prepare for the big change. Removing the macro expansion from the famous `eval_apply` function and reducing it to a plain Scheme evaluator (renamed to `eval_core`) that can only work with the basic forms and the builtins would help me on two fronts: working on the macro expansion alone, opening the door to writing a fully hygienic macro expander from scratch; and reducing the `eval_core` function even more so it can be replaced later.

As I write these lines, I just finished the removal. I can `git show` and see all the macro-expansion-related functionality in red. Now I can get to work. I have to write a macro expander. I have yet to decide whether I should write it first in Scheme (without macros) and translate it later, or write it directly in C. In any case, the macro expansion step will be separated from the evaluation.
The separation is not absolute, because the macro expander would make use of the brand-new `eval_core` function to run the macros, but at least the two concerns would be clearly split in the code. Consider this step zero.

With the split done, it would be time to start thinking about compilation and evaluation. At first I didn't know very well what to do, because if the evaluator works on bytecode, how would I expand the macros? I gave this too much thought, but it wasn't that difficult, really. If the macro expander is not only a macro expander but also a compiler, the bodies of the macros can be compiled as they come. The macro expansion would still work through the `eval_core` function, the same as it did, but now it wouldn't eval Scheme code, but bytecode. Even more: we could compile all the Scheme code as we read it for macro expansion, so we could use helper functions from the current file in the macro expansion step, and we would traverse the tree only once.

Time to write a macro expander, then. Another thing I have never done before. I'm sure it will be a lot of fun.