author     Ekaitz Zarraga <ekaitz@elenq.tech>   2025-06-03 00:48:17 +0200
committer  Ekaitz Zarraga <ekaitz@elenq.tech>   2025-06-04 23:31:02 +0200
commit     3a3322ca63c300abc56c4a5452150fdfdd73cd44 (patch)
tree       bf2dd0485935e2d7ca00a30a17e9b58595ee1b69 /content/fasterMes/01_performance_report.md
parent     39af79fd69c15fa001129567ec6508a5c0a95177 (diff)
fasterMes: 01_performance_report.md: Add it.
Diffstat (limited to 'content/fasterMes/01_performance_report.md')
-rw-r--r--  content/fasterMes/01_performance_report.md  547
1 files changed, 547 insertions, 0 deletions
diff --git a/content/fasterMes/01_performance_report.md b/content/fasterMes/01_performance_report.md
new file mode 100644
index 0000000..fb219b8
--- /dev/null
+++ b/content/fasterMes/01_performance_report.md
@@ -0,0 +1,547 @@
+Title: Understanding GNU Mes' performance
+Date: 2025-06-02
+Category:
+Tags: GNU Mes interpreter speedup
+Slug: fasterMes1
+Lang: en
+Summary: How fast is GNU Mes' scheme interpreter really?
+Status: draft
+
+Before[^heh] starting our work on [the project][project] we should, first,
+make sure our hypothesis is true and GNU Mes could indeed be faster and,
+second, try to understand where we could improve the execution speed.
+
+[^heh]: Hypothetically speaking, because the work has already started. But
+ ssssh...
+
+[project]: /tag/gnu-mes-interpreter-speedup.html
+
+### The benchmarking tool
+
+In order to check how fast GNU Mes is, I created a very simple benchmarking
+tool. The tool just runs GNU Mes under `time` (the command, not the built-in),
+and creates a report in a bunch of text files. Of course, `time` is not a great
+way to benchmark, but my hypothesis was that `mes` (remember, the scheme
+interpreter in GNU Mes) could be way faster than it is, so, if that's true,
+`time` should be more than enough for the comparison.
+
+This is done in the `simple.make` makefile, which has been revived to serve
+this development period. The `simple.make` makefile is important because
+normally GNU Mes is built from scratch, including the `meslibc` (remember, the
+C standard library in GNU Mes), which takes a long time. The `simple.make`
+file builds just `mes`, the interpreter, with either GCC or M2-Planet, so the
+iteration time during development is acceptable: almost instantaneous, in
+fact.
+
+So, in summary, this:
+
+``` bash
+make -f simple.make benchmark
+```
+
+would run the benchmark for the default `mes`, the one built with GCC (which
+is of course built first if needed). But we also have `benchmark-m2`, for the
+`mes` built with M2-Planet.
+
+The benchmark at this very moment runs three cases:
+
+- `fib`: A simple scheme program that finds the 35th element of the Fibonacci
+ sequence.
+- `mescc-hello`: A `mescc` call that compiles a simple hello-world program
+ written in C.
+- `mescc-mes`: A `mescc` call that compiles `mes`'s source code.
+
+These three cases cover very basic operations but let us know if our changes
+are relevant to `mes`'s main job: compiling some C code. The `fib` check also
+gives us some interesting feedback about the evaluation speed, especially in a
+very recursive program, and `mescc-hello` is a minimal test that shows us the
+startup speed of the `mescc` compiler.
+
+However, that alone doesn't give us a lot of information, because we don't
+have anything to compare against. The Scheme interpreter in GNU Mes is
+compatible with Guile, meaning it could be replaced with Guile at any moment
+(maybe not the other way around). There we have our great candidate for
+comparison.
+
+#### The data
+
+The following table shows the wall-time results of all the benchmarks, in
+seconds, written in `Average ±StdDev` form. Each benchmark is run 10 times on
+my laptop (*i7-10510U*).
+
+| | mes-gcc | mes-m2 | guile | guile<sup>†</sup> | guile<sup>†‡</sup> |
+|:--------------|-------------:|-------------:|------------:|------------------:|----------------------:|
+| Fib | 21.39 ±0.66 | 49.13 ±0.11 | 0.27 ±0.00 | 3.57 ±0.27 | 8.67 ±0.38 |
+| Mescc-hello | 1.38 ±0.04 | 2.69 ±0.01 | 0.41 ±0.01 | 0.45 ±0.03 | 1.03 ±0.04 |
+| Mescc-mes | 103.39 ±3.03 | 214.88 ±0.34 | 17.03 ±0.30 | 18.70 ±0.95 | 27.34 ±1.09 |
+
+<dl>
+<dt>†</dt> <dd>With <code>GUILE_AUTO_COMPILE=0</code></dd>
+<dt>‡</dt> <dd>With <code>GUILE_JIT_THRESHOLD=-1</code></dd>
+</dl>
+
+From left to right: _mes-gcc_ means `mes` built with default GCC (including
+optimizations), _mes-m2_ means `mes` built with M2-Planet and _guile_ means
+running `guile` (`v3.0.9`) instead of `mes`.
+
+The first conclusion is pretty obvious:
+
+> Guile is way faster than Mes when compiling C code, even with all the
+> optimizations disabled.
+
+##### Unfair!
+
+But that conclusion may look unfair to most of you, especially because Guile
+is a pretty large and powerful Scheme implementation, while Mes is a simple
+Scheme interpreter designed to be as minimal as possible.
+
+Chibi-Scheme is a very interesting small Scheme interpreter that happens to be
+somewhat slow compared with other major Scheme implementations. In terms of
+lines of code, `chibi-scheme` is around twice the size of `mes`, which is
+still a manageable size. Chibi-Scheme is not compatible with Mes the way Guile
+is, so we can only run the `fib` benchmark with it. The following table shows
+a comparison between the best Mes scenario and Chibi-Scheme.
+
+| | Mes | Chibi-Scheme |
+|:---------------------------------|------------:|-------------:|
+| Lines of Code<sup>†</sup> | 5838 | 13135 |
+| Fib<sup>‡</sup> (s) | 21.39 ±0.66 | 1.77 ±0.03 |
+
+<dl>
+<dt>†</dt> <dd>Only counts the lines in <code>.c</code> files</dd>
+<dt>‡</dt> <dd>Both built with GCC</dd>
+</dl>
+
+That's quite an improvement for the size!
+
+#### Conclusion from the benchmark
+
+Before drawing any conclusions, we should make sure our assumption is true and
+the differences are large enough to be captured by the `time` command. Looking
+at the orders of magnitude we are dealing with here, I would say **the data we
+obtained is a valid reference point**, as the differences are large enough to
+be properly captured by `time` and our goal is to have a noticeable performance
+improvement.
+
+From there, we understand the obvious: Guile is way faster than Mes, even with
+compilation disabled (although disabling it makes a huge difference for
+Guile). Chibi-Scheme is also way faster, but not as fast as Guile with all
+optimizations enabled.
+
+When it comes to the compiler used to build Mes, the `mes` compiled with
+M2-Planet is roughly twice as slow as the one compiled with GCC. This is a
+very relevant detail to keep in mind during development. GCC's optimizations
+account for a huge performance improvement in `mes`, and we cannot rely on
+them during the bootstrapping process.
+
+
+### Profiling
+
+Still, we are far from justifying a full rewrite of the evaluator, and even
+further from rewriting `mes` as a bytecode interpreter, as proposed in the
+project presentation. That's why I decided to run a profiler, again, the
+simplest possible, so I could get some insights from the execution of the
+program.
+
+I ran `gprof` on `mes` for both the `fib` and `mescc-mes` cases[^benchmark-useful]
+and I summarize the output of the flat profile below, right after a little
+reminder of what all the columns in the table mean.
+
+Just to clarify, `gprof` requires the program to be built by GCC with the
+`-pg` flag; we don't have a profiler for the program built by M2-Planet, but I
+think the insights obtained from this can be extrapolated to `mes-m2` in most
+cases. You'll see why in a minute.
+
+[^benchmark-useful]: See how the benchmark is useful?
+
+###### Legend
+
+<dl>
+<dt>% time</dt>
+<dd>the percentage of the total running time of the program
+used by this function.</dd>
+
+<dt>cumulative seconds</dt>
+<dd>a running sum of the number of seconds accounted for by this
+function and those listed above it.</dd>
+
+<dt>self seconds</dt>
+<dd>the number of seconds accounted for by this function alone. This is
+the major sort for this listing.</dd>
+
+<dt>calls</dt>
+<dd>the number of times this function was invoked, if this function is
+profiled, else blank.</dd>
+
+<dt>self ms/call</dt>
+<dd>the average number of milliseconds spent in this function per call,
+if this function is profiled, else blank.</dd>
+
+<dt>total ms/call</dt>
+<dd>the average number of milliseconds spent in this function and its
+descendents per call, if this function is profiled, else blank.</dd>
+
+<dt>name</dt>
+<dd>the name of the function. This is the minor sort for this listing.
+The index shows the location of the function in the gprof listing. If
+the index is in parenthesis it shows where it would appear in the gprof
+listing if it were to be printed.</dd>
+</dl>
+
+###### Gprof flat-profile for `fib`
+
+| %time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
+|------:|-------------------:|--------------:|----------:|-------------:|--------------:|:---------------------|
+| 20.06 | 4.01 | 4.01 | 37 | 108.38 | 416.96 | `eval_apply` |
+| 11.26 | 6.26 | 2.25 | 600202232 | 0.00 | 0.00 | `gc_push_frame` |
+| 7.10 | 7.68 | 1.42 | 600202232 | 0.00 | 0.00 | `gc_peek_frame` |
+| 6.83 | 9.05 | 1.37 | 481504722 | 0.00 | 0.00 | `make_cell` |
+| 6.03 | 10.25 | 1.21 | 600200793 | 0.00 | 0.00 | `push_cc` |
+| 5.48 | 11.35 | 1.10 | 300124024 | 0.00 | 0.00 | `struct_ref_` |
+| 4.70 | 12.29 | 0.94 | | | | `_init` |
+| 3.70 | 13.03 | 0.74 | 269868352 | 0.00 | 0.00 | `make_value_cell` |
+| 2.35 | 13.50 | 0.47 | 600202232 | 0.00 | 0.00 | `gc_pop_frame` |
+| 2.15 | 13.93 | 0.43 | 406099172 | 0.00 | 0.00 | `cons` |
+| 2.15 | 14.36 | 0.43 | 74983231 | 0.00 | 0.00 | `apply_builtin` |
+| 2.00 | 14.76 | 0.40 | 300279513 | 0.00 | 0.00 | `cell_ref` |
+| 1.95 | 15.15 | 0.39 | 300124024 | 0.00 | 0.00 | `assert_struct` |
+| 1.85 | 15.52 | 0.37 | 135158343 | 0.00 | 0.00 | `length__` |
+| 1.60 | 15.84 | 0.32 | 29862855 | 0.00 | 0.00 | `minus` |
+| 1.38 | 16.11 | 0.28 | 300124024 | 0.00 | 0.00 | `assert_range` |
+| 1.38 | 16.39 | 0.28 | 75315259 | 0.00 | 0.00 | `assq` |
+| 1.18 | 16.62 | 0.24 | 75280752 | 0.00 | 0.00 | `lookup_binding` |
+| 1.08 | 16.84 | 0.22 | 75280689 | 0.00 | 0.00 | `make_binding_` |
+| 1.05 | 17.05 | 0.21 | 269866060 | 0.00 | 0.00 | `make_number` |
+| 1.05 | 17.26 | 0.21 | 173 | 1.21 | 1.21 | `macro_set_x` |
+| 1.00 | 17.46 | 0.20 | 135178741 | 0.00 | 0.00 | `gc_check` |
+| 0.90 | 17.64 | 0.18 | 30093052 | 0.00 | 0.00 | `pairlis` |
+| 0.85 | 17.81 | 0.17 | 105069896 | 0.00 | 0.00 | `check_formals` |
+| 0.83 | 17.97 | 0.17 | 29861090 | 0.00 | 0.00 | `less_p` |
+| ... | ... | ... | ... | ... | ... | ... |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `open_boot` |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 3.09 | `read_boot` |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `reader_read_octal` |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `try_open_boot` |
+
+###### Gprof flat-profile for `mescc-mes`
+
+| %time | cumulative seconds | self seconds | calls | self s/call | total s/call | name |
+|------:|-------------------:|--------------:|-----------:|------------:|-------------:|:---------------------|
+| 31.64 | 58.01 | 58.01 | 81 | 0.72 | 1.91 | `eval_apply` |
+| 7.53 | 71.82 | 13.81 | 428598051 | 0.00 | 0.00 | `assq` |
+| 6.11 | 83.03 | 11.21 | 2789709236 | 0.00 | 0.00 | `make_cell` |
+| 5.96 | 93.95 | 10.92 | 2109168010 | 0.00 | 0.00 | `gc_push_frame` |
+| 4.88 | 102.89 | 8.94 | 607787117 | 0.00 | 0.00 | `length__` |
+| 4.11 | 110.43 | 7.53 | 2109168010 | 0.00 | 0.00 | `gc_peek_frame` |
+| 3.61 | 117.04 | 6.61 | 891765473 | 0.00 | 0.00 | `struct_ref_` |
+| 3.39 | 123.25 | 6.22 | 2109164356 | 0.00 | 0.00 | `push_cc` |
+| 3.39 | 129.46 | 6.21 | | | | `_init` |
+| 2.27 | 133.63 | 4.17 | 1742421 | 0.00 | 0.00 | `assoc_string` |
+| 1.96 | 137.22 | 3.59 | 2235192336 | 0.00 | 0.00 | `cons` |
+| 1.79 | 140.50 | 3.28 | 215232553 | 0.00 | 0.00 | `apply_builtin` |
+| 1.54 | 143.33 | 2.83 | 679409558 | 0.00 | 0.00 | `make_value_cell` |
+| 1.49 | 146.06 | 2.73 | 200300883 | 0.00 | 0.00 | `pairlis` |
+| 1.18 | 148.22 | 2.16 | 2109168010 | 0.00 | 0.00 | `gc_pop_frame` |
+| 1.13 | 150.28 | 2.07 | 925236148 | 0.00 | 0.00 | `cell_ref` |
+| 1.10 | 152.30 | 2.02 | 196474232 | 0.00 | 0.00 | `gc_copy` |
+| 1.07 | 154.26 | 1.96 | 419658520 | 0.00 | 0.00 | `lookup_value` |
+| 1.04 | 156.17 | 1.91 | 411189650 | 0.00 | 0.00 | `check_formals` |
+| 1.02 | 158.03 | 1.87 | 423486766 | 0.00 | 0.00 | `lookup_binding` |
+| 0.98 | 159.82 | 1.79 | 183 | 0.01 | 0.01 | `gc_cellcpy` |
+| 0.91 | 161.50 | 1.68 | 891765473 | 0.00 | 0.00 | `assert_struct` |
+| 0.86 | 163.08 | 1.58 | 423486608 | 0.00 | 0.00 | `make_binding_` |
+| 0.78 | 164.51 | 1.43 | 183 | 0.01 | 0.03 | `gc_loop` |
+| 0.78 | 165.94 | 1.43 | 635403488 | 0.00 | 0.00 | `gc_check` |
+| ... | ... | ... | ... | ... | ... | ... |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `open_boot` |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `open_output_file` |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `read_boot` |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `reader_read_octal` |
+
+#### The profiling data in context
+
+The functions at the top are mostly the same in both cases: `eval_apply`,
+`gc_push_frame`, `push_cc`, `gc_peek_frame` and `make_cell`. They represent
+around 50% of the execution time (51.28% in `fib` and 51.21% in `mescc-mes`)
+and they are closely related.
+
+Before trying to explain why, we could spend a minute checking how they are
+eating that amount of time. `eval_apply` is executed a few times but it takes a
+long time to run, while the rest of them are very fast functions (0.00 ms!) but
+they are executed millions of times per second.
+
+The reason for that is that `eval_apply` is a long-running loop that is
+calling them. It's calling them **all the time**.
+
+##### Deep in `eval_apply`
+
+As I already explained in earlier episodes, `mes` is a tree-walk interpreter,
+but that doesn't say much about how things work.
+
+The current design of `mes` is very interesting. I didn't ask the author about
+it but it feels like it was written in Scheme first and translated to C later.
+I say this because the `eval_apply` function, the core of the program, which
+expands the macros, traverses the tree, and does all the `eval`ing and the
+`apply`ing, is written as it would be written in Scheme: abusing the tail-call
+optimization[^tco] with mutually recursive functions.
+
+[^tco]: Way before all the Scheme, compilers, interpreters and Guix, [I
+ wrote a post about it]({filename}/call-me-maybe.md).
+
+That's pretty understandable: if you are writing a Scheme interpreter you'd
+like to write Scheme, so you Schemeify your C as much as you can. Also, it
+happens that Scheme can be very elegantly described in Scheme, as the [GNU Mes
+manual mentions][mes-manual], and most of the literature about Scheme
+interpreters uses Scheme. This tradition goes back to SICP, and even further
+to the invention (discovery?) of LISP itself.
+
+[mes-manual]: https://www.gnu.org/software/mes/manual/mes.html#LISP-as-Maxwell_0027s-Equations-of-Software
+
+But C is not very compatible with that programming style. The C stack would
+overflow due to the lack of tail-call optimization.
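+
+To make that concrete, here is a contrived, self-contained example (it is not
+Mes code, just an illustration): two mutually recursive functions written in
+the tail-call style Scheme encourages. C gives no guarantee of tail-call
+elimination (and M2-Planet performs none), so every call eats a C stack frame
+and a large enough input crashes the program.
+
+``` clike
+/* Contrived illustration, not Mes code. Compile with `gcc -O0` and the
+ * deep mutual recursion overflows the C stack. */
+#include <stdio.h>
+
+int is_odd (long n);
+
+int
+is_even (long n)
+{
+  if (n == 0)
+    return 1;
+  return is_odd (n - 1);   /* tail call, but C may not eliminate it */
+}
+
+int
+is_odd (long n)
+{
+  if (n == 0)
+    return 0;
+  return is_even (n - 1);  /* tail call, but C may not eliminate it */
+}
+
+int
+main (void)
+{
+  /* Deep enough to overflow the stack when calls are not eliminated. */
+  printf ("%d\n", is_even (100000000));
+  return 0;
+}
+```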
+
+GNU Mes overcomes that limitation by manually manipulating a stack that is
+allocated by its memory manager, and by emulating function calls with `goto`.
+Arguments are passed to these "functions" through some global variables that
+are understood as "registers" in this "virtual machine".
+
+Consider this piece of code:
+
+``` clike
+evlis:
+ if (R1 == cell_nil)
+ goto vm_return;
+ if (R1->type != TPAIR)
+ goto eval;
+ push_cc (R1->car, R1, R0, cell_vm_evlis2);
+ goto eval;
+evlis2:
+ push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
+ goto evlis;
+evlis3:
+ R1 = cons (R2, R1);
+ goto vm_return;
+```
+
+The "registers" are those `R1`, `R0` and so on, and we don't really need to see
+what they mean to understand how this whole thing works.
+
+Imagine we enter `evlis` with an `R1` that is a list: it's not `cell_nil` and
+its type is `TPAIR`. So we skip both `if` statements and we reach the
+`push_cc`. This call pushes our current position in the code,
+`cell_vm_evlis2`, to the stack, and then we jump to the `eval` "function".
+`eval` eventually ends, doing a `goto vm_return`, which pops the current frame
+and jumps to the top of the `eval_apply` function.
+
+``` clike
+vm_return:
+ x = R1; // This saves the return value of the "call"
+ gc_pop_frame ();
+ R1 = x; // and puts it back, after it has been overwritten by the pop
+ goto eval_apply;
+```
+
+And, what do we find there?
+
+A huge `if` cascade that brings us to the label where we left off:
+
+``` clike
+eval_apply:
+ if (R3 == cell_vm_evlis2)
+ goto evlis2; // Bingo!
+ else if (R3 == cell_vm_evlis3)
+ goto evlis3;
+ else if (R3 == cell_vm_eval_check_func)
+ goto eval_check_func;
+ else if (R3 == cell_vm_eval2)
+ goto eval2;
+ // There are quite a few more below.
+```
+
+Didn't we just emulate a "function call"?
+
+I'm sure you can continue with the execution from `evlis2` by yourselves, just
+with the code I pasted here, and understand how a list is evaluated in `mes`.
+
+It might feel crazy, but that's because during the explanation the goal of this
+approach was left aside for a minute. I didn't mention that the `eval` label we
+jumped to might be jumping to many other places, including `apply`, which would
+jump to `eval` again, which, as we mentioned before, would blow up our stack if
+we were doing this in C functions. Doing it this (weird) way lets you control
+the stack and manually apply the tail-call optimization, i.e. just use `goto`
+without the `push_cc`.
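+
+Going back to the contrived even/odd toy from before, this is what that manual
+trick looks like there: one function, labels instead of functions, and a plain
+`goto` for every tail call, so the C stack stays flat no matter how deep the
+"recursion" goes (again, an illustration, not Mes code).
+
+``` clike
+/* Illustration only: the earlier toy rewritten in Mes' style, with
+ * labels and `goto` playing the role of tail calls. It runs in
+ * constant C stack space. */
+#include <stdio.h>
+
+static int
+is_even (long n)
+{
+even:
+  if (n == 0)
+    return 1;
+  n = n - 1;
+  goto odd;                     /* tail call: plain goto, no frame */
+odd:
+  if (n == 0)
+    return 0;
+  n = n - 1;
+  goto even;                    /* tail call: plain goto, no frame */
+}
+
+int
+main (void)
+{
+  printf ("%d\n", is_even (100000000));  /* no stack overflow now */
+  return 0;
+}
+```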
+
+This whole thing is looping all the time, doing `goto eval_apply` over and
+over again, and that's why the actual `eval_apply` function takes so long to
+run. It's the main loop of the interpreter.
+
+Now we are in a great position to understand why those other functions were
+called that many times. `push_cc` is the easiest one, as it's the one that
+prepares a call to one of those internal "functions" that require pushing to
+the stack. `gc_push_frame` is called by `push_cc` and is the function that
+actually does the pushing. `gc_peek_frame` is also called many times, but the
+reason doesn't show in the listings I shared: it's called by `gc_pop_frame`,
+which is exactly what the `vm_return` block does.
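+
+To make those call counts more tangible, here is a heavily simplified,
+self-contained sketch of that frame machinery. The names mirror Mes, but the
+frame layout and the fixed-size stack are invented for illustration; the real
+functions work on Mes' garbage-collected memory and go through
+`gc_peek_frame`.
+
+``` clike
+/* Simplified sketch; the layout is invented, not Mes' actual one. */
+#include <assert.h>
+
+typedef long cell_t;            /* stand-in for a Scheme value */
+
+#define STACK_SIZE 1024
+static cell_t g_stack[STACK_SIZE];
+static long g_stack_ptr = STACK_SIZE;
+
+static cell_t R0, R1, R2, R3;   /* the "registers" of the "VM" */
+
+/* Save the registers in a new frame on the managed stack. */
+static void
+gc_push_frame (void)
+{
+  assert (g_stack_ptr >= 4);    /* the real one grows the arena or GCs */
+  g_stack[--g_stack_ptr] = R0;
+  g_stack[--g_stack_ptr] = R1;
+  g_stack[--g_stack_ptr] = R2;
+  g_stack[--g_stack_ptr] = R3;
+}
+
+/* Restore the registers from the top frame and drop it. */
+static void
+gc_pop_frame (void)
+{
+  R3 = g_stack[g_stack_ptr++];
+  R2 = g_stack[g_stack_ptr++];
+  R1 = g_stack[g_stack_ptr++];
+  R0 = g_stack[g_stack_ptr++];
+}
+
+/* Prepare a non-tail "call": store the label to come back to (c) in
+ * the frame and load the arguments into the registers. */
+static void
+push_cc (cell_t p1, cell_t p2, cell_t a, cell_t c)
+{
+  cell_t x = R3;
+  R3 = c;
+  R2 = p2;
+  gc_push_frame ();
+  R1 = p1;
+  R0 = a;
+  R3 = x;
+}
+
+int
+main (void)
+{
+  R0 = 10; R1 = 11; R2 = 12; R3 = 13;
+  push_cc (100, 101, 102, 42); /* "call" with continuation label 42 */
+  gc_pop_frame ();             /* "return": the frame comes back    */
+  assert (R3 == 42 && R1 == 11 && R0 == 10);
+  return 0;
+}
+```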
+
+The `make_cell` case is slightly different but also related. It's called
+mostly from `cons`, and its sibling `make_value_cell` is also called many
+times. These are the functions that allocate Scheme values. Of course, we
+cannot simply eliminate the calls to these functions, because a Scheme
+interpreter always needs to create values, but my bet is that these functions,
+especially `cons`, are called more often than strictly needed because of the
+evaluation strategy I just described.
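+
+As a rough mental model of why those two counters track each other (the real
+Mes cell layout and allocator are different, and garbage collected), `cons` is
+essentially a thin wrapper that asks `make_cell` for a fresh pair, so every
+argument list the evaluator rebuilds costs one allocation per element.
+
+``` clike
+/* Rough mental model only; not Mes' actual cell layout or allocator. */
+#include <stdio.h>
+#include <stdlib.h>
+
+enum type { TPAIR, TNUMBER };
+
+struct cell
+{
+  enum type type;
+  struct cell *car;
+  struct cell *cdr;
+};
+
+static struct cell *
+make_cell (enum type type, struct cell *car, struct cell *cdr)
+{
+  struct cell *c = malloc (sizeof (struct cell)); /* Mes uses its own arena */
+  c->type = type;
+  c->car = car;
+  c->cdr = cdr;
+  return c;
+}
+
+static struct cell *
+cons (struct cell *car, struct cell *cdr)
+{
+  return make_cell (TPAIR, car, cdr);
+}
+
+int
+main (void)
+{
+  /* A two-element list costs two make_cell calls; evaluating argument
+   * lists over and over multiplies these allocations. */
+  struct cell *lst = cons (NULL, cons (NULL, NULL));
+  printf ("%d\n", lst->type);
+  return 0;
+}
+```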
+
+### Conclusions
+
+There are many conclusions to draw from all this data and the design of the
+evaluator.
+
+The first and most obvious one is that it looks like the project proposal
+makes sense and **we could make `mes` way faster**, as Guile is, **even while
+keeping the interpreter small**, as Chibi-Scheme shows. Of course, I don't
+expect to obtain a 10x performance improvement in the first iteration, but I
+believe we can do a lot by *"starting with the simplest thing that could
+possibly work"*[^janneke] if we keep a non-pessimization philosophy.
+
+[^janneke]: I'm quoting Janneke, the author of GNU Mes, who has a very
+    pragmatic approach to programming.
+
+The question now is, what is *that* thing?
+
+I proposed to turn `mes` into a **bytecode interpreter**. This alone might
+sound like overkill, but of all the options I considered it sounded like the
+simplest, and there is evidence that it could possibly work: Guile and
+Chibi-Scheme are bytecode compiler-interpreters, and thanks to the benchmarks
+we saw they can be very fast compared to GNU Mes' Scheme interpreter.
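+
+To give an idea of what that means in practice, here is a tiny, self-contained
+sketch of a bytecode dispatch loop. The instruction set is invented for the
+example and is not a proposal for Mes' actual bytecode; the point is that the
+program becomes a flat array of opcodes and the evaluator becomes one tight
+loop over it, instead of a tree walk.
+
+``` clike
+/* Tiny bytecode interpreter sketch; the opcodes are made up. */
+#include <stdio.h>
+
+enum op { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };
+
+int
+main (void)
+{
+  /* "Compiled" program: roughly (display (+ 2 3)), as a flat array. */
+  long program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
+
+  long stack[256];
+  long sp = 0;                  /* stack pointer   */
+  long pc = 0;                  /* program counter */
+
+  /* The whole evaluator is one loop over a compact array: nothing to
+   * re-traverse, nothing to re-expand, good data locality. */
+  while (1)
+    switch (program[pc++])
+      {
+      case OP_PUSH:
+        stack[sp++] = program[pc++];
+        break;
+      case OP_ADD:
+        sp = sp - 1;
+        stack[sp - 1] = stack[sp - 1] + stack[sp];
+        break;
+      case OP_PRINT:
+        printf ("%ld\n", stack[--sp]);
+        break;
+      case OP_HALT:
+        return 0;
+      }
+}
+```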
+
+Other possibilities, like the tree-traversal memoization applied in older
+versions of Guile, didn't feel like they would have such a profound impact on
+the evaluation process, and I felt they were going to be harder for me to
+implement.
+
+But still, there's a lot more we can obtain from the data.
+
+When comparing `mes-gcc` and `mes-m2`, we could see that small tricks will not
+work if we want to improve the speed of the bootstrapping process. It's not
+that hard to write some code in a different way so GCC does its magic and
+optimizes for a 10% improvement, but when it comes to M2-Planet we are alone:
+there's no compiler optimization to help us, and **we must be smart and use
+optimizations that go deeper, those that happen in the CPU**. Bytecode could
+give us some of that *data locality* we need.
+
+We must remember the profiler was run on `mes-gcc`, but the information we
+obtained from it can be extrapolated to the `mes-m2` case: **the current
+implementation of `mes` is inefficient because it was designed to be simple,
+not fast**. All the `push`ing and `pop`ing of Scheme values on that internal
+managed stack happens too often.
+
+An evaluator that is able to run an already processed program is, potentially,
+way faster than one that has to traverse the tree over and over again while
+wading through this `goto` soup, which is also pretty hard to read. More
+specifically, a bytecode interpreter would not only reduce the execution time
+of the functions at the top of the profiler's list, but also of many others
+related to them in the very long tail of function calls that I didn't bother
+to paste in this post. **This cascade effect could add up to a huge
+performance improvement**, beyond that 50% of execution time we found in the
+five functions at the top.
+
+Everything that can be processed ahead of time (macro expansion, lambda
+hoisting, argument checking, compilation...) should be done first, and only
+once if possible, so the evaluator can focus on its main goal: running the
+code as fast as possible.
+
+What I didn't mention, though, are all the functions that don't look relevant
+in the profiling: they are just good. The garbage collector and allocator in
+`mes` simply work and go unnoticed in the profiling, with around 1% of the
+execution time in the `gc_copy` function and a very fast allocation that only
+becomes noticeable when a huge number (hundreds of millions) of values are
+allocated.
+
+When it comes to readability and auditability, one of the goals of the
+project, I personally consider the `eval_apply` function very hard to follow.
+There are still many tricks in it that I don't fully understand, even though I
+already spent several weeks reading and hacking on the interpreter, which
+isn't that large after all. Rewriting it is a chance to make it easier to
+read.
+
+
+### Work done
+
+Understanding the interpreter internals isn't an immediate task, so in order to
+get used to it I decided to go for some low-hanging fruit first.
+
+First, I moved some operations to builtins and even created some. The
+performance gain I saw was simply **zero**. Probably because the benchmark is
+not designed to detect small improvements and because the main performance
+bottleneck is in the `eval_apply` function.
+
+The same thing happened when I updated M2-Planet and added `switch` statements
+to replace chained `if`s that were there because M2-Planet didn't support
+`switch` statements in previous versions. The change affected `mes-gcc`'s
+performance slightly, but not enough to consider it a big win. `mes-m2`
+remained mostly the same, as M2-Planet doesn't apply any kind of optimization
+to `switch` statements. I still consider this change important for
+readability, so much so that it helped me detect a bug in the string reader
+(the newline escaping was broken), which I also fixed.
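+
+As an illustration of the kind of change that was (the type names and helpers
+here are made up, not the actual Mes ones), a dispatch written as a chain of
+`if`s gets rewritten as a `switch` with the same behaviour; GCC may then emit
+a jump table, while M2-Planet simply lowers it back to comparisons.
+
+``` clike
+/* Made-up example of the if-chain to switch rewrite; the names do not
+ * match the actual Mes code. */
+#include <stdio.h>
+
+enum type { TPAIR, TSYMBOL, TNUMBER };
+
+/* Before: a chain of comparisons, the only option while M2-Planet had
+ * no `switch` support. */
+static const char *
+describe_if (enum type t)
+{
+  if (t == TPAIR)
+    return "pair";
+  else if (t == TSYMBOL)
+    return "symbol";
+  else if (t == TNUMBER)
+    return "number";
+  return "unknown";
+}
+
+/* After: same behaviour, easier to read, and GCC may turn it into a
+ * jump table; M2-Planet just emits the comparisons again. */
+static const char *
+describe_switch (enum type t)
+{
+  switch (t)
+    {
+    case TPAIR:
+      return "pair";
+    case TSYMBOL:
+      return "symbol";
+    case TNUMBER:
+      return "number";
+    }
+  return "unknown";
+}
+
+int
+main (void)
+{
+  printf ("%s %s\n", describe_if (TPAIR), describe_switch (TNUMBER));
+  return 0;
+}
+```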
+
+The `eval_apply` function has been a puzzle to me for a long time, to the
+point that I started to look for any possibility of making it shorter,
+simpler, or anything else that could help me understand it. In that process I
+found some very specific cases in that `goto` soup that mentioned `pmatch`.
+These were the result of a hack that avoided a hygiene issue in the `pmatch`
+macro. I changed the variable names in the macro and removed those special
+cases, slightly reducing the complexity of the `eval_apply` function and
+bringing attention to the lack of hygiene in our macro expander, which, by the
+way, is also intertwined with the `eval_apply` function.
+
+Well, and I wrote the benchmark tool, which happens to be very useful as a
+test for all these small changes, too.
+
+You can find all the changes in the `faster` branch of the following
+repo[^regret]:
+
+<http://github.com/ekaitz-zarraga/mes>
+
+[^regret]: What you cannot see is all the changes I tried and regretted
+    later.
+
+### Next steps
+
+None of the work done really tackles the root issue, but hacking around is
+important to get familiar with the code, and I'm the kind of person who
+requires quite a bit of context before getting into action.
+
+I think now I'm ready for the real deal: **the evaluator swallows most of the
+time, and we need to change it**. We just need a plan for it.
+
+First I have to prepare for the big change. Removing the macro expansion from
+the famous `eval_apply` function and reducing it to a plain Scheme evaluator
+(renamed to `eval_core`) that can only work with the basic forms and the
+builtins would help me on two fronts: working on the macro expansion alone,
+opening the door to writing a fully hygienic macro expander from scratch; and
+reducing the `eval_core` function even more so it can be replaced later.
+
+As I write these lines, I have just finished the removal. I can `git show` and
+see all the macro-expansion-related functionality in red.
+
+Now I can get to work. I have to write a macro expander. I've yet to decide
+whether I should write it first in Scheme (without macros) and translate it
+later or write it directly in C. In either case, the macro expansion step
+would be separated from the evaluation.
+
+The separation is not absolute, because the macro expander would make use of
+the brand new `eval_core` function to run the macros, but at least we would
+have the two concerns clearly split in the code. Consider this step zero.
+
+With the split done, it would be time to start thinking about the compilation
+and evaluation. At first I didn't know very well what to do, because if the
+evaluator works on bytecode, how would I expand the macros?
+
+I gave this too much thought, but it wasn't that difficult, really. If the
+macro expander is not only a macro expander but also a compiler, the bodies of
+the macros can be compiled as they come. The macro expansion would still use
+the `eval_core` function, just as it did before, but now it would eval
+bytecode instead of Scheme code.
+
+What's more, we could compile all the Scheme code as we read it for macro
+expansion, so we could use helper functions from the current file in the macro
+expansion step, and we would traverse the tree only once.
+
+Time for writing a macro expander, then.
+
+Another thing I never did before. I'm sure it will be a lot of fun.