author     Ekaitz Zarraga <ekaitz@elenq.tech>   2025-06-03 00:48:17 +0200
committer  Ekaitz Zarraga <ekaitz@elenq.tech>   2025-06-04 23:31:02 +0200
commit     3a3322ca63c300abc56c4a5452150fdfdd73cd44 (patch)
tree       bf2dd0485935e2d7ca00a30a17e9b58595ee1b69 /content/fasterMes/01_performance_report.md
parent     39af79fd69c15fa001129567ec6508a5c0a95177 (diff)
fasterMes: 01_performance_report.md: Add it.
Diffstat (limited to 'content/fasterMes/01_performance_report.md')
-rw-r--r--  content/fasterMes/01_performance_report.md  547
1 files changed, 547 insertions, 0 deletions
diff --git a/content/fasterMes/01_performance_report.md b/content/fasterMes/01_performance_report.md
new file mode 100644
index 0000000..fb219b8
--- /dev/null
+++ b/content/fasterMes/01_performance_report.md
@@ -0,0 +1,547 @@
+Title: Understanding GNU Mes' performance
+Date: 2025-06-02
+Category:
+Tags: GNU Mes interpreter speedup
+Slug: fasterMes1
+Lang: en
+Summary: How fast is GNU Mes' scheme interpreter really?
+Status: draft
+
+Before[^heh] starting our work on [the project][project] we should, first,
+make sure our hypothesis is true and GNU Mes could indeed be faster and,
+second, try to understand where we could improve the execution speed.
+
+[^heh]: Hypothetically speaking, because the work has already started. But
+ ssssh...
+
+[project]: /tag/gnu-mes-interpreter-speedup.html
+
+### The benchmarking tool
+
+In order to check how fast GNU Mes is, I created a very simple benchmarking
+tool. The tool just runs GNU Mes under `time` (the command, not the built-in),
+and creates a report in a bunch of text files. Of course, `time` is not a great
+way to benchmark, but my hypothesis was that `mes` (remember, the scheme
+interpreter in GNU Mes) could be way faster than it is, so, if that's true,
+`time` should be more than enough for the comparison.
+
+This is done in the `simple.make` makefile, which has been revived to serve
+this development period. The `simple.make` makefile is important because
+normally GNU Mes is built from scratch, including the `meslibc` (remember, the
+C standard library in GNU Mes), which takes a long time. The `simple.make`
+file builds just `mes`, the interpreter, with either GCC or M2-Planet, so the
+iteration time during development is acceptable: almost instantaneous, in
+fact.
+
+So, in summary, this:
+
+``` bash
+make -f simple.make benchmark
+```
+
+would run the benchmark for the default `mes`, the one built with GCC (which
+is of course built first if needed). But we also have `benchmark-m2`, for the
+`mes` built with M2-Planet.
+
+The benchmark at this very moment runs three cases:
+
+- `fib`: A simple scheme program that finds the 35th element of the Fibonacci
+ sequence.
+- `mescc-hello`: A `mescc` call that compiles a simple hello-world program
+ written in C.
+- `mescc-mes`: A `mescc` call that compiles `mes`'s source code.
+
+These three cases cover very basic operations but let us know if our changes
+are relevant to `mes`'s main job: compiling some C code. The `fib` check also
+gives us some interesting feedback about the evaluation speed, especially in a
+very recursive program, and `mescc-hello` is a minimal test that shows us the
+startup speed of the `mescc` compiler.
+
+However, that alone doesn't give us a lot of information, because we don't
+have anything to compare against. The Scheme interpreter in GNU Mes is
+compatible with Guile, meaning it could be replaced with Guile at any moment
+(maybe not the other way around). There we have our great candidate for
+comparison.
+
+#### The data
+
+The following table shows the wall-time results of all the benchmarks, in
+seconds, written in `Average ±StdDev` form. Each benchmark is run 10 times on
+my laptop (*i7-10510U*).
+
+| | mes-gcc | mes-m2 | guile | guile<sup>†</sup> | guile<sup>†‡</sup> |
+|:--------------|-------------:|-------------:|------------:|------------------:|----------------------:|
+| Fib | 21.39 ±0.66 | 49.13 ±0.11 | 0.27 ±0.00 | 3.57 ±0.27 | 8.67 ±0.38 |
+| Mescc-hello | 1.38 ±0.04 | 2.69 ±0.01 | 0.41 ±0.01 | 0.45 ±0.03 | 1.03 ±0.04 |
+| Mescc-mes | 103.39 ±3.03 | 214.88 ±0.34 | 17.03 ±0.30 | 18.70 ±0.95 | 27.34 ±1.09 |
+
+<dl>
+<dt>†</dt> <dd>With <code>GUILE_AUTO_COMPILE=0</code></dd>
+<dt>‡</dt> <dd>With <code>GUILE_JIT_THRESHOLD=-1</code></dd>
+</dl>
+
+From left to right: _mes-gcc_ means `mes` built with default GCC (including
+optimizations), _mes-m2_ means `mes` built with M2-Planet and _guile_ means
+running `guile` (`v3.0.9`) instead of `mes`.
+
+The first conclusion is pretty obvious:
+
+> Guile is way faster than Mes when compiling C code, even with all the
+> optimizations disabled.
+
+##### Unfair!
+
+But that conclusion may look unfair to most of you, especially because Guile
+is a pretty large and powerful Scheme implementation, while Mes is a simple
+Scheme interpreter designed to be as minimal as possible.
+
+Chibi-Scheme is a very interesting small Scheme interpreter that happens to be
+somewhat slow compared with other major Scheme implementations. In terms of
+lines of code, `chibi-scheme` is around twice the size of `mes`, which is
+still a manageable size. Chibi-Scheme is not compatible with Mes the way Guile
+is, so we can only run the `fib` benchmark with it. The following table shows
+a comparison between the best Mes scenario and Chibi-Scheme.
+
+| | Mes | Chibi-Scheme |
+|:---------------------------------|------------:|-------------:|
+| Lines of Code<sup>†</sup> | 5838 | 13135 |
+| Fib<sup>‡</sup> (s) | 21.39 ±0.66 | 1.77 ±0.03 |
+
+<dl>
+<dt>†</dt> <dd>Only counts the lines in <code>.c</code> files</dd>
+<dt>‡</dt> <dd>Both built with GCC</dd>
+</dl>
+
+That's quite an improvement for the size!
+
+#### Conclusion from the benchmark
+
+Before drawing any conclusions, we should make sure our assumption is true and
+the differences are large enough to be captured by the `time` command. Looking
+at the orders of magnitude we are dealing with here, I would say **the data we
+obtained is a valid reference point**, as the differences are large enough to
+be properly captured by `time` and our goal is to have a noticeable performance
+improvement.
+
+From there, we understand the obvious: Guile is way faster than Mes, even with
+compilation disabled (although disabling it makes a huge difference for
+Guile). Chibi-Scheme is also way faster, but not as fast as Guile with all
+optimizations enabled.
+
+When it comes to the compiler used to build Mes, the `mes` compiled with
+M2-Planet is roughly twice as slow as the one compiled with GCC. This is a
+very relevant detail to keep in mind during development. GCC's optimizations
+account for a huge performance improvement in `mes`, and we cannot rely on
+them during the bootstrapping process.
+
+
+### Profiling
+
+Still, we are far from justifying a full rewrite of the evaluator, and even
+further from rewriting `mes` as a bytecode interpreter, as proposed in the
+project presentation. That's why I decided to run a profiler, again, the
+simplest possible, so I could get some insights from the execution of the
+program.
+
+I ran `gprof` on `mes` for both the `fib` and `mescc-mes` cases[^benchmark-useful]
+and I summarize the output of the flat profile below, right after a little
+reminder of what all the columns in the table mean.
+
+Just to clarify, `gprof` requires the program to be built by GCC with the
+`-pg` flag; we don't have a profiler for the program built by M2-Planet, but I
+think the insights obtained from this can be extrapolated to `mes-m2` in most
+cases. You'll see why in a minute.
+
+[^benchmark-useful]: See how the benchmark is useful?
+
+###### Legend
+
+<dl>
+<dt>% time</dt>
+<dd>the percentage of the total running time of the program
+used by this function.</dd>
+
+<dt>cumulative seconds</dt>
+<dd>a running sum of the number of seconds accounted for by this
+function and those listed above it.</dd>
+
+<dt>self seconds</dt>
+<dd>the number of seconds accounted for by this function alone. This is
+the major sort for this listing.</dd>
+
+<dt>calls</dt>
+<dd>the number of times this function was invoked, if this function is
+profiled, else blank.</dd>
+
+<dt>self ms/call</dt>
+<dd>the average number of milliseconds spent in this function per call,
+if this function is profiled, else blank.</dd>
+
+<dt>total ms/call</dt>
+<dd>the average number of milliseconds spent in this function and its
+descendents per call, if this function is profiled, else blank.</dd>
+
+<dt>name</dt>
+<dd>the name of the function. This is the minor sort for this listing.
+The index shows the location of the function in the gprof listing. If
+the index is in parenthesis it shows where it would appear in the gprof
+listing if it were to be printed.</dd>
+</dl>
+
+###### Gprof flat-profile for `fib`
+
+| %time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
+|------:|-------------------:|--------------:|----------:|-------------:|--------------:|:---------------------|
+| 20.06 | 4.01 | 4.01 | 37 | 108.38 | 416.96 | `eval_apply` |
+| 11.26 | 6.26 | 2.25 | 600202232 | 0.00 | 0.00 | `gc_push_frame` |
+| 7.10 | 7.68 | 1.42 | 600202232 | 0.00 | 0.00 | `gc_peek_frame` |
+| 6.83 | 9.05 | 1.37 | 481504722 | 0.00 | 0.00 | `make_cell` |
+| 6.03 | 10.25 | 1.21 | 600200793 | 0.00 | 0.00 | `push_cc` |
+| 5.48 | 11.35 | 1.10 | 300124024 | 0.00 | 0.00 | `struct_ref_` |
+| 4.70 | 12.29 | 0.94 | | | | `_init` |
+| 3.70 | 13.03 | 0.74 | 269868352 | 0.00 | 0.00 | `make_value_cell` |
+| 2.35 | 13.50 | 0.47 | 600202232 | 0.00 | 0.00 | `gc_pop_frame` |
+| 2.15 | 13.93 | 0.43 | 406099172 | 0.00 | 0.00 | `cons` |
+| 2.15 | 14.36 | 0.43 | 74983231 | 0.00 | 0.00 | `apply_builtin` |
+| 2.00 | 14.76 | 0.40 | 300279513 | 0.00 | 0.00 | `cell_ref` |
+| 1.95 | 15.15 | 0.39 | 300124024 | 0.00 | 0.00 | `assert_struct` |
+| 1.85 | 15.52 | 0.37 | 135158343 | 0.00 | 0.00 | `length__` |
+| 1.60 | 15.84 | 0.32 | 29862855 | 0.00 | 0.00 | `minus` |
+| 1.38 | 16.11 | 0.28 | 300124024 | 0.00 | 0.00 | `assert_range` |
+| 1.38 | 16.39 | 0.28 | 75315259 | 0.00 | 0.00 | `assq` |
+| 1.18 | 16.62 | 0.24 | 75280752 | 0.00 | 0.00 | `lookup_binding` |
+| 1.08 | 16.84 | 0.22 | 75280689 | 0.00 | 0.00 | `make_binding_` |
+| 1.05 | 17.05 | 0.21 | 269866060 | 0.00 | 0.00 | `make_number` |
+| 1.05 | 17.26 | 0.21 | 173 | 1.21 | 1.21 | `macro_set_x` |
+| 1.00 | 17.46 | 0.20 | 135178741 | 0.00 | 0.00 | `gc_check` |
+| 0.90 | 17.64 | 0.18 | 30093052 | 0.00 | 0.00 | `pairlis` |
+| 0.85 | 17.81 | 0.17 | 105069896 | 0.00 | 0.00 | `check_formals` |
+| 0.83 | 17.97 | 0.17 | 29861090 | 0.00 | 0.00 | `less_p` |
+| ... | ... | ... | ... | ... | ... | ... |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `open_boot` |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 3.09 | `read_boot` |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `reader_read_octal` |
+| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `try_open_boot` |
+
+###### Gprof flat-profile for `mescc-mes`
+
+| %time | cumulative seconds | self seconds | calls | self s/call | total s/call | name |
+|------:|-------------------:|--------------:|-----------:|------------:|-------------:|:---------------------|
+| 31.64 | 58.01 | 58.01 | 81 | 0.72 | 1.91 | `eval_apply` |
+| 7.53 | 71.82 | 13.81 | 428598051 | 0.00 | 0.00 | `assq` |
+| 6.11 | 83.03 | 11.21 | 2789709236 | 0.00 | 0.00 | `make_cell` |
+| 5.96 | 93.95 | 10.92 | 2109168010 | 0.00 | 0.00 | `gc_push_frame` |
+| 4.88 | 102.89 | 8.94 | 607787117 | 0.00 | 0.00 | `length__` |
+| 4.11 | 110.43 | 7.53 | 2109168010 | 0.00 | 0.00 | `gc_peek_frame` |
+| 3.61 | 117.04 | 6.61 | 891765473 | 0.00 | 0.00 | `struct_ref_` |
+| 3.39 | 123.25 | 6.22 | 2109164356 | 0.00 | 0.00 | `push_cc` |
+| 3.39 | 129.46 | 6.21 | | | | `_init` |
+| 2.27 | 133.63 | 4.17 | 1742421 | 0.00 | 0.00 | `assoc_string` |
+| 1.96 | 137.22 | 3.59 | 2235192336 | 0.00 | 0.00 | `cons` |
+| 1.79 | 140.50 | 3.28 | 215232553 | 0.00 | 0.00 | `apply_builtin` |
+| 1.54 | 143.33 | 2.83 | 679409558 | 0.00 | 0.00 | `make_value_cell` |
+| 1.49 | 146.06 | 2.73 | 200300883 | 0.00 | 0.00 | `pairlis` |
+| 1.18 | 148.22 | 2.16 | 2109168010 | 0.00 | 0.00 | `gc_pop_frame` |
+| 1.13 | 150.28 | 2.07 | 925236148 | 0.00 | 0.00 | `cell_ref` |
+| 1.10 | 152.30 | 2.02 | 196474232 | 0.00 | 0.00 | `gc_copy` |
+| 1.07 | 154.26 | 1.96 | 419658520 | 0.00 | 0.00 | `lookup_value` |
+| 1.04 | 156.17 | 1.91 | 411189650 | 0.00 | 0.00 | `check_formals` |
+| 1.02 | 158.03 | 1.87 | 423486766 | 0.00 | 0.00 | `lookup_binding` |
+| 0.98 | 159.82 | 1.79 | 183 | 0.01 | 0.01 | `gc_cellcpy` |
+| 0.91 | 161.50 | 1.68 | 891765473 | 0.00 | 0.00 | `assert_struct` |
+| 0.86 | 163.08 | 1.58 | 423486608 | 0.00 | 0.00 | `make_binding_` |
+| 0.78 | 164.51 | 1.43 | 183 | 0.01 | 0.03 | `gc_loop` |
+| 0.78 | 165.94 | 1.43 | 635403488 | 0.00 | 0.00 | `gc_check` |
+| ... | ... | ... | ... | ... | ... | ... |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `open_boot` |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `open_output_file` |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `read_boot` |
+| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `reader_read_octal` |
+
+#### The profiling data in context
+
+The functions at the top are mostly the same in both cases: `eval_apply`,
+`gc_push_frame`, `push_cc`, `gc_peek_frame` and `make_cell`. They represent
+around 50% of the execution time (51.28% in `fib` and 51.21% in `mescc-mes`)
+and they are closely related.
+
+Before trying to explain why, we could spend a minute checking how they are
+eating that amount of time. `eval_apply` is executed a few times but it takes a
+long time to run, while the rest of them are very fast functions (0.00 ms!) but
+they are executed millions of times per second.
+
+The reason for that is that `eval_apply` is a long-running loop that is
+calling them. It's calling them **all the time**.
+
+##### Deep in `eval_apply`
+
+As I already explained in earlier episodes, `mes` is a tree-walk interpreter,
+but that doesn't say much about how things work.
+
+The current design of `mes` is very interesting. I didn't ask the author about
+it but it feels like it was written in Scheme first and translated to C later.
+I say this because the `eval_apply` function, the core of the program, which
+expands the macros, traverses the tree, and does all the `eval`ing and the
+`apply`ing, is written as it would be written in Scheme: abusing the tail-call
+optimization[^tco] with mutually recursive functions.
+
+[^tco]: Way before all the Scheme, compilers, interpreters and Guix, [I
+ wrote a post about it]({filename}/call-me-maybe.md).
+
+That's pretty understandable: if you are writing a Scheme interpreter you'd
+like to write Scheme, so you Schemeify your C as much as you can. Also, it
+happens that Scheme can be very elegantly described in Scheme, as the [GNU Mes
+manual mentions][mes-manual], and most of the literature about Scheme
+interpreters uses Scheme. This tradition goes back to SICP, and even further
+to the invention (discovery?) of LISP itself.
+
+[mes-manual]: https://www.gnu.org/software/mes/manual/mes.html#LISP-as-Maxwell_0027s-Equations-of-Software
+
+But C is not very compatible with that programming style. The C stack would
+overflow due to the lack of tail-call optimization.
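+
+To make that concrete, here is a contrived, self-contained example (it is not
+Mes code, just an illustration): two mutually recursive functions written in
+the tail-call style Scheme encourages. C gives no guarantee of tail-call
+elimination (and M2-Planet performs none), so every call eats a C stack frame
+and a large enough input crashes the program.
+
+``` clike
+/* Contrived illustration, not Mes code. Compile with `gcc -O0` and the
+ * deep mutual recursion overflows the C stack. */
+#include <stdio.h>
+
+int is_odd (long n);
+
+int
+is_even (long n)
+{
+  if (n == 0)
+    return 1;
+  return is_odd (n - 1);   /* tail call, but C may not eliminate it */
+}
+
+int
+is_odd (long n)
+{
+  if (n == 0)
+    return 0;
+  return is_even (n - 1);  /* tail call, but C may not eliminate it */
+}
+
+int
+main (void)
+{
+  /* Deep enough to overflow the stack when calls are not eliminated. */
+  printf ("%d\n", is_even (100000000));
+  return 0;
+}
+```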
+
+GNU Mes overcomes that limitation by manually manipulating a stack that is
+allocated by its memory manager, and by emulating function calls with `goto`.
+Arguments are passed to these "functions" through some global variables that
+are understood as "registers" in this "virtual machine".
+
+Consider this piece of code:
+
+``` clike
+evlis:
+ if (R1 == cell_nil)
+ goto vm_return;
+ if (R1->type != TPAIR)
+ goto eval;
+ push_cc (R1->car, R1, R0, cell_vm_evlis2);
+ goto eval;
+evlis2:
+ push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
+ goto evlis;
+evlis3:
+ R1 = cons (R2, R1);
+ goto vm_return;
+```
+
+The "registers" are those `R1`, `R0` and so on, and we don't really need to see
+what they mean to understand how this whole thing works.
+
+Imagine we enter `evlis` with an `R1` that is a list: it's not `cell_nil` and
+its type is `TPAIR`. So we skip both `if` statements and we reach the
+`push_cc`. This call pushes our current position in the code,
+`cell_vm_evlis2`, to the stack, and then we jump to the `eval` "function".
+`eval` eventually ends, doing a `goto vm_return`, which pops the current frame
+and jumps to the top of the `eval_apply` function.
+
+``` clike
+vm_return:
+ x = R1; // This saves the return value of the "call"
+ gc_pop_frame ();
+ R1 = x; // and puts it back, after it has been overwritten by the pop
+ goto eval_apply;
+```
+
+And, what do we find there?
+
+A huge `if` cascade that brings us to the label where we left off:
+
+``` clike
+eval_apply:
+ if (R3 == cell_vm_evlis2)
+ goto evlis2; // Bingo!
+ else if (R3 == cell_vm_evlis3)
+ goto evlis3;
+ else if (R3 == cell_vm_eval_check_func)
+ goto eval_check_func;
+ else if (R3 == cell_vm_eval2)
+ goto eval2;
+ // There are quite a few more below.
+```
+
+Didn't we just emulate a "function call"?
+
+I'm sure you can continue with the execution from `evlis2` by yourselves, just
+with the code I pasted here, and understand how a list is evaluated in `mes`.
+
+It might feel crazy, but that's because during the explanation the goal of this
+approach was left aside for a minute. I didn't mention that the `eval` label we
+jumped to might be jumping to many other places, including `apply`, which would
+jump to `eval` again, which, as we mentioned before, would blow up our stack if
+we were doing this in C functions. Doing it this (weird) way lets you control
+the stack and manually apply the tail-call optimization, i.e. just use `goto`
+without the `push_cc`.
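+
+Going back to the contrived even/odd toy from before, this is what that manual
+trick looks like there: one function, labels instead of functions, and a plain
+`goto` for every tail call, so the C stack stays flat no matter how deep the
+"recursion" goes (again, an illustration, not Mes code).
+
+``` clike
+/* Illustration only: the earlier toy rewritten in Mes' style, with
+ * labels and `goto` playing the role of tail calls. It runs in
+ * constant C stack space. */
+#include <stdio.h>
+
+static int
+is_even (long n)
+{
+even:
+  if (n == 0)
+    return 1;
+  n = n - 1;
+  goto odd;                     /* tail call: plain goto, no frame */
+odd:
+  if (n == 0)
+    return 0;
+  n = n - 1;
+  goto even;                    /* tail call: plain goto, no frame */
+}
+
+int
+main (void)
+{
+  printf ("%d\n", is_even (100000000));  /* no stack overflow now */
+  return 0;
+}
+```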
+
+This whole thing is looping all the time, doing `goto eval_apply` over and
+over again, and that's why the actual `eval_apply` function takes so long to
+run. It's the main loop of the interpreter.
+
+Now we are in a great position to understand why those other functions were
+called that many times. `push_cc` is the easiest one, as it's the one that
+prepares a call to one of those internal "functions" that require pushing to
+the stack. `gc_push_frame` is called by `push_cc` and is the function that
+actually does the pushing. `gc_peek_frame` is also called many times, but the
+reason doesn't show in the listings I shared: it's called by `gc_pop_frame`,
+which is exactly what the `vm_return` block does.
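+
+To make those call counts more tangible, here is a heavily simplified,
+self-contained sketch of that frame machinery. The names mirror Mes, but the
+frame layout and the fixed-size stack are invented for illustration; the real
+functions work on Mes' garbage-collected memory and go through
+`gc_peek_frame`.
+
+``` clike
+/* Simplified sketch; the layout is invented, not Mes' actual one. */
+#include <assert.h>
+
+typedef long cell_t;            /* stand-in for a Scheme value */
+
+#define STACK_SIZE 1024
+static cell_t g_stack[STACK_SIZE];
+static long g_stack_ptr = STACK_SIZE;
+
+static cell_t R0, R1, R2, R3;   /* the "registers" of the "VM" */
+
+/* Save the registers in a new frame on the managed stack. */
+static void
+gc_push_frame (void)
+{
+  assert (g_stack_ptr >= 4);    /* the real one grows the arena or GCs */
+  g_stack[--g_stack_ptr] = R0;
+  g_stack[--g_stack_ptr] = R1;
+  g_stack[--g_stack_ptr] = R2;
+  g_stack[--g_stack_ptr] = R3;
+}
+
+/* Restore the registers from the top frame and drop it. */
+static void
+gc_pop_frame (void)
+{
+  R3 = g_stack[g_stack_ptr++];
+  R2 = g_stack[g_stack_ptr++];
+  R1 = g_stack[g_stack_ptr++];
+  R0 = g_stack[g_stack_ptr++];
+}
+
+/* Prepare a non-tail "call": store the label to come back to (c) in
+ * the frame and load the arguments into the registers. */
+static void
+push_cc (cell_t p1, cell_t p2, cell_t a, cell_t c)
+{
+  cell_t x = R3;
+  R3 = c;
+  R2 = p2;
+  gc_push_frame ();
+  R1 = p1;
+  R0 = a;
+  R3 = x;
+}
+
+int
+main (void)
+{
+  R0 = 10; R1 = 11; R2 = 12; R3 = 13;
+  push_cc (100, 101, 102, 42); /* "call" with continuation label 42 */
+  gc_pop_frame ();             /* "return": the frame comes back    */
+  assert (R3 == 42 && R1 == 11 && R0 == 10);
+  return 0;
+}
+```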
+
+The `make_cell` case is slightly different but also related. It's called
+mostly from `cons`, and its sibling `make_value_cell` is also called many
+times. These are the functions that allocate Scheme values. Of course, we
+cannot simply eliminate the calls to these functions, because a Scheme
+interpreter always needs to create values, but my bet is that these functions,
+especially `cons`, are called more often than strictly needed because of the
+evaluation strategy I just described.
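+
+As a rough mental model of why those two counters track each other (the real
+Mes cell layout and allocator are different, and garbage collected), `cons` is
+essentially a thin wrapper that asks `make_cell` for a fresh pair, so every
+argument list the evaluator rebuilds costs one allocation per element.
+
+``` clike
+/* Rough mental model only; not Mes' actual cell layout or allocator. */
+#include <stdio.h>
+#include <stdlib.h>
+
+enum type { TPAIR, TNUMBER };
+
+struct cell
+{
+  enum type type;
+  struct cell *car;
+  struct cell *cdr;
+};
+
+static struct cell *
+make_cell (enum type type, struct cell *car, struct cell *cdr)
+{
+  struct cell *c = malloc (sizeof (struct cell)); /* Mes uses its own arena */
+  c->type = type;
+  c->car = car;
+  c->cdr = cdr;
+  return c;
+}
+
+static struct cell *
+cons (struct cell *car, struct cell *cdr)
+{
+  return make_cell (TPAIR, car, cdr);
+}
+
+int
+main (void)
+{
+  /* A two-element list costs two make_cell calls; evaluating argument
+   * lists over and over multiplies these allocations. */
+  struct cell *lst = cons (NULL, cons (NULL, NULL));
+  printf ("%d\n", lst->type);
+  return 0;
+}
+```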
+
+### Conclusions
+
+There are many conclusions to draw from all this data and the design of the
+evaluator.
+
+The first and most obvious one is that it looks like the project proposal
+makes sense and **we could make `mes` way faster**, as Guile is, **even while
+keeping the interpreter small**, as Chibi-Scheme shows. Of course, I don't
+expect to obtain a 10x performance improvement in the first iteration, but I
+believe we can do a lot by *"starting with the simplest thing that could
+possibly work"*[^janneke] if we keep a non-pessimization philosophy.
+
+[^janneke]: I'm quoting Janneke, the author of GNU Mes, who has a very
+    pragmatic approach to programming.
+
+The question now is, what is *that* thing?
+
+I proposed to turn `mes` into a **bytecode interpreter**. This alone might
+sound like overkill, but of all the options I considered it sounded like the
+simplest, and there is evidence that it could possibly work: Guile and
+Chibi-Scheme are bytecode compiler-interpreters, and thanks to the benchmarks
+we saw they can be very fast compared to GNU Mes' Scheme interpreter.
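+
+To give an idea of what that means in practice, here is a tiny, self-contained
+sketch of a bytecode dispatch loop. The instruction set is invented for the
+example and is not a proposal for Mes' actual bytecode; the point is that the
+program becomes a flat array of opcodes and the evaluator becomes one tight
+loop over it, instead of a tree walk.
+
+``` clike
+/* Tiny bytecode interpreter sketch; the opcodes are made up. */
+#include <stdio.h>
+
+enum op { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };
+
+int
+main (void)
+{
+  /* "Compiled" program: roughly (display (+ 2 3)), as a flat array. */
+  long program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
+
+  long stack[256];
+  long sp = 0;                  /* stack pointer   */
+  long pc = 0;                  /* program counter */
+
+  /* The whole evaluator is one loop over a compact array: nothing to
+   * re-traverse, nothing to re-expand, good data locality. */
+  while (1)
+    switch (program[pc++])
+      {
+      case OP_PUSH:
+        stack[sp++] = program[pc++];
+        break;
+      case OP_ADD:
+        sp = sp - 1;
+        stack[sp - 1] = stack[sp - 1] + stack[sp];
+        break;
+      case OP_PRINT:
+        printf ("%ld\n", stack[--sp]);
+        break;
+      case OP_HALT:
+        return 0;
+      }
+}
+```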
+
+Other possibilities, like the tree-traversal memoization applied in older
+versions of Guile, didn't feel like they would have such a profound impact on
+the evaluation process, and I felt they were going to be harder for me to
+implement.
+
+But still, there's a lot more we can obtain from the data.
+
+When comparing `mes-gcc` and `mes-m2`, we could see that small tricks will not
+work if we want to improve the speed of the bootstrapping process. It's not
+that hard to write some code in a different way so GCC does its magic and
+optimizes for a 10% improvement, but when it comes to M2-Planet we are alone:
+there's no compiler optimization to help us, and **we must be smart and use
+optimizations that go deeper, those that happen in the CPU**. Bytecode could
+give us some of that *data locality* we need.
+
+We must remember the profiler was run on `mes-gcc`, but the information we
+obtained from it can be extrapolated to the `mes-m2` case: **the current
+implementation of `mes` is inefficient because it was designed to be simple,
+not fast**. All the `push`ing and `pop`ing of Scheme values on that internal
+managed stack happens too often.
+
+An evaluator that is able to run an already processed program is, potentially,
+way faster than one that has to traverse the tree over and over again while
+wading through this `goto` soup, which is also pretty hard to read. More
+specifically, a bytecode interpreter would not only reduce the execution time
+of the functions at the top of the profiler's list, but also of many others
+related to them in the very long tail of function calls that I didn't bother
+to paste in this post. **This cascade effect could add up to a huge
+performance improvement**, beyond that 50% of execution time we found in the
+five functions at the top.
+
+Everything that can be processed ahead of time (macro expansion, lambda
+hoisting, argument checking, compilation...) should be done first, and only
+once if possible, so the evaluator can focus on its main goal: running the
+code as fast as possible.
+
+What I didn't mention, though, are all the functions that don't look relevant
+in the profiling: they are just good. The garbage collector and allocator in
+`mes` simply work and go unnoticed in the profiling, with around 1% of the
+execution time in the `gc_copy` function and a very fast allocation that only
+becomes noticeable when a huge number (hundreds of millions) of values are
+allocated.
+
+When it comes to readability and auditability, one of the goals of the
+project, I personally consider the `eval_apply` function very hard to follow.
+There are still many tricks in it that I don't fully understand, even though I
+already spent several weeks reading and hacking on the interpreter, which
+isn't that large after all. Rewriting it is a chance to make it easier to
+read.
+
+
+### Work done
+
+Understanding the interpreter internals isn't an immediate task, so in order to
+get used to it I decided to go for some low-hanging fruit first.
+
+First, I moved some operations to builtins and even created some. The
+performance gain I saw was simply **zero**. Probably because the benchmark is
+not designed to detect small improvements and because the main performance
+bottleneck is in the `eval_apply` function.
+
+The same thing happened when I updated M2-Planet and added `switch` statements
+to replace chained `if`s that were there because M2-Planet didn't support
+`switch` statements in previous versions. The change affected `mes-gcc`'s
+performance slightly, but not enough to consider it a big win. `mes-m2`
+remained mostly the same, as M2-Planet doesn't apply any kind of optimization
+to `switch` statements. I still consider this change important for
+readability, so much so that it helped me detect a bug in the string reader
+(the newline escaping was broken), which I also fixed.
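+
+As an illustration of the kind of change that was (the type names and helpers
+here are made up, not the actual Mes ones), a dispatch written as a chain of
+`if`s gets rewritten as a `switch` with the same behaviour; GCC may then emit
+a jump table, while M2-Planet simply lowers it back to comparisons.
+
+``` clike
+/* Made-up example of the if-chain to switch rewrite; the names do not
+ * match the actual Mes code. */
+#include <stdio.h>
+
+enum type { TPAIR, TSYMBOL, TNUMBER };
+
+/* Before: a chain of comparisons, the only option while M2-Planet had
+ * no `switch` support. */
+static const char *
+describe_if (enum type t)
+{
+  if (t == TPAIR)
+    return "pair";
+  else if (t == TSYMBOL)
+    return "symbol";
+  else if (t == TNUMBER)
+    return "number";
+  return "unknown";
+}
+
+/* After: same behaviour, easier to read, and GCC may turn it into a
+ * jump table; M2-Planet just emits the comparisons again. */
+static const char *
+describe_switch (enum type t)
+{
+  switch (t)
+    {
+    case TPAIR:
+      return "pair";
+    case TSYMBOL:
+      return "symbol";
+    case TNUMBER:
+      return "number";
+    }
+  return "unknown";
+}
+
+int
+main (void)
+{
+  printf ("%s %s\n", describe_if (TPAIR), describe_switch (TNUMBER));
+  return 0;
+}
+```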
+
+The `eval_apply` function has been a puzzle to me for a long time, to the
+point that I started to look for any possibility of making it shorter,
+simpler, or anything else that could help me understand it. In that process I
+found some very specific cases in that `goto` soup that mentioned `pmatch`.
+These were the result of a hack that avoided a hygiene issue in the `pmatch`
+macro. I changed the variable names in the macro and removed those special
+cases, slightly reducing the complexity of the `eval_apply` function and
+bringing attention to the lack of hygiene in our macro expander, which, by the
+way, is also intertwined with the `eval_apply` function.
+
+Well, and I wrote the benchmark tool, which happens to be very useful as a
+test for all these small changes, too.
+
+You can find all the changes in the `faster` branch of the following
+repo[^regret]:
+
+<http://github.com/ekaitz-zarraga/mes>
+
+[^regret]: What you cannot see is all the changes I tried and regretted
+    later.
+
+### Next steps
+
+None of the work done really tackles the root issue, but hacking around is
+important to get familiar with the code, and I'm the kind of person who
+requires quite a bit of context before getting into action.
+
+I think now I'm ready for the real deal: **the evaluator swallows most of the
+time, and we need to change it**. We just need a plan for it.
+
+First I have to prepare for the big change. Removing the macro expansion from
+the famous `eval_apply` function and reducing it to a plain Scheme evaluator
+(renamed to `eval_core`) that can only work with the basic forms and the
+builtins would help me on two fronts: working on the macro expansion alone,
+opening the door to writing a fully hygienic macro expander from scratch; and
+reducing the `eval_core` function even more so it can be replaced later.
+
+As I write these lines, I have just finished the removal. I can `git show` and
+see all the macro-expansion-related functionality in red.
+
+Now I can get to work. I have to write a macro expander. I've yet to decide
+whether I should write it first in Scheme (without macros) and translate it
+later or write it directly in C. In either case, the macro expansion step
+would be separated from the evaluation.
+
+The separation is not absolute, because the macro expander would make use of
+the brand new `eval_core` function to run the macros, but at least we would
+have the two concerns clearly split in the code. Consider this step zero.
+
+With the split done, it would be time to start thinking about the compilation
+and evaluation. At first I didn't know very well what to do, because if the
+evaluator works on bytecode, how would I expand the macros?
+
+I gave this too much thought, but it wasn't that difficult, really. If the
+macro expander is not only a macro expander but also a compiler, the bodies of
+the macros can be compiled as they come. The macro expansion would still use
+the `eval_core` function, just as it did before, but now it would eval
+bytecode instead of Scheme code.
+
+What's more, we could compile all the Scheme code as we read it for macro
+expansion, so we could use helper functions from the current file in the macro
+expansion step, and we would traverse the tree only once.
+
+Time for writing a macro expander, then.
+
+Another thing I never did before. I'm sure it will be a lot of fun.