Title: Understanding GNU Mes' performance
Date: 2025-06-02
Category:
Tags: GNU Mes interpreter speedup
Slug: fasterMes1
Lang: en
Summary: How fast is GNU Mes' scheme interpreter really?

Before[^heh] starting our work on [the project][project] we should, first, make sure that our hypothesis is true and GNU Mes could be faster and, second, try to understand where we could improve the execution speed.

[^heh]: Hypothetically speaking, because the work has already started. But ssssh...

[project]: /tag/gnu-mes-interpreter-speedup.html

### The benchmarking tool

In order to check how fast GNU Mes is, I created a very simple benchmarking tool. The tool just runs GNU Mes under `time` (the command, not the shell built-in) and writes a report to a bunch of text files. Of course, `time` is not a great way to benchmark, but my hypothesis was that `mes` (remember, the Scheme interpreter in GNU Mes) could be way faster than it is, so, if that's true, `time` should be more than enough for the comparison.

This is done in the `simple.make` makefile, which has been revived to serve this development period. The `simple.make` makefile is important because normally GNU Mes is built from scratch, including `meslibc` (remember, the C standard library in GNU Mes), which takes a long time. The `simple.make` file builds just `mes`, the interpreter, with GCC or M2-Planet, so the iteration time during development is acceptable: it's almost instantaneous.

So, in summary, this:

``` bash
make -f simple.make benchmark
```

would run the benchmark for the default `mes`, the one built by GCC (which is, of course, built first if needed). There is also a `benchmark-m2` target for the `mes` built by M2-Planet.

At this very moment the benchmark runs three cases:

- `fib`: A simple Scheme program that finds the 35th element of the Fibonacci sequence.
- `mescc-hello`: A `mescc` call that compiles a simple hello-world program written in C.
- `mescc-mes`: A `mescc` call that compiles `mes`'s own source code.

These three cases cover very basic operations but tell us whether our changes are relevant to `mes`'s main job: compiling some C code. They also give us some interesting feedback about evaluation speed via the `fib` case, especially in a very recursive program. The `mescc-hello` case is a minimal test that shows us the startup speed of the `mescc` compiler.

However, that alone doesn't give us a lot of information, because we don't have anything to compare against. The Scheme interpreter in GNU Mes is compatible with Guile, meaning it could be replaced with Guile at any moment (maybe not the other way around). That makes Guile our ideal candidate for comparison.

#### The data

The following table shows the wall-time results of all the benchmarks, written in `Average ±StdDev` form. Each benchmark is run 10 times on my laptop (*i7-10510U*).

|               |      mes-gcc |       mes-m2 |       guile |      guile† |     guile†‡ |
|:--------------|-------------:|-------------:|------------:|------------:|------------:|
| Fib           |  21.39 ±0.66 |  49.13 ±0.11 |  0.27 ±0.00 |  3.57 ±0.27 |  8.67 ±0.38 |
| Mescc-hello   |   1.38 ±0.04 |   2.69 ±0.01 |  0.41 ±0.01 |  0.45 ±0.03 |  1.03 ±0.04 |
| Mescc-mes     | 103.39 ±3.03 | 214.88 ±0.34 | 17.03 ±0.30 | 18.70 ±0.95 | 27.34 ±1.09 |

†: With GUILE_AUTO_COMPILE=0

‡: With GUILE_JIT_THRESHOLD=-1
From left to right: _mes-gcc_ means `mes` built with default GCC (including optimizations), _mes-m2_ means `mes` built with M2-Planet, and _guile_ means running `guile` (`v3.0.9`) instead of `mes`.

The first conclusion is pretty obvious:

> Guile is way faster than Mes when compiling C code, even with all the
> optimizations disabled.

##### Unfair!

But that conclusion may look unfair to most of you, especially because Guile is a pretty large and powerful Scheme implementation, while Mes is a simple Scheme interpreter designed to be as minimal as possible. Chibi-Scheme is a very interesting small Scheme interpreter that happens to be somewhat slow compared with other major Scheme implementations. In terms of lines of code, `chibi-scheme` is around twice the size of `mes`, which is still a manageable size. Chibi-Scheme is not compatible with Mes the way Guile is, so we can only run the `fib` benchmark with it. The following table compares the best Mes scenario against Chibi-Scheme.

|                  |         Mes | Chibi-Scheme |
|:-----------------|------------:|-------------:|
| Lines of Code†   |        5838 |        13135 |
| Fib (s)‡         | 21.39 ±0.66 |   1.77 ±0.03 |

†: Only counts the lines in .c files

‡: Both built with GCC
That's quite an improvement for the size!

#### Conclusion from the benchmark

Before drawing any conclusion, we should make sure our assumption is true and the differences are large enough to be captured by the `time` command. Looking at the orders of magnitude we are dealing with here, I would say **the data we obtained is a valid reference point**, as the differences are large enough to be properly captured by `time` and our goal is a noticeable performance improvement.

From there, we understand the obvious: Guile is way faster than Mes, even with compilation disabled, and by a huge margin. Chibi-Scheme is also way faster, but not as fast as Guile with all optimizations enabled.

When it comes to the compiler used to build Mes: compiled with M2-Planet, `mes` is twice as slow as with GCC. This is a very relevant detail to keep in mind during development. GCC's optimizations account for a huge performance improvement in `mes`, and we cannot rely on them during the bootstrapping process.

### Profiling

Still, we are far from justifying a full rewrite of the evaluator, and even further from rewriting `mes` as a bytecode interpreter, as proposed in the project presentation. That's why I decided to run a profiler (again, the simplest one possible) to get some insights from the execution of the program.

I ran `gprof` on `mes` for both the `fib` and `mescc-mes` cases[^benchmark-useful] and I summarize the output of the flat profile below, right after a little reminder of what all the columns in the table mean. To clarify: `gprof` requires the program to be built by GCC with the `-pg` flag, and we don't have profilers for the program built with M2-Planet, but I think the insights obtained here can be extrapolated to `mes-m2` in most cases. You'll see why in a minute.

[^benchmark-useful]: See how the benchmark is useful?

###### Legend
% time
: the percentage of the total running time of the program used by this function.

cumulative seconds
: a running sum of the number of seconds accounted for by this function and those listed above it.

self seconds
: the number of seconds accounted for by this function alone. This is the major sort for this listing.

calls
: the number of times this function was invoked, if this function is profiled, else blank.

self ms/call
: the average number of milliseconds spent in this function per call, if this function is profiled, else blank.

total ms/call
: the average number of milliseconds spent in this function and its descendents per call, if this function is profiled, else blank.

name
: the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed.
###### Gprof flat-profile for `fib`

| %time | cumulative seconds | self seconds  | calls     | self ms/call | total ms/call | name                 |
|------:|-------------------:|--------------:|----------:|-------------:|--------------:|:---------------------|
| 20.06 | 4.01               | 4.01          | 37        | 108.38       | 416.96        | `eval_apply`         |
| 11.26 | 6.26               | 2.25          | 600202232 | 0.00         | 0.00          | `gc_push_frame`      |
| 7.10  | 7.68               | 1.42          | 600202232 | 0.00         | 0.00          | `gc_peek_frame`      |
| 6.83  | 9.05               | 1.37          | 481504722 | 0.00         | 0.00          | `make_cell`          |
| 6.03  | 10.25              | 1.21          | 600200793 | 0.00         | 0.00          | `push_cc`            |
| 5.48  | 11.35              | 1.10          | 300124024 | 0.00         | 0.00          | `struct_ref_`        |
| 4.70  | 12.29              | 0.94          |           |              |               | `_init`              |
| 3.70  | 13.03              | 0.74          | 269868352 | 0.00         | 0.00          | `make_value_cell`    |
| 2.35  | 13.50              | 0.47          | 600202232 | 0.00         | 0.00          | `gc_pop_frame`       |
| 2.15  | 13.93              | 0.43          | 406099172 | 0.00         | 0.00          | `cons`               |
| 2.15  | 14.36              | 0.43          | 74983231  | 0.00         | 0.00          | `apply_builtin`      |
| 2.00  | 14.76              | 0.40          | 300279513 | 0.00         | 0.00          | `cell_ref`           |
| 1.95  | 15.15              | 0.39          | 300124024 | 0.00         | 0.00          | `assert_struct`      |
| 1.85  | 15.52              | 0.37          | 135158343 | 0.00         | 0.00          | `length__`           |
| 1.60  | 15.84              | 0.32          | 29862855  | 0.00         | 0.00          | `minus`              |
| 1.38  | 16.11              | 0.28          | 300124024 | 0.00         | 0.00          | `assert_range`       |
| 1.38  | 16.39              | 0.28          | 75315259  | 0.00         | 0.00          | `assq`               |
| 1.18  | 16.62              | 0.24          | 75280752  | 0.00         | 0.00          | `lookup_binding`     |
| 1.08  | 16.84              | 0.22          | 75280689  | 0.00         | 0.00          | `make_binding_`      |
| 1.05  | 17.05              | 0.21          | 269866060 | 0.00         | 0.00          | `make_number`        |
| 1.05  | 17.26              | 0.21          | 173       | 1.21         | 1.21          | `macro_set_x`        |
| 1.00  | 17.46              | 0.20          | 135178741 | 0.00         | 0.00          | `gc_check`           |
| 0.90  | 17.64              | 0.18          | 30093052  | 0.00         | 0.00          | `pairlis`            |
| 0.85  | 17.81              | 0.17          | 105069896 | 0.00         | 0.00          | `check_formals`      |
| 0.83  | 17.97              | 0.17          | 29861090  | 0.00         | 0.00          | `less_p`             |
| ...   | ...                | ...           | ...       | ...          | ...           | ...                  |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 0.00          | `open_boot`          |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 3.09          | `read_boot`          |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 0.00          | `reader_read_octal`  |
| 0.00  | 19.99              | 0.00          | 1         | 0.00         | 0.00          | `try_open_boot`      |

###### Gprof flat-profile for `mescc-mes`

| %time | cumulative seconds | self seconds  | calls      | self s/call | total s/call | name                 |
|------:|-------------------:|--------------:|-----------:|------------:|-------------:|:---------------------|
| 31.64 | 58.01              | 58.01         | 81         | 0.72        | 1.91         | `eval_apply`         |
| 7.53  | 71.82              | 13.81         | 428598051  | 0.00        | 0.00         | `assq`               |
| 6.11  | 83.03              | 11.21         | 2789709236 | 0.00        | 0.00         | `make_cell`          |
| 5.96  | 93.95              | 10.92         | 2109168010 | 0.00        | 0.00         | `gc_push_frame`      |
| 4.88  | 102.89             | 8.94          | 607787117  | 0.00        | 0.00         | `length__`           |
| 4.11  | 110.43             | 7.53          | 2109168010 | 0.00        | 0.00         | `gc_peek_frame`      |
| 3.61  | 117.04             | 6.61          | 891765473  | 0.00        | 0.00         | `struct_ref_`        |
| 3.39  | 123.25             | 6.22          | 2109164356 | 0.00        | 0.00         | `push_cc`            |
| 3.39  | 129.46             | 6.21          |            |             |              | `_init`              |
| 2.27  | 133.63             | 4.17          | 1742421    | 0.00        | 0.00         | `assoc_string`       |
| 1.96  | 137.22             | 3.59          | 2235192336 | 0.00        | 0.00         | `cons`               |
| 1.79  | 140.50             | 3.28          | 215232553  | 0.00        | 0.00         | `apply_builtin`      |
| 1.54  | 143.33             | 2.83          | 679409558  | 0.00        | 0.00         | `make_value_cell`    |
| 1.49  | 146.06             | 2.73          | 200300883  | 0.00        | 0.00         | `pairlis`            |
| 1.18  | 148.22             | 2.16          | 2109168010 | 0.00        | 0.00         | `gc_pop_frame`       |
| 1.13  | 150.28             | 2.07          | 925236148  | 0.00        | 0.00         | `cell_ref`           |
| 1.10  | 152.30             | 2.02          | 196474232  | 0.00        | 0.00         | `gc_copy`            |
| 1.07  | 154.26             | 1.96          | 419658520  | 0.00        | 0.00         | `lookup_value`       |
| 1.04  | 156.17             | 1.91          | 411189650  | 0.00        | 0.00         | `check_formals`      |
| 1.02  | 158.03             | 1.87          | 423486766  | 0.00        | 0.00         | `lookup_binding`     |
| 0.98  | 159.82             | 1.79          | 183        | 0.01        | 0.01         | `gc_cellcpy`         |
| 0.91  | 161.50             | 1.68          | 891765473  | 0.00        | 0.00         | `assert_struct`      |
| 0.86  | 163.08             | 1.58          | 423486608  | 0.00        | 0.00         | `make_binding_`      |
| 0.78  | 164.51             | 1.43          | 183        | 0.01        | 0.03         | `gc_loop`            |
| 0.78  | 165.94             | 1.43          | 635403488  | 0.00        | 0.00         | `gc_check`           |
| ...   | ...                | ...           | ...        | ...         | ...          | ...                  |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `open_boot`          |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `open_output_file`   |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `read_boot`          |
| 0.00  | 183.37             | 0.00          | 1          | 0.00        | 0.00         | `reader_read_octal`  |

#### The profiling data in context

The functions at the top are mostly the same in both cases: `eval_apply`, `gc_push_frame`, `push_cc`, `gc_peek_frame` and `make_cell`. They represent around 50% of the execution time (51.21% in `fib` and 51.28% in `mescc-mes`) and they are closely related. Before trying to explain why, we could spend a minute checking how they eat that amount of time. `eval_apply` is executed only a few times but takes a long time to run, while the rest are very fast functions (0.00 ms!) that are executed millions of times per second. The reason is that `eval_apply` is a long-running loop that is calling them. It's calling them **all the time**.

##### Deep in `eval_apply`

As I already explained in earlier episodes, `mes` is a tree-walk interpreter, but that doesn't say much about how things work. The current design of `mes` is very interesting. I didn't ask the author about it, but it feels like it was written in Scheme first and translated to C later. I say this because the `eval_apply` function, the core of the program, which expands the macros, traverses the tree, and does all the `eval`ing and the `apply`ing, is written as it would be written in Scheme: abusing the tail-call optimization[^tco] with mutually recursive functions.

[^tco]: Way before all the Scheme, compilers, interpreters and Guix, [I wrote a post about it]((unknown)/call-me-maybe.md).

That's pretty understandable: if you are writing a Scheme interpreter you'd like to write Scheme, so you Schemeify your C as much as you can.
Also, it happens that Scheme can be very elegantly described in Scheme, as the [GNU Mes manual mentions][mes-manual], and most of the literature about Scheme interpreters uses Scheme. This tradition goes back to SICP, and even further, to the invention (discovery?) of LISP itself.

[mes-manual]: https://www.gnu.org/software/mes/manual/mes.html#LISP-as-Maxwell_0027s-Equations-of-Software

But C is not very compatible with that programming style: the C stack would overflow due to the lack of tail-call optimization. GNU Mes overcomes that limitation by manually manipulating a stack, allocated by its memory manager, and emulating function calls with `goto`. Arguments are passed to these "functions" through some global variables that are understood as "registers" in this "virtual machine". Consider this piece of code:

``` clike
evlis:
  if (R1 == cell_nil)
    goto vm_return;
  if (R1->type != TPAIR)
    goto eval;
  push_cc (R1->car, R1, R0, cell_vm_evlis2);
  goto eval;
evlis2:
  push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
  goto evlis;
evlis3:
  R1 = cons (R2, R1);
  goto vm_return;
```

The "registers" are those `R1`, `R0` and so on, and we don't really need to know what they mean to understand how this whole thing works. Imagine we enter `evlis` with an `R1` that is a list: it's not `cell_nil` and its type is `TPAIR`, so we skip both `if` statements and reach the `push_cc`. This call pushes our current position in the code, `cell_vm_evlis2`, to the stack and then jumps to the `eval` "function". `eval` eventually ends by doing a `goto vm_return`, which pops the current frame and jumps to the top of the `eval_apply` function.

``` clike
vm_return:
  x = R1; // This saves the return value of the "call"
  gc_pop_frame ();
  R1 = x; // and puts it back, after it has been overwritten by the pop
  goto eval_apply;
```

And what do we find there? A huge `if` cascade that brings us back to the label where we left off:

``` clike
eval_apply:
  if (R3 == cell_vm_evlis2)
    goto evlis2; // Bingo!
  else if (R3 == cell_vm_evlis3)
    goto evlis3;
  else if (R3 == cell_vm_eval_check_func)
    goto eval_check_func;
  else if (R3 == cell_vm_eval2)
    goto eval2;
  // There are quite a few more below.
```

Didn't we just emulate a "function call"? I'm sure you can continue the execution from `evlis2` by yourselves, just with the code I pasted here, and understand how a list is evaluated in `mes`.

It might feel crazy, but that's because the goal of this approach was left aside for a minute during the explanation. I didn't mention that the `eval` label we jumped to might jump to many other places, including `apply`, which would jump to `eval` again, which, as we mentioned before, would blow up our stack if we were doing this with C functions. Doing it this (weird) way lets you control the stack and manually apply the tail-call optimization, i.e. just use `goto` without the `push_cc`.

This whole thing is looping all the time, doing `goto eval_apply` over and over again, and that's why the actual `eval_apply` function takes that long to run: it's the main loop of the interpreter. Now we are in a great position to understand why those other functions were called that many times. `push_cc` is the easiest one, as it is what prepares a call to one of those internal "functions" that require pushing to the stack. `gc_push_frame` is called by `push_cc` and is the function that actually does the pushing. `gc_peek_frame` is also called many times, but we don't see why in the listings I shared: `gc_peek_frame` is called by `gc_pop_frame`, which is exactly what the `vm_return` block does. The `make_cell` case is slightly different but also related. It's called mostly from `cons`, and its sibling `make_value_cell` is also called many times. These are the functions that allocate Scheme values.
Of course, we cannot simply eliminate the calls to these functions, because a Scheme interpreter always needs to create values, but my bet is that these functions, especially `cons`, are called more often than strictly needed because of the evaluation strategy I just described.

### Conclusions

There are many conclusions to draw from all this data and the design of the evaluator. The first and most obvious one is that the project proposal makes sense and **we could make `mes` way faster**, as Guile is, **even keeping the interpreter small**, as Chibi-Scheme shows. Of course, I don't expect to obtain a 10x performance improvement in the first iteration, but I believe we can do a lot by *"starting with the simplest thing that could possibly work"*[^janneke] if we keep a non-pessimization philosophy.

[^janneke]: I'm quoting Janneke, the author of GNU Mes, who has a very pragmatic approach to programming.

The question now is: what is *that* thing? I proposed to turn `mes` into a **bytecode interpreter**. This alone might sound like overkill, but of all the other options I considered it sounded like the simplest, and there is evidence that it could possibly work: Guile and Chibi-Scheme are bytecode compiler-interpreters, and, as the benchmarks showed, they can be very fast compared to GNU Mes' Scheme interpreter. Other possibilities, like the tree-traversal memoization applied in older versions of Guile, didn't feel like they would have such a profound impact on the evaluation process, and I felt they were going to be harder for me to implement.

But still, there's a lot more we can obtain from the data. Comparing `mes-gcc` and `mes-m2`, we could see that small tricks will not be enough if we want to improve the speed of the bootstrapping process.
It's not that hard to write some code in a different way so GCC does its magic and optimizes for a 10% improvement, but when it comes to M2-Planet we are alone: there's no compiler optimization to help us, and **we must be smart and use optimizations that go deeper, those that happen in the CPU**. Bytecode could give us some of the *data locality* we need.

We must remember the profiler was run on `mes-gcc`, but the information we obtained from it can be extrapolated to the `mes-m2` case: **the current implementation of `mes` is inefficient by design, because it was designed to be simple, not fast**. All the `push`ing and `pop`ing of Scheme values on that internal managed stack happens too often. An evaluator that runs an already processed program is, potentially, way faster than one that has to traverse the tree over and over again while it churns through all this `goto` soup, which is also pretty hard to read.

More specifically, a bytecode interpreter would not only reduce the execution time of the functions at the top of the profiler's list, but also of the many related ones in the very long tail of function calls that I didn't bother to paste in this post. **This cascade effect could add up to a huge performance improvement**, beyond the 50% of execution time we found in those 5 functions at the top. Everything that can be processed ahead of time (macro expansion, lambda hoisting, argument checking, compilation...) should be done first, and only once if possible, so the evaluator can focus on its main goal: running the code, as fast as possible.

What I didn't mention, though, are all the functions that don't look relevant in the profiling: they are just good.
The garbage collector and allocator in `mes` simply work and go unnoticed in the profiling, with around 1% of the execution time spent in the `gc_copy` function and a very fast allocation that only becomes noticeable when a huge number (hundreds of millions) of values are allocated.

When it comes to readability and auditability, one of the goals of the project, I personally consider the `eval_apply` function very hard to follow. There are still many tricks that I don't fully understand, and I have already spent several weeks reading and hacking on the interpreter, which isn't that large after all. Rewriting it is a chance to make it easier to read.

### Work done

Understanding the interpreter internals isn't an immediate task, so in order to get used to it I decided to go for some low-hanging fruit first.

First, I moved some operations to builtins and even created some new ones. The performance gain I saw was simply **zero**, probably because the benchmark is not designed to detect small improvements and because the main performance bottleneck is in the `eval_apply` function.

The same thing happened when I updated M2-Planet and added `switch` statements to replace chained `if`s, which were there because previous versions of M2-Planet didn't support `switch` statements. The change affected `mes-gcc`'s performance slightly, but not enough to consider it a big win, while `mes-m2` remained mostly the same, as M2-Planet doesn't apply any kind of optimization to `switch` statements. I still consider this change important for readability, so much so that it helped me detect a bug in the string reader (the newline escaping was broken), which I also fixed.

The `eval_apply` function has been a puzzle to me for a long time, to the point that I started to look for any possibility to make it shorter, simpler, or anything that could help me understand it. In that process I found some very specific cases in that `goto` soup that mentioned `pmatch`.
These were the result of a hack that avoided a hygiene issue in the `pmatch` macro. I changed the variable names in the macro and removed those special cases, slightly reducing the complexity of the `eval_apply` function and drawing attention to the lack of hygiene in our macro expander, which, by the way, is also intertwined in the `eval_apply` function.

Well, and I wrote the benchmark tool, which happens to be very useful as a test for all these small changes, too.

You can find all the changes in the `faster` branch of the following repo[^regret]:

[^regret]: What you cannot see are all the changes I tried and regretted later.

### Next steps

None of the work done really tackles the root issue, but hacking around is important to familiarize yourself with the code, and I'm the kind of person who requires quite a bit of context before getting into action. I think now I'm ready for the real deal: **the evaluator swallows most of the time, and we need to change it**. We just need a plan for it.

First, I have to prepare for the big change. Removing the macro expansion from the famous `eval_apply` function and reducing it to a plain Scheme evaluator (renamed to `eval_core`) that can only work with the basic forms and the builtins would help me on two fronts: working on the macro expansion alone, opening the door to writing a fully hygienic macro expander from scratch; and reducing the `eval_core` function even more so it can be replaced later.

As I write these lines, I just finished the removal. I can `git show` and see all the macro-expansion-related functionality in red. Now I can get to work. I have to write a macro expander. I have yet to decide whether I should write it first in Scheme (without macros) and translate it later, or write it directly in C. In any case, the macro expansion step will be separated from the evaluation.
The separation is not absolute, because the macro expander would make use of the brand-new `eval_core` function to run the macros, but at least the two concerns would be clearly split in the code. Consider this step zero.

With the split done, it would be time to start thinking about compilation and evaluation. At first I didn't know very well what to do, because if the evaluator works on bytecode, how would I expand the macros? I gave this too much thought, but it wasn't that difficult, really. If the macro expander is not only a macro expander but also a compiler, the bodies of the macros can be compiled as they come. The macro expansion would still work through the `eval_core` function, the same as it did, but now it wouldn't eval Scheme code, but bytecode. Even more: we could compile all the Scheme code as we read it for macro expansion, so we could use helper functions from the current file in the macro expansion step, and we would traverse the tree only once.

Time to write a macro expander, then. Another thing I have never done before. I'm sure it will be a lot of fun.