author | Ekaitz Zarraga <ekaitz@elenq.tech> | 2025-06-03 00:48:17 +0200
Title: Understanding GNU Mes' performance
Date: 2025-06-02
Category:
Tags: GNU Mes interpreter speedup
Slug: fasterMes1
Lang: en
Summary: How fast is GNU Mes' Scheme interpreter really?
Status: draft

Before[^heh] starting our work on [the project][project] we should, first,
make sure our hypothesis is true and GNU Mes could indeed be faster, and,
second, try to understand where we could improve the execution speed.

[^heh]: Hypothetically speaking, because the work has already started. But
    ssssh...

[project]: /tag/gnu-mes-interpreter-speedup.html

### The benchmarking tool

In order to check how fast GNU Mes is, I created a very simple benchmarking
tool. The tool just runs GNU Mes under `time` (the command, not the shell
built-in) and creates a report in a bunch of text files. Of course, `time` is
not a great way to benchmark, but my hypothesis was that `mes` (remember, the
Scheme interpreter in GNU Mes) could be way faster than it is, so, if that's
true, `time` should be more than enough for the comparison.

This is done in the `simple.make` makefile, which has been revived to serve
this development period. The `simple.make` makefile is important because
normally GNU Mes is built from scratch, including the `meslibc` (remember, the
C standard library in GNU Mes), which takes a long time. The `simple.make`
file builds just `mes`, the interpreter, with GCC or M2-Planet, so the
iteration time during development is acceptable: almost instantaneous.

So, in summary, this:

``` bash
make -f simple.make benchmark
```

runs the benchmark for the default `mes`, the one built by GCC (which is of
course built first if needed, too).
But we also have `benchmark-m2`, for the `mes` built by M2-Planet.

The benchmark at this very moment runs three cases:

- `fib`: A simple Scheme program that finds the 35th element of the Fibonacci
  sequence.
- `mescc-hello`: A `mescc` call that compiles a simple hello-world program
  written in C.
- `mescc-mes`: A `mescc` call that compiles `mes`'s own source code.

These three cases cover very basic operations but tell us whether our changes
are relevant to `mes`'s main job: compiling some C code. The `fib` check also
gives us some interesting feedback about the evaluation speed, especially for
a very recursive program. `mescc-hello` is a minimal test that shows us the
startup speed of the `mescc` compiler.

However, that alone doesn't give us a lot of information, because we don't
have anything to compare with. The Scheme interpreter in GNU Mes is compatible
with Guile, meaning that it could be replaced with Guile at any moment (maybe
not the other way around). There we have our perfect candidate for comparison.

#### The data

The following table shows the wall-time results of all the benchmarks,
written in `Average ±StdDev` form (in seconds). Each benchmark is run 10
times on my laptop (*i7-10510U*).
| | mes-gcc | mes-m2 | guile | guile<sup>†</sup> | guile<sup>†‡</sup> |
|:--------------|-------------:|-------------:|------------:|------------------:|----------------------:|
| Fib | 21.39 ±0.66 | 49.13 ±0.11 | 0.27 ±0.00 | 3.57 ±0.27 | 8.67 ±0.38 |
| Mescc-hello | 1.38 ±0.04 | 2.69 ±0.01 | 0.41 ±0.01 | 0.45 ±0.03 | 1.03 ±0.04 |
| Mescc-mes | 103.39 ±3.03 | 214.88 ±0.34 | 17.03 ±0.30 | 18.70 ±0.95 | 27.34 ±1.09 |

<dl>
<dt>†</dt> <dd>With <code>GUILE_AUTO_COMPILE=0</code></dd>
<dt>‡</dt> <dd>With <code>GUILE_JIT_THRESHOLD=-1</code></dd>
</dl>

From left to right: _mes-gcc_ means `mes` built with default GCC (including
optimizations), _mes-m2_ means `mes` built with M2-Planet, and _guile_ means
running `guile` (`v3.0.9`) instead of `mes`.

The first conclusion is pretty obvious:

> Guile is way faster than Mes when compiling C code, even with all the
> optimizations disabled.

##### Unfair!

But that conclusion may look unfair to most of you, especially because Guile
is a pretty large and powerful Scheme implementation, while Mes is a simple
Scheme interpreter designed to be as minimal as possible.

Chibi-Scheme is a very interesting small Scheme interpreter that happens to
be somewhat slow compared with other major Scheme implementations. In terms
of lines of code, `chibi-scheme` is around twice the size of `mes`, which is
still a manageable size. Chibi-Scheme is not compatible with Mes the way
Guile is, so we can only run the `fib` benchmark with it. The following table
shows a comparison between the best Mes scenario and Chibi-Scheme.

| | Mes | Chibi-Scheme |
|:---------------------------------|------------:|-------------:|
| Lines of Code<sup>†</sup> | 5838 | 13135 |
| Fib<sup>‡</sup> (s) | 21.39 ±0.66 | 1.77 ±0.03 |

<dl>
<dt>†</dt> <dd>Only counts the lines in <code>.c</code> files</dd>
<dt>‡</dt> <dd>Both built with GCC</dd>
</dl>

That's quite an improvement for the size!
#### Conclusion from the benchmark

Before drawing any conclusions, we should make sure our assumption is true
and the differences are large enough to be captured by the `time` command.
Looking at the orders of magnitude we are dealing with here, I would say
**the data we obtained is a valid reference point**, as the differences are
large enough to be properly captured by `time`, and our goal is a noticeable
performance improvement anyway.

From there, we understand the obvious: Guile is way faster than Mes, even
with compilation disabled, and by a huge margin. Chibi-Scheme is also way
faster, but not as fast as Guile with all optimizations enabled.

When it comes to the compiler used to build Mes: compiled with M2-Planet,
`mes` is twice as slow as when compiled with GCC. This is a very relevant
detail to keep in mind during development. GCC's optimizations account for a
huge performance improvement in `mes`, and we cannot rely on them during the
bootstrapping process.


### Profiling

Still, we are far from justifying a full rewrite of the evaluator, and even
further from rewriting `mes` as a bytecode interpreter, as proposed in the
project presentation. That's why I decided to run a profiler, again the
simplest one possible, so I could get some insights from the execution of the
program.

I ran `gprof` on `mes` for both the `fib` and `mescc-mes`
cases[^benchmark-useful] and I summarize the output of the flat profile
below, right after a little reminder of what all the columns in the tables
mean.

Just to clarify: `gprof` requires the program to be built by GCC with the
`-pg` flag, so we don't have profiles for the program built with M2-Planet,
but I think the insights obtained here can be extrapolated to `mes-m2` in
most cases. You'll see why in a minute.

[^benchmark-useful]: See how the benchmark is useful?
###### Legend

<dl>
<dt>% time</dt>
<dd>the percentage of the total running time of the program
used by this function.</dd>

<dt>cumulative seconds</dt>
<dd>a running sum of the number of seconds accounted for by this
function and those listed above it.</dd>

<dt>self seconds</dt>
<dd>the number of seconds accounted for by this function alone. This is
the major sort for this listing.</dd>

<dt>calls</dt>
<dd>the number of times this function was invoked, if this function is
profiled, else blank.</dd>

<dt>self ms/call</dt>
<dd>the average number of milliseconds spent in this function per call,
if this function is profiled, else blank.</dd>

<dt>total ms/call</dt>
<dd>the average number of milliseconds spent in this function and its
descendents per call, if this function is profiled, else blank.</dd>

<dt>name</dt>
<dd>the name of the function. This is the minor sort for this listing.
The index shows the location of the function in the gprof listing. If
the index is in parenthesis it shows where it would appear in the gprof
listing if it were to be printed.</dd>
</dl>

###### Gprof flat-profile for `fib`

| %time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
|------:|-------------------:|--------------:|----------:|-------------:|--------------:|:---------------------|
| 20.06 | 4.01 | 4.01 | 37 | 108.38 | 416.96 | `eval_apply` |
| 11.26 | 6.26 | 2.25 | 600202232 | 0.00 | 0.00 | `gc_push_frame` |
| 7.10 | 7.68 | 1.42 | 600202232 | 0.00 | 0.00 | `gc_peek_frame` |
| 6.83 | 9.05 | 1.37 | 481504722 | 0.00 | 0.00 | `make_cell` |
| 6.03 | 10.25 | 1.21 | 600200793 | 0.00 | 0.00 | `push_cc` |
| 5.48 | 11.35 | 1.10 | 300124024 | 0.00 | 0.00 | `struct_ref_` |
| 4.70 | 12.29 | 0.94 | | | | `_init` |
| 3.70 | 13.03 | 0.74 | 269868352 | 0.00 | 0.00 | `make_value_cell` |
| 2.35 | 13.50 | 0.47 | 600202232 | 0.00 | 0.00 | `gc_pop_frame` |
| 2.15 | 13.93 | 0.43 | 406099172 | 0.00 | 0.00 | `cons` |
| 2.15 | 14.36 | 0.43 | 74983231 | 0.00 | 0.00 | `apply_builtin` |
| 2.00 | 14.76 | 0.40 | 300279513 | 0.00 | 0.00 | `cell_ref` |
| 1.95 | 15.15 | 0.39 | 300124024 | 0.00 | 0.00 | `assert_struct` |
| 1.85 | 15.52 | 0.37 | 135158343 | 0.00 | 0.00 | `length__` |
| 1.60 | 15.84 | 0.32 | 29862855 | 0.00 | 0.00 | `minus` |
| 1.38 | 16.11 | 0.28 | 300124024 | 0.00 | 0.00 | `assert_range` |
| 1.38 | 16.39 | 0.28 | 75315259 | 0.00 | 0.00 | `assq` |
| 1.18 | 16.62 | 0.24 | 75280752 | 0.00 | 0.00 | `lookup_binding` |
| 1.08 | 16.84 | 0.22 | 75280689 | 0.00 | 0.00 | `make_binding_` |
| 1.05 | 17.05 | 0.21 | 269866060 | 0.00 | 0.00 | `make_number` |
| 1.05 | 17.26 | 0.21 | 173 | 1.21 | 1.21 | `macro_set_x` |
| 1.00 | 17.46 | 0.20 | 135178741 | 0.00 | 0.00 | `gc_check` |
| 0.90 | 17.64 | 0.18 | 30093052 | 0.00 | 0.00 | `pairlis` |
| 0.85 | 17.81 | 0.17 | 105069896 | 0.00 | 0.00 | `check_formals` |
| 0.83 | 17.97 | 0.17 | 29861090 | 0.00 | 0.00 | `less_p` |
| ... | ... | ... | ... | ... | ... | ... |
| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `open_boot` |
| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 3.09 | `read_boot` |
| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `reader_read_octal` |
| 0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | `try_open_boot` |

###### Gprof flat-profile for `mescc-mes`

| %time | cumulative seconds | self seconds | calls | self s/call | total s/call | name |
|------:|-------------------:|--------------:|-----------:|------------:|-------------:|:---------------------|
| 31.64 | 58.01 | 58.01 | 81 | 0.72 | 1.91 | `eval_apply` |
| 7.53 | 71.82 | 13.81 | 428598051 | 0.00 | 0.00 | `assq` |
| 6.11 | 83.03 | 11.21 | 2789709236 | 0.00 | 0.00 | `make_cell` |
| 5.96 | 93.95 | 10.92 | 2109168010 | 0.00 | 0.00 | `gc_push_frame` |
| 4.88 | 102.89 | 8.94 | 607787117 | 0.00 | 0.00 | `length__` |
| 4.11 | 110.43 | 7.53 | 2109168010 | 0.00 | 0.00 | `gc_peek_frame` |
| 3.61 | 117.04 | 6.61 | 891765473 | 0.00 | 0.00 | `struct_ref_` |
| 3.39 | 123.25 | 6.22 | 2109164356 | 0.00 | 0.00 | `push_cc` |
| 3.39 | 129.46 | 6.21 | | | | `_init` |
| 2.27 | 133.63 | 4.17 | 1742421 | 0.00 | 0.00 | `assoc_string` |
| 1.96 | 137.22 | 3.59 | 2235192336 | 0.00 | 0.00 | `cons` |
| 1.79 | 140.50 | 3.28 | 215232553 | 0.00 | 0.00 | `apply_builtin` |
| 1.54 | 143.33 | 2.83 | 679409558 | 0.00 | 0.00 | `make_value_cell` |
| 1.49 | 146.06 | 2.73 | 200300883 | 0.00 | 0.00 | `pairlis` |
| 1.18 | 148.22 | 2.16 | 2109168010 | 0.00 | 0.00 | `gc_pop_frame` |
| 1.13 | 150.28 | 2.07 | 925236148 | 0.00 | 0.00 | `cell_ref` |
| 1.10 | 152.30 | 2.02 | 196474232 | 0.00 | 0.00 | `gc_copy` |
| 1.07 | 154.26 | 1.96 | 419658520 | 0.00 | 0.00 | `lookup_value` |
| 1.04 | 156.17 | 1.91 | 411189650 | 0.00 | 0.00 | `check_formals` |
| 1.02 | 158.03 | 1.87 | 423486766 | 0.00 | 0.00 | `lookup_binding` |
| 0.98 | 159.82 | 1.79 | 183 | 0.01 | 0.01 | `gc_cellcpy` |
| 0.91 | 161.50 | 1.68 | 891765473 | 0.00 | 0.00 | `assert_struct` |
| 0.86 | 163.08 | 1.58 | 423486608 | 0.00 | 0.00 | `make_binding_` |
| 0.78 | 164.51 | 1.43 | 183 | 0.01 | 0.03 | `gc_loop` |
| 0.78 | 165.94 | 1.43 | 635403488 | 0.00 | 0.00 | `gc_check` |
| ... | ... | ... | ... | ... | ... | ... |
| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `open_boot` |
| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `open_output_file` |
| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `read_boot` |
| 0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | `reader_read_octal` |

#### The profiling data in context

The functions at the top are mostly the same in both cases: `eval_apply`,
`gc_push_frame`, `push_cc`, `gc_peek_frame` and `make_cell`. They represent
around 50% of the execution time (51.21% in `fib` and 51.28% in `mescc-mes`)
and they are closely related.

Before trying to explain why, we could spend a minute checking how they eat
up that amount of time. `eval_apply` is executed only a few times but takes a
long time to run, while the rest are very fast functions (0.00 ms!) that are
executed millions of times per second.

The reason is that `eval_apply` is a long-running loop that is calling them.
It's calling them **all the time**.

##### Deep in `eval_apply`

As I already explained in earlier episodes, `mes` is a tree-walk interpreter,
but that doesn't say much about how things work.

The current design of `mes` is very interesting. I didn't ask the author
about it, but it feels like it was written in Scheme first and translated to
C later. I say this because the `eval_apply` function, the core of the
program, which expands the macros, traverses the tree, and does all the
`eval`ing and the `apply`ing, is written as it would be written in Scheme:
abusing tail-call optimization[^tco] with mutually recursive functions.

[^tco]: Way before all the Scheme, compilers, interpreters and Guix, [I
    wrote a post about it]((unknown)/call-me-maybe.md).
That's pretty understandable: if you are writing a Scheme interpreter you'd
like to write Scheme, so you Schemeify your C as much as you can. Also, it
happens that Scheme can be very elegantly described in Scheme, as the [GNU
Mes manual mentions][mes-manual], and most of the literature about Scheme
interpreters uses Scheme. This tradition goes back to SICP, and even further
to the invention (discovery?) of LISP itself.

[mes-manual]: https://www.gnu.org/software/mes/manual/mes.html#LISP-as-Maxwell_0027s-Equations-of-Software

But C is not very compatible with that programming style. The C stack would
overflow due to the lack of tail-call optimization.

GNU Mes overcomes that limitation by manually manipulating a stack, allocated
by its own memory manager, and emulating function calls with `goto`s. The
arguments to these "functions" are passed in global variables that play the
role of "registers" in this "virtual machine".

Consider this piece of code:

``` clike
evlis:
  if (R1 == cell_nil)
    goto vm_return;
  if (R1->type != TPAIR)
    goto eval;
  push_cc (R1->car, R1, R0, cell_vm_evlis2);
  goto eval;
evlis2:
  push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
  goto evlis;
evlis3:
  R1 = cons (R2, R1);
  goto vm_return;
```

The "registers" are those `R1`, `R0` and so on, and we don't really need to
know what they mean to understand how this whole thing works.

Imagine we enter `evlis` with an `R1` that is a list: it's not `cell_nil` and
its type is `TPAIR`. So we skip both `if` statements and we reach the
`push_cc` call. It pushes our current position in the code,
`cell_vm_evlis2`, to the stack and then jumps to the `eval` "function".
`eval` eventually ends with a `goto vm_return`, which pops the current frame
and jumps to the top of the `eval_apply` function.
``` clike
vm_return:
  x = R1; // This saves the return value of the "call"
  gc_pop_frame ();
  R1 = x; // and puts it back, after it has been overwritten by the pop
  goto eval_apply;
```

And, what do we find there?

A huge `if` cascade that brings us back to the label where we left off:

``` clike
eval_apply:
  if (R3 == cell_vm_evlis2)
    goto evlis2; // Bingo!
  else if (R3 == cell_vm_evlis3)
    goto evlis3;
  else if (R3 == cell_vm_eval_check_func)
    goto eval_check_func;
  else if (R3 == cell_vm_eval2)
    goto eval2;
  // There are quite a few more below.
```

Didn't we just emulate a "function call"?

I'm sure you can continue the execution from `evlis2` by yourselves, just
with the code I pasted here, and understand how a list is evaluated in `mes`.

It might feel crazy, but that's because the explanation left the goal of this
approach aside for a minute. I didn't mention that the `eval` label we jumped
to might jump to many other places, including `apply`, which would jump to
`eval` again, which, as we mentioned before, would blow up our stack if we
were doing this with plain C functions. Doing it this (weird) way lets you
control the stack and manually apply the tail-call optimization, i.e. just
use `goto` without the `push_cc`.

This whole thing is looping all the time, doing `goto eval_apply` over and
over again, and that's why the actual `eval_apply` function takes so long to
run. It's the main loop of the interpreter.

Now we are in a great position to understand why those other functions were
called so many times. `push_cc` is the easiest one, as it's what prepares a
call to one of those internal "functions" that require pushing to the stack.
`gc_push_frame` is called by `push_cc` and is the function that actually does
the pushing. `gc_peek_frame` is also called many times, but we don't see why
in the listings I shared.
What happens is that `gc_peek_frame` is called by `gc_pop_frame`, which is
exactly what the `vm_return` block does.

The `make_cell` case is slightly different but also related. It's called
mostly from `cons`, and its sibling `make_value_cell` is also called many
times. These are the functions that allocate Scheme values. Of course, we
cannot simply eliminate the calls to these functions, because a Scheme
interpreter will always need to create values, but my bet is that these
functions, especially `cons`, are called more often than strictly needed
because of the evaluation strategy I just described.

### Conclusions

There are many conclusions to draw from all this data and the design of the
evaluator.

The first and most obvious one is that it looks like the project proposal
makes sense and **we could make `mes` way faster**, as Guile is, **even while
keeping the interpreter small**, as Chibi-Scheme shows. Of course, I don't
expect to obtain a 10x performance improvement in the first iteration, but I
believe we can do a lot *"starting with the simplest thing that could
possibly work*[^janneke]" if we keep a non-pessimization philosophy.

[^janneke]: I'm quoting Janneke, the author of GNU Mes, who has a very
    pragmatic approach to programming.

The question now is: what is *that* thing?

I proposed to turn `mes` into a **bytecode interpreter**. This alone might
sound like overkill, but of all the options I considered it sounded like the
simplest, and there is evidence that it could possibly work: Guile and
Chibi-Scheme are bytecode compiler-interpreters, and thanks to the benchmarks
we saw they can be very fast compared to GNU Mes' Scheme interpreter.

Other possibilities, like the tree-traversal memoization applied in older
versions of Guile, didn't feel like they would have such a profound impact on
the evaluation process, and I felt they were going to be harder for me to
implement.
But still, there's a lot more we can obtain from the data.

When comparing `mes-gcc` and `mes-m2`, we saw that small tricks will not be
enough if we want to improve the speed of the bootstrapping process. It's not
that hard to write some code in a different way so GCC does its magic and
optimizes for a 10% improvement, but when it comes to M2-Planet we are alone:
there's no compiler optimization to help us, and **we must be smart and use
optimizations that go deeper, those that happen in the CPU**. Bytecode could
give us some of that *data locality* we need.

We must remember the profiler was run on `mes-gcc`, but the information we
obtained from it can be extrapolated to the `mes-m2` case: **it is the
current implementation of `mes` that is inefficient, as it was designed to be
simple, not fast**. All the `push`ing and `pop`ing of Scheme values on that
internal managed stack simply happens too often.

An evaluator that runs an already processed program is, potentially, way
faster than one that has to traverse the tree over and over again while doing
all this `goto` soup, which is also pretty hard to read. More specifically, a
bytecode interpreter would not only reduce the execution time of the
functions at the top of the profiler's list, but also of many others related
to them in the very long tail of function calls that I didn't bother to paste
in this post. **This cascade effect could add up to a huge performance
improvement**, beyond that 50% of execution time we found in the top 5
functions.

Everything that can be processed ahead of time (macro expansion, lambda
hoisting, argument checking, compilation...) should be done first, and only
once if possible, so the evaluator can focus on its main goal: running the
code as fast as possible.

What I didn't mention, though, are all the functions that don't look relevant
in the profiling: they are just good.
The garbage collector and allocator in `mes` simply work and go unnoticed in
the profiling, with about 1% of execution time in the `gc_copy` function and
a very fast allocation that only becomes noticeable when a huge number
(hundreds of millions) of values are allocated.

When it comes to readability and auditability, one of the goals of the
project, I personally consider the `eval_apply` function very hard to follow.
There are still many tricks in it that I don't fully understand, and I have
already spent several weeks reading and hacking on the interpreter, which
isn't that large after all. Rewriting it is a chance to make it easier to
read.


### Work done

Understanding the interpreter internals isn't an immediate task, so in order
to get used to it I decided to go for some low-hanging fruit first.

First, I moved some operations to builtins and even created some new ones.
The performance gain I saw was simply **zero**, probably because the
benchmark is not designed to detect small improvements and because the main
performance bottleneck is in the `eval_apply` function.

The same thing happened when I updated M2-Planet and added `switch`
statements to replace chained `if`s that were only there because previous
versions of M2-Planet didn't support `switch`. The change affected
`mes-gcc`'s performance slightly, but not enough to consider it a big win.
`mes-m2` remained mostly the same, as M2-Planet doesn't apply any kind of
optimization to `switch` statements. I still consider this change important
for readability, so much so that it helped me detect a bug in the string
reader (the newline escaping was broken), which I also fixed.

The `eval_apply` function has been a puzzle to me for a long time, to the
point that I started to look for any possibility to make it shorter, simpler,
or anything that could help me understand it. In that process I found some
very specific cases in that `goto` soup that mentioned `pmatch`.
These were the result of a hack that avoided a hygiene issue in the `pmatch`
macro. I changed the variable names in the macro and removed those special
cases, slightly reducing the complexity of the `eval_apply` function and
drawing attention to the lack of hygiene in our macro expander, which, by the
way, is also intertwined with the `eval_apply` function.

Well, and I wrote the benchmarking tool, which happens to be very useful as a
test for all these small changes, too.

You can find all the changes in the `faster` branch of the following
repo[^regret]:

<http://github.com/ekaitz-zarraga/mes>

[^regret]: What you cannot see is all the changes I tried and regretted
    later.

### Next steps

None of the work done really tackles the root issue, but hacking around is
important to familiarize yourself with the code, and I'm the kind of person
that requires quite a bit of context before getting into action.

I think now I'm ready for the real deal: **the evaluator swallows most of the
time, and we need to change it**. We just need a plan.

First I have to prepare for the big change. Removing the macro expansion from
the famous `eval_apply` function and reducing it to a plain Scheme evaluator
(renamed to `eval_core`) that can only work with the basic forms and the
builtins would help me on two fronts: working on the macro expansion alone,
opening the door to writing a fully hygienic macro expander from scratch; and
reducing the `eval_core` function even more so it can be replaced later.

Currently, as I write these lines, I just finished the removal. I can `git
show` and see all the macro expansion related functionality in red.

Now I can get to work. I have to write a macro expander. I've yet to decide
whether I should write it first in Scheme (without macros) and translate it
later, or write it directly in C. In any case, the macro expansion step would
be separated from the evaluation.
The separation is not absolute, because the macro expander would make use of
the brand new `eval_core` function to run the macros, but at least we would
have the two concerns clearly split in the code. Consider this step 0.

With the split done, it would be time to start thinking about compilation and
evaluation. At first I didn't know very well what to do, because if the
evaluator works on bytecode, how would I expand the macros then?

I gave this too much thought, but it wasn't that difficult, really. If the
macro expander is not only a macro expander but also a compiler, the bodies
of the macros can be compiled as they come. The macro expansion would still
work through the `eval_core` function, the same as it did, but now it
wouldn't eval Scheme code, but bytecode.

Even more: we could compile all the Scheme code as we read it for macro
expansion, so we could use helper functions from the current file in the
macro expansion step, and we would traverse the tree only once.

Time to write a macro expander, then.

Another thing I've never done before. I'm sure it will be a lot of fun.