I’m currently in the market for an LLM eval system / project / pipeline I can use in my Go- and LLM-powered applications. This is a post to gather my thoughts and expose them to the wide internet. Think of it like a stream-of-thought post that might give you some ideas and food for thought as well.
Evals are kind of like traditional software tests, but they also aren't.
They are something I want to run automatically on code changes, preferably in my CI system, something that can tell me whether changes to the code and prompts are for the better or for the worse, both for the current changes and as a trend over time.
But they are nondeterministic in a way we developers of deterministic code aren't used to yet. They don't necessarily need to pass all the time, at least not all of them. They can diverge from their previously determined result and still be acceptable. (I think? Maybe it really is a boolean result we're after?)
I don't want this to be part of some platform where I need to log in to see my changes. I want it as part of my regular pipeline, which I think is important. Third-party platforms have the advantage of being able to incorporate real data from real users via some form of telemetry, which is nice, but I think the core signal on whether changes are good or bad should live within the project itself. Just like unit tests.
But the “real data” aspect is interesting, and reminds me of fuzz testing. Given a prompt and some real-world context data, what is the output? Can this be fed back into the project as some form of eval test corpus?
This is the system I think I want:
If an issue comes up in the deployed app, I want to be able to reproduce the problem, add it to my data corpus for evals, fix the problem, and make the app more robust. This isn’t that different from how I would normally fix a bug.
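To make that concrete, here's a minimal sketch of what such a corpus might look like on the Go side. Everything here is made up for illustration: the `testdata/evals.json` file, the field names, the whole shape of it.

```go
package app_test

import (
	"encoding/json"
	"os"
	"testing"
)

// evalCase is one entry in the eval corpus: a prompt (and expected answer)
// captured from a real issue in the deployed app.
type evalCase struct {
	Name     string `json:"name"`
	Prompt   string `json:"prompt"`
	Expected string `json:"expected"`
}

// loadCorpus reads the corpus from a JSON file under testdata/,
// just like any other Go test fixture.
func loadCorpus(t *testing.T, path string) []evalCase {
	t.Helper()
	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatalf("reading eval corpus: %v", err)
	}
	var cases []evalCase
	if err := json.Unmarshal(data, &cases); err != nil {
		t.Fatalf("parsing eval corpus: %v", err)
	}
	return cases
}

func TestEvalCorpus(t *testing.T) {
	for _, c := range loadCorpus(t, "testdata/evals.json") {
		t.Run(c.Name, func(t *testing.T) {
			// Call the model with c.Prompt and score the output
			// against c.Expected here.
		})
	}
}
```

Every new bug report becomes one more entry in the file, and the corpus only grows.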
Go has a great testing toolchain built into the official language tooling. If eval output is binary (pass or fail), just like test cases, then what I really need is code that maps the result from a language model to a pass/fail. The test harness and existing tooling can take care of the rest, and there's no need to reinvent the wheel in terms of tooling, output handling, and so on.
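Something like this sketch is all a single eval would need, assuming a hypothetical `generate` function standing in for the real model call in the app:

```go
package app_test

import (
	"strings"
	"testing"
)

// generate is a stand-in for whatever function in the app actually calls
// the language model; the name and signature are made up for this sketch.
func generate(prompt string) string {
	return "The capital of Denmark is Copenhagen."
}

// TestEvalCapital maps the model output to plain pass/fail, so `go test`,
// CI, and all the existing tooling just work.
func TestEvalCapital(t *testing.T) {
	got := generate("What is the capital of Denmark?")
	if !strings.Contains(strings.ToLower(got), "copenhagen") {
		t.Errorf("expected the answer to mention Copenhagen, got %q", got)
	}
}
```

Run it with `go test` like any other test; CI doesn't need to know it's an eval at all.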
This would basically be testify on steroids, tailored to LLMs: embedding similarity checks (cosine similarity or similar), LLM-as-a-judge, and probably other scoring methods. Human evaluators would not be possible in this way, I think, since there's no way to pause and wait for evaluation as part of the test harness. (Unless, of course, they could just wait for input indefinitely and continue when available. But I'm sure many timeouts in the pipeline would have expired by then.)
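As a sketch of one such scoring method, here's what a cosine similarity scorer could look like, assuming the embeddings have already been computed (by whatever embedding model the app uses) as plain float64 slices:

```go
package evals

import "math"

// CosineSimilarity returns the cosine similarity of two embedding vectors,
// in the range [-1, 1]. It assumes both vectors have the same length.
func CosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
```

In a test, that becomes a threshold check: embed the model output and a reference answer, and fail if the similarity drops below some value, say 0.8.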
I like conceptually simple systems. That's why I'm attracted to Go, SQLite, and the like, and not particularly drawn to big platform-like tools/frameworks such as LangChain. Maybe what I need is just an evals Go library with a relatively small API surface area, one that can't do everything but does enough to be useful?
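Purely as a thought experiment (this isn't an existing library, and the names `Sample`, `Scorer`, and `Run` are made up), the public API might not need to be much bigger than this:

```go
package evals

import "testing"

// Sample is a single eval: the input, what the model produced, and a reference.
type Sample struct {
	Input    string
	Output   string
	Expected string
}

// Scorer maps a sample to a score between 0 and 1.
type Scorer func(Sample) float64

// Run scores the sample and fails the test if the score is below the threshold.
// Any scorer plugs in here: lexical checks, embedding similarity, LLM-as-a-judge.
func Run(t *testing.T, s Sample, score Scorer, threshold float64) {
	t.Helper()
	if got := score(s); got < threshold {
		t.Errorf("score %.2f below threshold %.2f for input %q", got, threshold, s.Input)
	}
}
```

Everything else, like corpus handling and tracking trends over time, could live in regular Go test code around it.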
I’m Markus, an independent software consultant and developer. 🤓✨ Reach me at markus@maragu.dk.