<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Claudio Santini</title>
    <description>The latest articles on Forem by Claudio Santini (@hireclaudio).</description>
    <link>https://forem.com/hireclaudio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F290360%2F32220dab-09cd-465e-b842-6dd1911ffa58.png</url>
      <title>Forem: Claudio Santini</title>
      <link>https://forem.com/hireclaudio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hireclaudio"/>
    <language>en</language>
    <item>
      <title>Audiblez: Generate audiobooks from e-books https://claudio.uk/posts/audiblez-v4.html</title>
      <dc:creator>Claudio Santini</dc:creator>
      <pubDate>Tue, 18 Mar 2025 14:05:59 +0000</pubDate>
      <link>https://forem.com/hireclaudio/audiblez-generate-audiobooks-from-e-bookshttpsclaudioukpostsaudiblez-v4html-3anm</link>
      <guid>https://forem.com/hireclaudio/audiblez-generate-audiobooks-from-e-bookshttpsclaudioukpostsaudiblez-v4html-3anm</guid>
      <description></description>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Unvibe: A Python Test-Runner that generates valid code</title>
      <dc:creator>Claudio Santini</dc:creator>
      <pubDate>Tue, 18 Mar 2025 11:55:40 +0000</pubDate>
      <link>https://forem.com/hireclaudio/unvibe-a-python-test-runner-that-generates-valid-code-5286</link>
      <guid>https://forem.com/hireclaudio/unvibe-a-python-test-runner-that-generates-valid-code-5286</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vibe coding is fine for prototypes. When projects get complicated, vibing doesn't work&lt;/li&gt;
&lt;li&gt;&lt;a href="https://claudio.uk/posts/unvibe.html" rel="noopener noreferrer"&gt;Unvibe&lt;/a&gt; is a Python test runner that quickly generates many alternative implementations for the functions
and classes you annotate with &lt;code&gt;@ai&lt;/code&gt;, and re-runs your unit-tests until it finds a correct implementation.&lt;/li&gt;
&lt;li&gt;So you "Vibe the unit-tests", and then search for a correct implementation.&lt;/li&gt;
&lt;li&gt;This scales to larger projects and applies to existing code bases, as long as they are decently unit-tested.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A different way to generate code with LLMs
&lt;/h2&gt;

&lt;p&gt;In my daily work as a consultant, I often deal with large pre-existing code bases.&lt;/p&gt;

&lt;p&gt;I use GitHub Copilot a lot.&lt;br&gt;
It's now basically indispensable, but I use it mostly for generating boilerplate code, or figuring out how to use a library.&lt;br&gt;
As the code gets more logically nested though, Copilot crumbles under the weight of complexity. It doesn't know how things should fit together in the project.&lt;/p&gt;

&lt;p&gt;Other AI tools, like Cursor or Devin, are pretty good at quickly generating working prototypes,&lt;br&gt;
but they are not great at dealing with large existing codebases, and they have a very low success rate for my kind of daily work.&lt;br&gt;
You find yourself in an endless loop of prompt tweaking, and at that point, I'd rather write the code myself with&lt;br&gt;
the occasional help of Copilot.&lt;/p&gt;

&lt;p&gt;Professional coders know what code they want and can define it with unit-tests; &lt;strong&gt;we don't want to endlessly tweak the prompt.&lt;br&gt;
Also, we want it to work in the larger context of the project, not just in isolation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article I am going to introduce a pretty new approach (at least in literature), and a Python library that implements it:&lt;br&gt;
a tool that generates code &lt;strong&gt;from&lt;/strong&gt; unit-tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My basic intuition was this: shouldn't we be able to drastically speed up the generation of valid programs, while&lt;br&gt;
ensuring correctness, by using unit-tests as reward function for a search in the space of possible programs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I looked in the academic literature, and the idea is not entirely new: it's reminiscent of the&lt;br&gt;
approach used in DeepMind FunSearch, AlphaProof, AlphaGeometry and other experiments like TiCoder: see Research Chapter for pointers to relevant papers.&lt;br&gt;
Writing correct code is akin to proving a mathematical theorem. We are basically proving a theorem&lt;br&gt;
using Python unit-tests instead of Lean or Coq as an evaluator.&lt;/p&gt;

&lt;p&gt;If you are not familiar with test-driven development, you can read about &lt;a href="https://en.wikipedia.org/wiki/Test-driven_development" rel="noopener noreferrer"&gt;TDD&lt;/a&gt;&lt;br&gt;
and &lt;a href="https://en.wikipedia.org/wiki/Unit_testing" rel="noopener noreferrer"&gt;Unit-Tests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foav6r1ygobdnl0ba7fju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foav6r1ygobdnl0ba7fju.png" alt="Image description" width="800" height="633"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;I've implemented this idea in a Python library called Unvibe. It implements a variant of Monte Carlo Tree Search&lt;br&gt;
that invokes an LLM to generate code for the functions and classes in your code that you have&lt;br&gt;
decorated with &lt;code&gt;@ai&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unvibe supports most of the popular LLMs: Ollama, OpenAI, Claude, Gemini, DeepSeek.&lt;/p&gt;

&lt;p&gt;Unvibe uses the LLM to generate a few alternatives, and runs your unit-tests as a test runner (like &lt;code&gt;pytest&lt;/code&gt; or &lt;code&gt;unittest&lt;/code&gt;).&lt;br&gt;
&lt;strong&gt;It then feeds the errors returned by failing unit-tests back to the LLM, in a loop that maximizes the number&lt;br&gt;
of unit-test assertions passed&lt;/strong&gt;. This is done with a sort of tree search that tries to balance&lt;br&gt;
exploitation and exploration.&lt;/p&gt;

&lt;p&gt;As explained in the DeepMind FunSearch paper, having a rich score function is key for the success of the approach.&lt;br&gt;
You can define your tests by inheriting from the usual &lt;code&gt;unittest.TestCase&lt;/code&gt; class, but if you use &lt;code&gt;unvibe.TestCase&lt;/code&gt; instead&lt;br&gt;
you get a more precise scoring function (basically we count up the number of assertions passed rather than just the number&lt;br&gt;
of tests passed).&lt;/p&gt;
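
&lt;p&gt;To illustrate why counting assertions gives a richer signal than counting tests, here is a minimal sketch. It is not Unvibe's actual implementation: &lt;code&gt;CountingTestCase&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt; and &lt;code&gt;candidate_add&lt;/code&gt; are hypothetical names, and only &lt;code&gt;assertEqual&lt;/code&gt; is instrumented.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import unittest


class CountingTestCase(unittest.TestCase):
    """Hypothetical helper (not part of unvibe): counts assertEqual outcomes
    instead of aborting the test at the first failed assertion."""

    passed = 0
    total = 0

    def assertEqual(self, first, second, msg=None):
        cls = type(self)
        cls.total += 1
        try:
            super().assertEqual(first, second, msg)
            cls.passed += 1
        except AssertionError:
            pass  # swallow the failure so later assertions still count


def score(test_case_class):
    """Run every test in the class and return the fraction of assertions passed."""
    test_case_class.passed = 0
    test_case_class.total = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case_class)
    suite.run(unittest.TestResult())
    if test_case_class.total == 0:
        return 0.0
    return test_case_class.passed / test_case_class.total


def candidate_add(a, b):
    # Deliberately buggy candidate, standing in for generated code.
    return a + b if a != 2 else 0


class AddTests(CountingTestCase):
    def test_small(self):
        self.assertEqual(candidate_add(1, 2), 3)
        self.assertEqual(candidate_add(2, 2), 4)  # this one fails

    def test_zero(self):
        self.assertEqual(candidate_add(0, 0), 0)


print(score(AddTests))  # 2 of 3 assertions passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a plain &lt;code&gt;unittest.TestCase&lt;/code&gt;, &lt;code&gt;test_small&lt;/code&gt; would fail as a whole and the candidate would score 1 test out of 2. Counting assertions instead gives 2 out of 3, so the search can tell a nearly correct candidate from a hopeless one. One limitation of this sketch: if the candidate raises an exception, the remaining assertions of that test are never reached and are not counted.&lt;/p&gt;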

&lt;p&gt;It turns out that this approach works very well in practice, even in large existing code bases,&lt;br&gt;
provided that the project is decently unit-tested. This is now part of my daily workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Use Copilot to generate boilerplate code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the complicated functions/classes I know Copilot can't handle&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define unit-tests for those complicated functions/classes (quick-typing with GitHub Copilot)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use Unvibe to generate valid code that passes those unit-tests&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It also happens quite often that Unvibe finds solutions that pass most of the tests but not 100%: &lt;br&gt;
often it turns out some of my unit-tests were misconceived, and it helps figure out what I really wanted.&lt;/p&gt;
&lt;h2&gt;
  
  
  Example: how to use unvibe
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;pip install unvibe&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Decorate the functions/classes you want to generate with &lt;code&gt;@ai&lt;/code&gt;. Type annotations and proper comments will help the LLM figure out what you want. For example:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lisp.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unvibe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;


&lt;span class="nd"&gt;@ai&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A simple lisp interpreter implemented in plain Python&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let's define a few unit-tests to specify the behaviour of the function.&lt;br&gt;
Here Copilot can help us come up quickly with a few test cases.&lt;br&gt;
As you can see, we are looking to implement a Lisp interpreter that supports basic&lt;br&gt;
python functions (e.g. range) and returns python-compatible lists.&lt;/p&gt;

&lt;p&gt;This simple Lisp interpreter is a good example because it's the sort&lt;br&gt;
of function that LLMs (even reasoning models) cannot generate from scratch, but they can&lt;br&gt;
get there with Unvibe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_lisp.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unvibe&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lisp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lisp&lt;/span&gt;


&lt;span class="c1"&gt;# Inheriting unvibe.TestCase (instead of unittest.TestCase) gives a more granular scoring function
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LispInterpreterTestClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unvibe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(+ 1 2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(* 2 3)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_nested&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(* 2 (+ 1 2))&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(* (+ 1 2) (+ 3 4))&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(list 1 2 3)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_call_python_functions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(list (range 3))&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(sum (list 1 2 3))&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3) Now run &lt;code&gt;unvibe&lt;/code&gt; and let it search for a valid implementation that passes all the tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m unvibe lisp.py test_lisp.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unvibe will run until it reaches a maximum number of iterations or until it finds a solution that passes all the tests, in which case it will stop early.&lt;br&gt;
The output will be written to a local file &lt;code&gt;unvibe_lisp_&amp;lt;timestamp&amp;gt;.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Unvibe Execution output.
# This implementation passed all tests
# Score: 1.0
# Passed assertions: 7/7 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lisp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ( &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SyntaxError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unexpected EOF&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Remove ')'
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SyntaxError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unexpected )&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;list&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Call Python functions
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;globals&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, Unvibe has found a valid implementation. At this point in my typical workflow, I would add more tests&lt;br&gt;
and let Unvibe search for other solutions.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;For more details, see the &lt;a href="https://github.com/santinic/unvibe" rel="noopener noreferrer"&gt;🐙 GitHub Project page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To configure unvibe, create a &lt;code&gt;.unvibe.toml&lt;/code&gt; file in your project folder with the following example configuration.&lt;br&gt;
You can use your local Ollama or an OpenAI, Anthropic, Gemini or DeepSeek API key.&lt;/p&gt;

&lt;p&gt;I've found that the latest Claude works well for generating complex code, and it's also one of the cheapest, with one of the largest context windows.&lt;br&gt;
I've also seen all my benchmark tests pass with qwen2.5-coder:7b running locally on my Macbook via Ollama, but in general larger models are better (&amp;gt; 20B params).&lt;br&gt;
So you probably don't want to run the LLM locally for this use, unless you have a solid GPU.&lt;br&gt;
To prevent the same code from being generated multiple times, unvibe uses a cache provider; by default this is a cache on disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# .unvibe.toml&lt;/span&gt;

&lt;span class="nn"&gt;[ai]&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claude"&lt;/span&gt;  &lt;span class="c"&gt;# or "ollama", "openai", "gemini"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sk-..."&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claude-3-5-haiku-20241022"&lt;/span&gt; &lt;span class="c"&gt;# try also 'claude-3-7-sonnet-20250219'&lt;/span&gt;
&lt;span class="py"&gt;max_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;

&lt;span class="c"&gt;#[ai]&lt;/span&gt;
&lt;span class="c"&gt;#provider = "ollama"&lt;/span&gt;
&lt;span class="c"&gt;#host = "http://localhost:11434"&lt;/span&gt;
&lt;span class="c"&gt;#host = "http://host.docker.internal:11434" # if running from Docker&lt;/span&gt;
&lt;span class="c"&gt;#model = "qwen2.5-coder:7b"  # try also "llama3.2:latest", "deepseek-r1:7b"&lt;/span&gt;

&lt;span class="nn"&gt;[search]&lt;/span&gt;
&lt;span class="py"&gt;initial_spread&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;     &lt;span class="c"&gt;# How many random tries to make at depth=0.&lt;/span&gt;
&lt;span class="py"&gt;random_spread&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;       &lt;span class="c"&gt;# How many random tries to make before selecting the best move.&lt;/span&gt;
&lt;span class="py"&gt;max_depth&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;          &lt;span class="c"&gt;# Maximum depth of the search tree.&lt;/span&gt;
&lt;span class="py"&gt;max_temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="c"&gt;# Tries random temperature, up to this value.&lt;/span&gt;
                        &lt;span class="c"&gt;# Some models perform better at lower temps; in general,&lt;/span&gt;
                        &lt;span class="c"&gt;# higher temperature = more exploration.&lt;/span&gt;
&lt;span class="py"&gt;max_minutes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;        &lt;span class="c"&gt;# Stop after 60 minutes of search.&lt;/span&gt;
&lt;span class="py"&gt;cache&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;            &lt;span class="c"&gt;# Caches AI responses to a local file to speed up re-runs and&lt;/span&gt;
                        &lt;span class="c"&gt;# save money.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Which models work best
&lt;/h2&gt;

&lt;p&gt;The models that I've found work best are either small coding models (~7B params) or large generic models (&amp;gt;20B params).&lt;/p&gt;

&lt;p&gt;My favourite models for Unvibe are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;qwen2.5-coder:7b&lt;/code&gt;: It's available on Ollama, it runs great on my Macbook M2, and it's very good at coding.&lt;/li&gt;
&lt;li&gt;Claude Haiku: probably the cheapest model that is good at coding.&lt;/li&gt;
&lt;li&gt;Claude Sonnet 3.7: probably the best coding model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reasoning models can help sometimes, but in practice they are slower. I prefer to try my luck with small models on my Mac and then switch to Sonnet 3.7 if I don't get good results.&lt;br&gt;
As a future improvement it would be nice to support multiple models, and have Unvibe swap between them if the score plateaus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandboxing
&lt;/h2&gt;

&lt;p&gt;By default, Unvibe runs the tests on your local machine. This is very risky, because you're running code generated by an LLM: it may try anything.&lt;br&gt;
In practice, most of the projects I work on run in Docker, so I let Unvibe run wild inside a Docker container, where it can't do any harm.&lt;br&gt;
This is the recommended way to run Unvibe.&lt;br&gt;
Another practical solution is to create a new user with limited permissions on your machine, and run Unvibe as that user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Search algorithm
&lt;/h2&gt;

&lt;p&gt;There are multiple approaches to searching the space of possible programs. With Unvibe I aimed for something&lt;br&gt;
that works in practice without requiring a GPU cluster. The idea was that it should be practical to run it only using a Macbook and Ollama.&lt;/p&gt;

&lt;p&gt;DeepMind's FunSearch, which was designed to make mathematical discoveries via program search, uses a variation of genetic programming with LLMs.&lt;br&gt;
It splits the code generation attempts into islands of clusters, lets the clusters evolve (via partial code substitution), and then samples from the clusters with higher&lt;br&gt;
scores. Essentially, clusters are sampled with a probability given by the softmax of their scores (scaled by a temperature).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqcd3lpqn06kb6kwqbbz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqcd3lpqn06kb6kwqbbz.png" alt="Image description" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;
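
&lt;p&gt;To make the sampling step concrete, here is a minimal Python sketch of softmax sampling with a temperature. It is not FunSearch's actual code: the function names &lt;code&gt;softmax&lt;/code&gt; and &lt;code&gt;sample_cluster&lt;/code&gt; and the shape of the inputs (a plain list of cluster scores) are assumptions made for illustration.&lt;/p&gt;

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities.

    A lower temperature concentrates probability on the best scores;
    a higher temperature flattens the distribution.
    """
    if not temperature > 0:
        raise ValueError("temperature must be positive")
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_cluster(cluster_scores, temperature=1.0, rng=random):
    """Pick a cluster index with probability softmax(score / temperature)."""
    probabilities = softmax(cluster_scores, temperature)
    indices = range(len(cluster_scores))
    return rng.choices(indices, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    scores = [0.2, 0.5, 0.9]  # hypothetical scores, one per cluster
    print(softmax(scores, temperature=0.1))
    print(sample_cluster(scores, temperature=0.1))
```

&lt;p&gt;With a low temperature the best cluster is picked almost every time, while a higher temperature keeps weaker clusters in play, which helps the search avoid getting stuck on one line of attack.&lt;/p&gt;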

&lt;p&gt;FunSearch is designed to run in large datacenters. For Unvibe I currently use a much simpler tree search: we start with a random initial tree spread,&lt;br&gt;
attempting different LLM temperatures, and then pick the most promising nodes and try again. This is much more suitable for running on a MacBook.&lt;br&gt;
So you end up with a search tree that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9osxwc28nx97nox0wuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9osxwc28nx97nox0wuz.png" alt="Image description" width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a live UI I'm working on to explore the search tree as it progresses. It will likely ship in Unvibe v2. You can already give it a try&lt;br&gt;
by passing the parameter &lt;code&gt;-d&lt;/code&gt;.&lt;/p&gt;
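
&lt;p&gt;The tree search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Unvibe's actual implementation: &lt;code&gt;generate&lt;/code&gt; stands in for an LLM call that produces a candidate from a parent at a given temperature, &lt;code&gt;score&lt;/code&gt; stands in for the fraction of passing tests, and the parameters &lt;code&gt;keep&lt;/code&gt; and &lt;code&gt;max_depth&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
def tree_search(generate, score, temperatures=(0.0, 0.5, 1.0), keep=2, max_depth=3):
    """Expand the most promising candidates level by level.

    generate(parent, temperature) returns a new candidate (parent is None at the root).
    score(candidate) returns a float between 0.0 and 1.0, where 1.0 means all tests pass.
    Returns the best (score, candidate) pair found.
    """

    def by_score(node):
        return node[0]

    def evaluate(candidate):
        return (score(candidate), candidate)

    # Initial spread: one attempt per temperature, with no parent.
    nodes = [evaluate(generate(None, t)) for t in temperatures]
    best = max(nodes, key=by_score)

    for _depth in range(max_depth):
        if best[0] >= 1.0:
            break  # every test passes, nothing left to search for
        # Keep only the most promising nodes and branch from each of them.
        promising = sorted(nodes, key=by_score, reverse=True)[:keep]
        nodes = [
            evaluate(generate(parent, t))
            for _parent_score, parent in promising
            for t in temperatures
        ]
        best = max(nodes + [best], key=by_score)

    return best


if __name__ == "__main__":
    # Toy stand-ins: candidates are integers and the "tests" want the number 12.
    def toy_generate(parent, temperature):
        base = 0 if parent is None else parent + 1
        return base + round(temperature * 4)

    def toy_score(candidate):
        return max(0.0, 1 - abs(candidate - 12) / 12)

    print(tree_search(toy_generate, toy_score))
```

&lt;p&gt;Keeping only a handful of nodes per level bounds the number of LLM calls at roughly &lt;code&gt;keep&lt;/code&gt; times the number of temperatures per level, which is what makes this kind of search affordable on a laptop.&lt;/p&gt;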

&lt;h2&gt;
  
  
  TODO
&lt;/h2&gt;

&lt;p&gt;I will extend Unvibe with more features soon, and I'm actively looking for people to help write TypeScript/Java/C#/etc. versions.&lt;br&gt;
Some features I'm working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTML-based UI to explore the search graph and watch the reward function rise. Some of it is already implemented:
for nested code it's nice to see in real time whether the LLM is making progress.&lt;/li&gt;
&lt;li&gt;Support multiple LLMs, and have Unvibe swap between them if the score plateaus.&lt;/li&gt;
&lt;li&gt;Support for other programming languages&lt;/li&gt;
&lt;li&gt;Integration with Pytest&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Research
&lt;/h2&gt;

&lt;p&gt;Similar approaches have been explored in various research papers from DeepMind and Microsoft Research:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nature.com/articles/s41586-023-06924-6" rel="noopener noreferrer"&gt;FunSearch: Mathematical discoveries from program search with large language models&lt;/a&gt; (Nature)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2502.03544" rel="noopener noreferrer"&gt;Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2&lt;/a&gt; (arXiv)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/" rel="noopener noreferrer"&gt;AI achieves silver-medal standard solving International Mathematical Olympiad problems&lt;/a&gt; (DeepMind)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/abs/2404.10100v1" rel="noopener noreferrer"&gt;LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation&lt;/a&gt; (arXiv)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Original Article: &lt;a href="https://claudio.uk/posts/unvibe.html" rel="noopener noreferrer"&gt;Unvibe: Generate code that passes Unit-Tests&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Project: &lt;a href="https://github.com/santinic/unvibe" rel="noopener noreferrer"&gt;Unvibe&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>tdd</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
