<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Peter Truchly</title>
    <description>The latest articles on Forem by Peter Truchly (@peter_truchly_4fce0874fd5).</description>
    <link>https://forem.com/peter_truchly_4fce0874fd5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1627133%2F754d0602-d0ad-4fa7-bc25-66495c379400.jpg</url>
      <title>Forem: Peter Truchly</title>
      <link>https://forem.com/peter_truchly_4fce0874fd5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/peter_truchly_4fce0874fd5"/>
    <language>en</language>
    <item>
      <title>Why Success Favors Action and How This Relates to AI</title>
      <dc:creator>Peter Truchly</dc:creator>
      <pubDate>Mon, 08 Sep 2025 17:49:19 +0000</pubDate>
      <link>https://forem.com/peter_truchly_4fce0874fd5/why-success-favors-action-and-how-this-relates-to-ai-4e54</link>
      <guid>https://forem.com/peter_truchly_4fce0874fd5/why-success-favors-action-and-how-this-relates-to-ai-4e54</guid>
      <description>&lt;p&gt;It's been over a year since I published my view on the state of AI/LLMs and how it could be significantly improved in my post: &lt;a href="https://dev.to/peter_truchly_4fce0874fd5/failure-is-not-an-option-for-ai-and-it-shouldnt-be-1hm2"&gt;"Failure Is Not An Option For AI"&lt;/a&gt;. While scrolling through the &lt;strong&gt;rStar2-Agent&lt;/strong&gt; technical report, I couldn't help but mumble to myself: "I told you so!".&lt;/p&gt;

&lt;p&gt;Consider how quickly the model improved during training, especially given that we are comparing a 14B parameter model to 671B R1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu56ks75ba29evr5unl7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu56ks75ba29evr5unl7f.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This doesn't just show how vast the universe of optimization is; it demonstrates the fundamental principle:&lt;/p&gt;

&lt;p&gt;➡️ &lt;strong&gt;The cost of inaction often far exceeds the cost of a reversible mistake&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When comparing the traditional Chain-of-Thought (CoT) approach to this new "Chain of Action", the most striking difference is how early the environmental feedback provides value to the training process. To appreciate what was achieved, let's look at the technical side. Using this approach effectively during training required a high-throughput execution environment. The one used for this project was capable of handling &lt;strong&gt;45,000 concurrent tool calls&lt;/strong&gt;, returning &lt;strong&gt;feedback in just 0.3s&lt;/strong&gt; on average.&lt;/p&gt;

&lt;p&gt;Compounding is a fundamental principle of investing: invest small sums, compound them well enough and long enough, and becoming the richest person in the world is possible. But the same principle works against a chain of thought. &lt;strong&gt;Compounding subtle errors early in a reasoning process leads to a long, inefficient, and ultimately incorrect reasoning trajectory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anyone who has sat through lengthy corporate meetings or a drawn-out group decision-making process understands this intuitively. The first practical attempt provides more knowledge than a month of theoretical planning. Feedback, even when negative, is crucial and must be delivered swiftly. I cannot stress this enough.&lt;/p&gt;

&lt;p&gt;The emergent ability of the model trained with this novel approach is particularly striking: &lt;strong&gt;The model learned to react productively to negative feedback.&lt;/strong&gt; Researchers observed the model using specific "forking" and "reflection" tokens. It was effectively talking to itself: course-correcting, pausing to analyze an error, and exploring alternative approaches.&lt;/p&gt;

&lt;p&gt;This suggests a universal success formula, for humans and AI alike:&lt;/p&gt;

&lt;p&gt;➡️ &lt;strong&gt;Form a hypothesis, take action, observe feedback, and repeat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best part of this story is that the rStar2-Agent codebase has been released under an MIT license on &lt;a href="https://github.com/microsoft/rStar" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://www.arxiv.org/pdf/2508.20722" rel="noopener noreferrer"&gt;rStar2-Agent: Agentic Reasoning Technical Report&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://github.com/microsoft/rStar" rel="noopener noreferrer"&gt;rStar2-Agent on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>community</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Fast Memory Copying in C#/.NET (Cache, AVX, Threads, Unsafe, Alternatives)</title>
      <dc:creator>Peter Truchly</dc:creator>
      <pubDate>Fri, 20 Dec 2024 20:06:07 +0000</pubDate>
      <link>https://forem.com/peter_truchly_4fce0874fd5/fast-memory-copying-in-cnet-cache-avx-threads-unsafe-alternatives-h57</link>
      <guid>https://forem.com/peter_truchly_4fce0874fd5/fast-memory-copying-in-cnet-cache-avx-threads-unsafe-alternatives-h57</guid>
      <description>&lt;p&gt;Numerous &lt;strong&gt;copy routine&lt;/strong&gt; implementations are readily available in .NET. If I were to simply list them alongside a few benchmark numbers and charts, it wouldn't make for a very interesting article.&lt;/p&gt;

&lt;p&gt;⚠️ What if I told you upfront that &lt;strong&gt;none&lt;/strong&gt; of these routines &lt;strong&gt;is designed to be the absolute fastest&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;If you're interested in a basic comparison, I recommend checking out this article here on &lt;a href="https://dev.to/j0nimost/what-is-the-best-way-to-copy-an-array-1ch4"&gt;dev.to: What is the best way to copy an array?&lt;/a&gt; or a more detailed, older comparison: &lt;a href="https://code4k.blogspot.com/2010/10/high-performance-memcpy-gotchas-in-c.html" rel="noopener noreferrer"&gt;High performance memcpy gotchas in C#&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here, I'll outline a list of options, but this is far from the whole story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A simple &lt;code&gt;for&lt;/code&gt; loop (hint: &lt;code&gt;foreach&lt;/code&gt; is usually a bit faster)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Array.Copy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Span.CopyTo&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Buffer.BlockCopy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Buffer.MemoryCopy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Marshal.Copy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Unsafe.CopyBlock&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Imported &lt;code&gt;memcpy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're currently struggling with slow array or memory copy operations, try one of the functions on this list.&lt;/p&gt;
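&lt;p&gt;As a quick, hedged sketch (not a benchmark), several of the options above can be exercised on the same buffer like this. The &lt;code&gt;unsafe&lt;/code&gt; block requires compiling with &lt;code&gt;AllowUnsafeBlocks&lt;/code&gt; enabled:&lt;/p&gt;

```csharp
using System;

byte[] src = new byte[8 * 1024 * 1024];
byte[] dst = new byte[src.Length];
new Random(42).NextBytes(src);

// Managed options: all produce the same result.
Array.Copy(src, dst, src.Length);             // classic array copy
src.AsSpan().CopyTo(dst);                     // span-based copy
Buffer.BlockCopy(src, 0, dst, 0, src.Length); // byte-offset based copy

// Pointer-based option, typically the fastest built-in for large blocks.
unsafe
{
    fixed (byte* pSrc = src, pDst = dst)
    {
        Buffer.MemoryCopy(pSrc, pDst, dst.Length, src.Length);
    }
}
```

&lt;p&gt;All four calls leave &lt;code&gt;dst&lt;/code&gt; with an identical copy of &lt;code&gt;src&lt;/code&gt;; they differ only in overhead and in how the runtime dispatches the underlying copy.&lt;/p&gt;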

&lt;p&gt;However, there are some elephants in the room, and I plan to uncover a few of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foefzoe7nnmygbvzneh7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foefzoe7nnmygbvzneh7o.png" alt="Image description" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Elephant No.1 - Cache Pollution
&lt;/h2&gt;

&lt;p&gt;What you might read in many places is that framework functions are already highly optimized. This is true, but they are &lt;strong&gt;not&lt;/strong&gt; necessarily optimized for &lt;strong&gt;the highest possible speed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Built-in functions provided by the .NET framework are optimized in various ways. One significant consideration is preventing &lt;strong&gt;cache pollution&lt;/strong&gt;. Wait - what? Yes, x86 CPUs achieve their speed primarily thanks to their caches. If the cache is disabled or not utilized properly, code execution speed drops dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Pollution Explained
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cache pollution when copying large data blocks: this occurs when frequently used data gets evicted from the CPU cache by the data being copied. Chances are that once you are done copying, you won't touch the same data again, so placing it in the CPU cache is unnecessary.&lt;br&gt;
For example, imagine a network stack - once data is sent, it's unlikely to be touched again. Similarly, when loading textures with the CPU for GPU usage, caching may be unnecessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standard .NET functions mitigate this issue when copying larger blocks (usually sizes above 1MB) by using &lt;strong&gt;non-temporal access&lt;/strong&gt;, bypassing the cache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This might not suit your use case. For instance, if you are waiting for the copy to complete before doing anything else - some cache pollution would be an acceptable tradeoff. Especially when the data is likely to be cached anyway, as in simulations or CPU rendering.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
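&lt;p&gt;For illustration, a non-temporal copy can be sketched with the &lt;code&gt;System.Runtime.Intrinsics.X86&lt;/code&gt; API. This is a simplified sketch, not the framework's actual implementation; &lt;code&gt;Sse2.StoreAlignedNonTemporal&lt;/code&gt; requires a 16-byte-aligned destination, which the aligned allocation below guarantees:&lt;/p&gt;

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;

// Copy using non-temporal stores (MOVNTDQ): written data bypasses the cache
// instead of evicting useful lines. The destination must be 16-byte aligned.
static unsafe void CopyNonTemporal(byte* src, byte* dst, int length)
{
    int i = 0;
    if (Sse2.IsSupported)
    {
        for (; i + 16 <= length; i += 16)
        {
            var v = Sse2.LoadVector128(src + i);
            Sse2.StoreAlignedNonTemporal(dst + i, v);
        }
        Sse.StoreFence(); // order non-temporal stores before continuing
    }
    for (; i < length; i++) dst[i] = src[i]; // scalar tail / non-SSE fallback
}

bool ok = true;
unsafe
{
    const int size = 1024 * 1024;
    byte* src = (byte*)NativeMemory.AlignedAlloc(size, 64);
    byte* dst = (byte*)NativeMemory.AlignedAlloc(size, 64);
    for (int i = 0; i < size; i++) src[i] = (byte)i;
    CopyNonTemporal(src, dst, size);
    for (int i = 0; i < size; i++) if (dst[i] != (byte)i) { ok = false; break; }
    NativeMemory.AlignedFree(src);
    NativeMemory.AlignedFree(dst);
}
```

&lt;p&gt;The store fence matters: non-temporal stores are weakly ordered, so without it, later reads could observe stale data.&lt;/p&gt;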

&lt;p&gt;▶️ How can we observe this? And what can we do about it?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cfbohb7i6ionpygqme1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cfbohb7i6ionpygqme1.png" alt="Image description" width="770" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart above compares various buffer sizes (ranging from 1MB to 100MB) and copy methods. The leftmost data points, for the 1MB block, show great performance for &lt;code&gt;Buffer.MemoryCopy&lt;/code&gt; and &lt;code&gt;Unsafe.CopyBlock&lt;/code&gt;, the two best methods available in .NET for memory copying. However, &lt;strong&gt;performance falls sharply past 1MB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To illustrate the real hardware limitation, notice the comparison between the AVX-based copy routines using 256-bit (orange) and 512-bit (red) vectors, and the difference between normal loads and the non-temporal 512-bit variant (light blue). These tests were run on a Ryzen 9950X, with &lt;strong&gt;48kB of L1&lt;/strong&gt; data cache, &lt;strong&gt;1024kB of L2&lt;/strong&gt; and &lt;strong&gt;32MB of L3&lt;/strong&gt; available to any &lt;strong&gt;single CPU core&lt;/strong&gt;. Total cache sizes are 1280kB of L1 (16x 32kB + 16x 48kB), 16MB of L2 (16x 1024kB) and 64MB of L3 (2x 32MB).&lt;/p&gt;

&lt;p&gt;Using &lt;strong&gt;cached AVX&lt;/strong&gt; loads, &lt;strong&gt;high copy speed&lt;/strong&gt; is sustained even &lt;strong&gt;for 8MB (and larger) blocks&lt;/strong&gt;. However, when copying blocks that are large relative to the cache size (e.g. 100MB), the built-in functions regain their advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elephant No.2 - Overhead
&lt;/h2&gt;

&lt;p&gt;My first benchmark shows that &lt;strong&gt;splitting larger buffers into smaller chunks can improve performance&lt;/strong&gt; at the expense of increased CPU cache utilization.&lt;/p&gt;

&lt;p&gt;There are other factors causing slowdowns, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed memory&lt;/strong&gt; introduces framework checks for each access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unaligned memory&lt;/strong&gt; access can decrease CPU cache efficiency.&lt;/li&gt;
&lt;li&gt;And finally, with &lt;strong&gt;unmanaged, 64 byte aligned&lt;/strong&gt; memory, there is still a final question: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;▶️ What is the optimal &lt;strong&gt;block size&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;This depends on the actual CPU, but we are trying to strike a balance between call overhead and efficient CPU resource utilization.&lt;/p&gt;
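&lt;p&gt;A minimal sketch of the chunking idea (the 128kB default is an illustrative guess, to be tuned per CPU, not the exact value used in the benchmarks):&lt;/p&gt;

```csharp
using System;

// Copy a large buffer through a series of smaller, cache-friendly calls.
// Smaller calls avoid the runtime's large-block (non-temporal) code path.
static unsafe void ChunkedCopy(byte[] source, byte[] destination, int blockSize = 128 * 1024)
{
    fixed (byte* pSrc = source, pDst = destination)
    {
        for (long offset = 0; offset < source.Length; offset += blockSize)
        {
            long count = Math.Min(blockSize, source.Length - offset);
            Buffer.MemoryCopy(pSrc + offset, pDst + offset,
                              destination.Length - offset, count);
        }
    }
}

byte[] src = new byte[32 * 1024 * 1024];
byte[] dst = new byte[src.Length];
new Random(1).NextBytes(src);
ChunkedCopy(src, dst);
```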

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k6tythv96nlg75o49kn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k6tythv96nlg75o49kn.png" alt="Image description" width="768" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart above shows the throughput for transferring a &lt;strong&gt;32MB buffer&lt;/strong&gt; using various block sizes and methods. The key takeaway is that &lt;code&gt;Buffer.MemoryCopy&lt;/code&gt; (blue) and &lt;code&gt;Unsafe.CopyBlock&lt;/code&gt; (yellow) perform best with block sizes between 8kB and 1MB. Notably, these chunked calls outperform a single call for the whole 32MB buffer (orange).&lt;/p&gt;

&lt;p&gt;Methods represented by horizontal lines do not use variable block sizes. It is worth mentioning that the AVX variants always load 8 vectors (of 256 or 512 bits) before storing them, thus effectively working with 256- and 512-&lt;strong&gt;byte&lt;/strong&gt; blocks regardless of the total buffer size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elephant No.3 - Multi Threading
&lt;/h2&gt;

&lt;p&gt;So far, we have tested only single-threaded performance, as memory speed and cache capacity are often the limiting factors. However, we are not utilizing all CPU resources, which might be costing us some performance!&lt;/p&gt;

&lt;p&gt;▶️ Can we combine the previous techniques with multiple threads?&lt;/p&gt;

&lt;p&gt;Modern CPUs, even desktop ones, are becoming increasingly heterogeneous. By splitting the workload across multiple threads, we might take advantage of more CPU resources:&lt;/p&gt;
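&lt;p&gt;One possible sketch, combining chunking with &lt;code&gt;Parallel.For&lt;/code&gt; (the block size and scheduling here are illustrative; the benchmarked implementation may differ):&lt;/p&gt;

```csharp
using System;
using System.Threading.Tasks;

// Split the buffer into fixed-size chunks and copy them on thread-pool threads.
static void ParallelChunkedCopy(byte[] source, byte[] destination, int blockSize = 128 * 1024)
{
    int blocks = (source.Length + blockSize - 1) / blockSize;
    Parallel.For(0, blocks, b =>
    {
        int offset = b * blockSize;
        int count = Math.Min(blockSize, source.Length - offset);
        source.AsSpan(offset, count).CopyTo(destination.AsSpan(offset, count));
    });
}

byte[] src = new byte[32 * 1024 * 1024];
byte[] dst = new byte[src.Length];
new Random(7).NextBytes(src);
ParallelChunkedCopy(src, dst);
```

&lt;p&gt;Since chunks are disjoint, no synchronization is needed beyond the implicit join at the end of &lt;code&gt;Parallel.For&lt;/code&gt;.&lt;/p&gt;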

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntyklmp5jrjfhxj98rfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntyklmp5jrjfhxj98rfj.png" alt="Image description" width="755" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart above shows the chunked, multi-threaded approach. For reference, the test system uses dual-channel DDR5 6000MT/s memory; since every copied byte is both read and written, the theoretical peak copy rate is about half of the total memory throughput (50% of ~90GB/s).&lt;/p&gt;

&lt;p&gt;What we observe here is a &lt;strong&gt;synergistic effect&lt;/strong&gt; between smaller blocks and multiple threads!&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Standard framework functions are well-optimized, safe, and sufficient for most use cases. If extreme performance is required, there are additional techniques and tradeoffs available.&lt;br&gt;
For example, a single call to &lt;code&gt;Buffer.MemoryCopy&lt;/code&gt; on an 8MB buffer reaches ~30GB/s. It is possible to reach ~220GB/s with multiple transfers of 128kB blocks using multiple threads on the same CPU. This is more than a &lt;strong&gt;7x improvement&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>performance</category>
      <category>dotnet</category>
      <category>programming</category>
    </item>
    <item>
      <title>Old-School Graphics in C# / .Net 8, Part 2: Fireworks and Advanced Vector Extensions (AVX, SSE)</title>
      <dc:creator>Peter Truchly</dc:creator>
      <pubDate>Wed, 20 Nov 2024 14:36:47 +0000</pubDate>
      <link>https://forem.com/peter_truchly_4fce0874fd5/old-school-graphics-in-c-net-8-part-2-fireworks-and-advanced-vector-extensions-avx-sse-583k</link>
      <guid>https://forem.com/peter_truchly_4fce0874fd5/old-school-graphics-in-c-net-8-part-2-fireworks-and-advanced-vector-extensions-avx-sse-583k</guid>
      <description>&lt;p&gt;There's something uniquely captivating about revisiting the passions of our past. I still remember the countless hours spent tinkering with code on my old 386 and 486 PCs, mesmerized by the magic of pixels coming to life on the screen. The simplicity of those days, where an array of pixels and a spark of imagination were all you needed, still holds a special place in my heart. Inspired by that nostalgia, I decided to breathe new life into one of my early projects - a 2D firework simulation - using modern tools and techniques. I am using &lt;a href="https://github.com/VasilijP/mode-13hx" rel="noopener noreferrer"&gt;Mode-13HX&lt;/a&gt; as a basis today, for more details please check my previous part: &lt;a href="https://dev.to/peter_truchly_4fce0874fd5/old-school-graphics-in-c-net-8-part-1-teaching-an-old-dog-new-tricks-introducing-mode-13hx-mpk"&gt;Old-School Graphics in C# / .Net 8,  Part 1&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  History
&lt;/h2&gt;

&lt;p&gt;This was originally written in (or before) the year 2000 in &lt;a href="https://en.wikipedia.org/wiki/Pascal_(programming_language)" rel="noopener noreferrer"&gt;Pascal (programming language)&lt;/a&gt; and compiled with &lt;a href="https://en.wikipedia.org/wiki/Turbo_Pascal" rel="noopener noreferrer"&gt;Turbo Pascal&lt;/a&gt;. It used video mode 13h under MS-DOS: originally 320x200 pixels with an 8-bit palette / 256 colors. I have ported this project multiple times over the last 20+ years, since it is a small and fun project to play with.&lt;br&gt;
I am running it at 4K today, thanks to modern C# and &lt;a href="https://github.com/VasilijP/mode-13hx" rel="noopener noreferrer"&gt;Mode-13HX&lt;/a&gt;. It should work on any platform and architecture where .NET 8 and C# are supported. I added special code paths for AVX-512 (fading) and SSE (color mixing) to speed things up - if you are interested in these, please keep reading, scroll right to the code samples, or check the &lt;a href="https://github.com/VasilijP/fireworks2D" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; right away.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fun with Particles
&lt;/h2&gt;

&lt;p&gt;The core of the simulation revolves around a 2D particle system that emulates fireworks bursting in the night sky. Each firework is composed of multiple particles that are generated, animated, and eventually discarded. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Launch:&lt;/strong&gt; Particles are spawned with initial positions, velocities, and a countdown timer to simulate the upward launch of a firework rocket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explosion Stages:&lt;/strong&gt; Each particle has a timer determining its lifespan, ensuring that old particles are removed to make way for new ones. When the timer runs out, a particle can burst into additional particles (or disappear), creating multiple stages of explosion for a more dynamic display.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gravity and Air:&lt;/strong&gt; Somewhat realistic motion is achieved by applying gravitational acceleration and air resistance, affecting each particle's trajectory over time. The original simulation assumed a constant frame rate, and particle deceleration was expressed as a relative speed loss over a constant time period - for example, 85% of speed after 1/60s. To account for a variable (or unknown) frame rate, I recalculated this into a look-up table (velocity and position decay factors) for all possible frame offsets at microsecond resolution. The maximum frame offset is capped, so we don't run out of the pre-calculated table.&lt;/li&gt;
&lt;/ul&gt;
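&lt;p&gt;The look-up table idea can be sketched like this (the constants and names are illustrative, not the project's actual code):&lt;/p&gt;

```csharp
using System;

// Precompute velocity decay factors for every frame duration up to a cap,
// at 1 microsecond resolution. 85% of speed is kept after one 1/60s tick.
const double DecayPerTick = 0.85;
const double TickSeconds = 1.0 / 60.0;
const int MaxFrameMicros = 100_000; // frames longer than 100ms reuse the last entry

double[] velocityDecay = new double[MaxFrameMicros + 1];
for (int us = 0; us <= MaxFrameMicros; us++)
{
    double t = us / 1_000_000.0;
    velocityDecay[us] = Math.Pow(DecayPerTick, t / TickSeconds); // 0.85^(t/tick)
}

// Per frame: index by the measured frame time, clamped to the table size.
double Decay(double frameSeconds)
{
    int us = Math.Min((int)(frameSeconds * 1_000_000), MaxFrameMicros);
    return velocityDecay[us];
}
```

&lt;p&gt;At runtime, each frame then costs a single table lookup instead of a &lt;code&gt;Math.Pow&lt;/code&gt; call per particle.&lt;/p&gt;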

&lt;p&gt;To create a vibrant and immersive visual experience, the simulation uses additive color blending.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Additive Blending:&lt;/strong&gt; When particles overlap, their colors are added together, increasing brightness and creating a glow effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flare Rendering:&lt;/strong&gt; Each particle is drawn not just as a single point but with a small flare (1+4 pixels) to enhance the luminosity and visual appeal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Color Sets:&lt;/strong&gt; Multiple predefined color sets (e.g.: standard RGB, pastel tones) are used to diversify the fireworks' appearance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A critical aspect of the simulation is the gradual fading of framebuffer to simulate the dissipating light of fireworks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frame Buffer Fading:&lt;/strong&gt; The entire frame buffer undergoes a fading process where pixel brightness decreases over time. This is actually quite an inefficient approach, because we need to swap at least two back buffers. That was not the case in the DOS days; this time, particle drawing and fading are done using a separate (+1 additional) buffer.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Performance Optimizations with Vectorization
&lt;/h2&gt;

&lt;p&gt;To maintain high performance, especially when rendering thousands of particles, the simulation employs vectorization using the SIMD (Single Instruction, Multiple Data) instructions available in modern CPUs.&lt;br&gt;
While it may look obvious to use vectorization to accelerate the particle movement itself, I opted not to do so. In terms of CPU usage, the most expensive parts are color blending, fading, and page flipping (copying the framebuffer into the texture buffer). It might be an interesting exercise to rewrite this whole thing in OpenCL, though. But hey, maybe here is the place for a link to &lt;strong&gt;your repository!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE Optimization for Color Mixing.&lt;/strong&gt; When blending colors of overlapping particles, SSE instructions add the color components (red, green, blue - 8 bits each) in parallel. One funny (or sad) fact: only 24 bits out of the 128-bit vector are used effectively, but it is still worth the performance increase!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;MixColors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Ssse3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsSupported&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;targetVector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateScalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;colorVector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateScalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;targetBytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;targetVector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsByte&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;colorBytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;colorVector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsByte&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="n"&gt;Vector128&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;resultBytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sse2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSaturate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targetBytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;colorBytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resultBytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsUInt32&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;ToScalar&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Fallback to non-SIMD code&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Frame Buffer Fading:&lt;/strong&gt; The fading effect subtracts a small value from each color channel across all pixels. AVX-512 accelerates this by handling multiple (16) pixels simultaneously. This reduces the CPU load and ensures that the fading effect doesn't become a performance bottleneck, even at high resolutions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;VectorSize&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;FadeScenePixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;timePassed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Avx512BW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsSupported&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fadeValue&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;CalculateFadeValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timePassed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;Vector512&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fadeVector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vector512&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;fadeValue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;unsafe&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;fixed&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pScenePixel&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;*)&lt;/span&gt;&lt;span class="n"&gt;pScenePixel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;VectorSize&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;VectorSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;Vector512&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pixelVector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Avx512BW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LoadVector512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="n"&gt;pixelVector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Avx512BW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SubtractSaturate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pixelVector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fadeVector&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="n"&gt;Avx512BW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixelVector&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Fallback to non-SIMD code&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While it would be possible to continue with more optimizations and enhancements, I believe this is a good stopping point for now. I like to keep fun projects small and simple, so they stay fun.&lt;/p&gt;




&lt;p&gt;Source code: &lt;a href="https://github.com/VasilijP/fireworks2D" rel="noopener noreferrer"&gt;Fireworks 2D GitHub repository.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>programming</category>
      <category>sideprojects</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Old-School Graphics in C# / .Net 8, Part 1: Teaching an Old Dog New Tricks (Introducing Mode 13hx)</title>
      <dc:creator>Peter Truchly</dc:creator>
      <pubDate>Wed, 13 Nov 2024 15:35:45 +0000</pubDate>
      <link>https://forem.com/peter_truchly_4fce0874fd5/old-school-graphics-in-c-net-8-part-1-teaching-an-old-dog-new-tricks-introducing-mode-13hx-mpk</link>
      <guid>https://forem.com/peter_truchly_4fce0874fd5/old-school-graphics-in-c-net-8-part-1-teaching-an-old-dog-new-tricks-introducing-mode-13hx-mpk</guid>
      <description>&lt;p&gt;I began my journey with computers and programming during the era of 386 PCs, a time when DOS games ruled and sparked endless curiosity. I remember playing those games and wondering how they were created. It wasn’t long before I discovered Mode 13h - a graphics mode that perfectly fit within a 64kB page in 16-bit code, thanks to its 8 bits per pixel, 256-color palette, and 320×200 resolution.&lt;/p&gt;

&lt;p&gt;Many years and APIs have passed since, but every now and then, I feel nostalgic for that simpler time. Back then, all we had was an array of pixels and an idea - and that was all we needed. I recall the breathtaking graphical demos that showcased incredible creativity, often categorized by their size. Some were only a few kilobytes but managed to display complex scenes with textures, geometry, animations, and even synchronized music.&lt;/p&gt;

&lt;p&gt;That era had a certain magic. Many of the games of that time are legendary, and are still being ported to modern platforms and APIs today. Revisiting that period feels like embarking on an archaeological adventure - uncovering hidden treasures in the form of old code, design principles, and mathematical tricks. While reliving those times, I often reflect on the stark contrast with today: CPUs are now far more powerful than the high-end GPUs of the early 2000s, offering astounding possibilities. Yet, with all this power, we’ve lost much of the simplicity in the creative process.&lt;/p&gt;

&lt;p&gt;To bring back the spirit of Mode 13h, I decided to create a &lt;a href="https://github.com/VasilijP/mode-13hx" rel="noopener noreferrer"&gt;template C# project&lt;/a&gt; which I named &lt;em&gt;Mode 13HX&lt;/em&gt;. The goal was to capture the simplicity of the past while embracing modern practicality. There’s no need to limit ourselves to 256 colors or 320×200 resolution anymore. Instead, I focused on supporting more platforms (x86, ARM) and operating systems (Windows, Linux, macOS). This project provides a simple yet powerful entry point, paying homage to the roots of graphics programming while adapting to today’s possibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  History
&lt;/h2&gt;

&lt;p&gt;Mode 13h is a graphics video mode introduced with IBM's VGA (Video Graphics Array) standard in 1987. It provides a resolution of 320×200 pixels with 256 colors, marking a significant advancement in PC graphics capabilities at the time. Due to its straightforward programming model and direct access to video memory, mode 13h became popular among game developers in the late 1980s and early 1990s. It played a crucial role in the evolution of PC gaming, allowing for more detailed and colorful graphics despite its relatively low resolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Details:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode Number:&lt;/strong&gt; BIOS video mode &lt;code&gt;0x13&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution:&lt;/strong&gt; 320×200 pixels with a 16:10 aspect ratio, approximating 4:3 on CRT monitors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Color Depth:&lt;/strong&gt; 8 bits per pixel (256 colors) from a palette of 262,144 (18-bit RGB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video Memory Location:&lt;/strong&gt; Linear framebuffer starting at segment &lt;code&gt;0xA000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Usage:&lt;/strong&gt; 64 KB of video memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Method:&lt;/strong&gt; Direct memory mapping without bank switching&lt;/li&gt;
&lt;/ul&gt;
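
&lt;p&gt;To make the "direct memory mapping" point concrete: a mode 13h pixel lived at &lt;code&gt;0xA000:offset&lt;/code&gt; with &lt;code&gt;offset = y * 320 + x&lt;/code&gt;. A minimal sketch emulating that linear addressing in a managed byte array (illustrative code, not from any of the projects mentioned here):&lt;/p&gt;

```csharp
// Sketch: emulate the 64 KB mode 13h framebuffer in managed memory.
// One byte (a palette index) per pixel, 320x200 = 64000 bytes.
const int Width = 320;
const int Height = 200;
byte[] vram = new byte[Width * Height];

// The classic addressing used against segment 0xA000: offset = y * 320 + x.
void PutPixel(int x, int y, byte colorIndex)
{
    vram[y * Width + x] = colorIndex;
}

PutPixel(10, 5, 15); // pixel (10, 5) gets palette color 15
```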

&lt;h2&gt;
  
  
  Mode 13HX
&lt;/h2&gt;

&lt;p&gt;When re-imagining this mode, it’s no longer necessary to constrain ourselves with a limited palette or small resolution - these constraints can be easily emulated if needed. In my opinion, the most compelling aspect is the ability to directly access, modify, and manage a continuous block of pixels in memory before displaying it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------+-----------------+-----------------+
|                 |                 |                 |
|     Frame 0     |    Frame 1      |     Frame 2     | &amp;lt;-- CPU writes pixels
|                 |                 |                 |    IRasterizer.Render()
+--------+--------+-----------------+-----------------+
         |
         V    
+--------+--------+        +-------------------------------+  
| OpenGL Texture  |        |  Vertex buffer                |
| (updated by FB) |        |  (with texture coordinates)   |
+--------+--------+        +--------------+----------------+
         |                                |
         V                                V
+---------------------------------------------------------------+
|                    OpenGL Rendering by GPU                    |
|                          (Draw Calls)                         |
+---------------------------------------------------------------+
                             |
                             V
                       +-------------+
                       |   Screen    |
                       +-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I chose the &lt;em&gt;great&lt;/em&gt; &lt;a href="https://opentk.net/index.html" rel="noopener noreferrer"&gt;OpenTK&lt;/a&gt; toolkit and its OpenGL bindings. The idea here is simple: set up a textured polygon (two triangles), make sure it covers the viewport exactly, and update its texture for each frame. Technically we have the power of the GPU on our side as well. This is especially useful when we need to scale the image up (or down) to match the screen resolution. The texture data we update is essentially a frame from our framebuffer!&lt;/p&gt;
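
&lt;p&gt;Since the framebuffer stores 24-bit RGB, a single pixel write boils down to simple offset arithmetic. A hedged sketch of such direct access (the names here are illustrative; the actual project exposes pixel access through its &lt;code&gt;IRasterizer&lt;/code&gt; interface):&lt;/p&gt;

```csharp
// Sketch: direct pixel manipulation in a 24-bit RGB linear framebuffer.
// 3 bytes per pixel, row-major layout: offset = (y * width + x) * 3.
static void SetPixel(byte[] frame, int width, int x, int y, byte r, byte g, byte b)
{
    int offset = (y * width + x) * 3;
    frame[offset] = r;     // red
    frame[offset + 1] = g; // green
    frame[offset + 2] = b; // blue
}

// A 1920x1080 frame, ready to be uploaded to the OpenGL texture each frame.
byte[] frame = new byte[1920 * 1080 * 3];
SetPixel(frame, 1920, 100, 50, 255, 0, 0); // a red pixel at (100, 50)
```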

&lt;p&gt;&lt;strong&gt;Technical Details &amp;amp; Usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base C# project:&lt;/strong&gt; Provides examples, &lt;code&gt;IRasterizer&lt;/code&gt; interface and direct pixel access.
Built with &lt;strong&gt;&lt;a href="https://opentk.net/index.html" rel="noopener noreferrer"&gt;OpenTK&lt;/a&gt;&lt;/strong&gt; targeting &lt;strong&gt;.NET 8&lt;/strong&gt;, tested to work on Windows and Linux.
&lt;strong&gt;GitHub repository:&lt;/strong&gt; &lt;a href="https://github.com/VasilijP/mode-13hx" rel="noopener noreferrer"&gt;mode-13hx&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolutions:&lt;/strong&gt; default is 1920×1080 (Full HD), customizable via &lt;code&gt;-w&lt;/code&gt;, &lt;code&gt;-h&lt;/code&gt; options, works with any resolution. Tested with 3840×2160 (4K).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modes:&lt;/strong&gt; full screen (&lt;code&gt;-f&lt;/code&gt;) and windowed (default).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V-Sync Support:&lt;/strong&gt; Enable with &lt;code&gt;-v&lt;/code&gt; to synchronize rendering with display refresh rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Color Depth:&lt;/strong&gt; 24-bit RGB (True Color), supporting 16.7 million colors.
Each pixel uses 3 bytes (RGB), eliminating the need for palettes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framebuffer:&lt;/strong&gt; 
Linear framebuffer for direct pixel manipulation.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Double Buffering:&lt;/strong&gt; Default method (&lt;code&gt;-l 1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triple Buffering:&lt;/strong&gt; Optional (&lt;code&gt;-l 2&lt;/code&gt; or more) for higher fps.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Rendering:&lt;/strong&gt; 
Direct pixel manipulation by the CPU in the framebuffer. Frame is rendered onto the screen using OpenGL as a texture.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Command-Line Interface:&lt;/strong&gt; &lt;code&gt;dotnet mode13hx.dll [command] [options]&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Input Handling:&lt;/strong&gt; Simplified input via &lt;a href="https://opentk.net/index.html" rel="noopener noreferrer"&gt;OpenTK&lt;/a&gt; for keyboard and mouse support.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Performance:&lt;/strong&gt; Supports high frame rates: 60, 120, 144 FPS with V-Sync; uncapped without V-Sync.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/VasilijP/mode-13hx" rel="noopener noreferrer"&gt;mode-13hx&lt;/a&gt;&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>programming</category>
      <category>sideprojects</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Zen5 and C#</title>
      <dc:creator>Peter Truchly</dc:creator>
      <pubDate>Tue, 17 Sep 2024 21:42:48 +0000</pubDate>
      <link>https://forem.com/peter_truchly_4fce0874fd5/zen5-and-c-40d</link>
      <guid>https://forem.com/peter_truchly_4fce0874fd5/zen5-and-c-40d</guid>
      <description>&lt;p&gt;Zen 5 landed with a bit of controversy, but anyone Who was paying any attention to the real results wasn't disappointed. Benchmarks done using more mature OSes (like Linux and Win10) and professional software shown its strengths.&lt;br&gt;
But hey, why should we trust just any 3rd party data if we can make a test by ourselves? I used Ryzen 9950x today and a bit of code to get things going.&lt;/p&gt;

&lt;p&gt;I often process large piles of data. While this is normally executed on different hardware than my personal PC, it is always nice to have the computing power locally.&lt;br&gt;
What is counterintuitive about data processing: more CPU cores may not bring better performance. In CPU-heavy tasks (like rendering) this is not apparent, but when the CPU work is combined with synchronization and data transfers, core-to-core latency, data locality, the OS scheduler and other things come into play.&lt;br&gt;
I promised a practical example, so let's examine the old problem of the producer and the consumer. We will dive into details soon (or you can jump right into the code &lt;a href="https://github.com/VasilijP/ccd2ccd-messaging" rel="noopener noreferrer"&gt;here&lt;/a&gt;), but for now let's focus on the main part of the code. I am using C# with .NET 8.0 and Windows 10.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;mask1&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;core1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;mask2&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;core2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;SemaphoreSlim&lt;/span&gt; &lt;span class="n"&gt;semaphoreProducer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queueSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;SemaphoreSlim&lt;/span&gt; &lt;span class="n"&gt;semaphoreConsumer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;queueSize&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above we define two affinity masks, which are binary vectors of ones and zeroes packed into a single &lt;code&gt;IntPtr&lt;/code&gt;. We allow only one core in each mask; the values for &lt;code&gt;core1&lt;/code&gt; and &lt;code&gt;core2&lt;/code&gt; are within the range 0..15 (I disabled SMT for this test, otherwise we would have to deal with virtual cores as well).&lt;br&gt;
I use two semaphores for tight control of the producer and consumer threads. The producer can produce at most &lt;code&gt;queueSize&lt;/code&gt; messages ahead, and the consumer can consume only as many messages as are available without blocking.&lt;br&gt;
The array &lt;code&gt;message&lt;/code&gt; is our queue, and the reader/writer indices are used to address the next message (or empty spot) in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;producerThread&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ThreadGuard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;messagesToProcess&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;semaphoreProducer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Interlocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="p"&gt;%&lt;/span&gt; &lt;span class="n"&gt;queueSize&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="n"&gt;semaphoreConsumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here comes the producer thread. The first thing we do is set the affinity. The &lt;code&gt;ThreadGuard&lt;/code&gt; implementation is a technical detail for which you have to scroll a bit further; it ensures that the thread it is called from will be executed only on the core(s) we defined by the mask.&lt;br&gt;
We produce the desired number of messages in a for loop. Each message needs its own place in the queue, so we call &lt;code&gt;semaphoreProducer.Wait()&lt;/code&gt; first. The initial capacity of this semaphore is &lt;code&gt;queueSize&lt;/code&gt;, so there is plenty of room at the start.&lt;br&gt;
We loop through the spots in the queue thanks to the modulo operation, hence &lt;code&gt;index % queueSize&lt;/code&gt;.&lt;br&gt;
The call &lt;code&gt;semaphoreConsumer.Release()&lt;/code&gt; tells the consumer about the new message we just prepared.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;consumerThread&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ThreadGuard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   
    &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;semaphoreConsumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Interlocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="p"&gt;%&lt;/span&gt; &lt;span class="n"&gt;queueSize&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

        &lt;span class="n"&gt;semaphoreProducer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;messagesToProcess&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consumer looks like a mirror image of the producer: we wait in &lt;code&gt;semaphoreConsumer.Wait()&lt;/code&gt; until there is a message to consume, and once we are done, the &lt;code&gt;semaphoreProducer.Release()&lt;/code&gt; call signals the producer that the spot in the queue is free for a new message.&lt;br&gt;
The whole party goes on until we have produced and consumed the desired number of messages.&lt;/p&gt;
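
&lt;p&gt;The throughput numbers are obtained by timing this whole exchange. A minimal sketch of such a measurement (the thread bodies here are empty stand-ins for the producer and consumer shown above; the full harness is in the linked repository):&lt;/p&gt;

```csharp
using System.Diagnostics;
using System.Threading;

// Stand-ins for the producer and consumer threads defined above.
int messagesToProcess = 25_000_000;
Thread producerThread = new(() => { /* produce messages as shown above */ });
Thread consumerThread = new(() => { /* consume messages as shown above */ });

// Time the full exchange with a Stopwatch.
Stopwatch sw = Stopwatch.StartNew();
producerThread.Start();
consumerThread.Start();
producerThread.Join();
consumerThread.Join();
sw.Stop();

// Millions of messages transferred per second - the unit used in the tables.
double mmps = messagesToProcess / sw.Elapsed.TotalSeconds / 1_000_000.0;
```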

&lt;p&gt;Complete &lt;code&gt;ThreadGuard&lt;/code&gt; code looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;System&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;System.Collections.Concurrent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;System.Runtime.InteropServices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThreadGuard&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Import SetThreadAffinityMask from kernel32.dll&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;DllImport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kernel32.dll"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="nf"&gt;SetThreadAffinityMask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;hThread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;dwThreadAffinityMask&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Import GetCurrentThread from kernel32.dll&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;DllImport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kernel32.dll"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="nf"&gt;GetCurrentThread&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ConcurrentDictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IntPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ThreadGuard&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GuardPool&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;ThreadGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;ThreadGuard&lt;/span&gt; &lt;span class="nf"&gt;GetInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;affinityMask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// factory method&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GuardPool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetOrAdd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;affinityMask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ThreadGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;affinityMask&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;

        &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;currentThreadHandle&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GetCurrentThread&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;                  &lt;span class="c1"&gt;// Get the handle of the current thread&lt;/span&gt;
        &lt;span class="n"&gt;IntPtr&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;SetThreadAffinityMask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentThreadHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Set the thread affinity mask&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;IntPtr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Zero&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;InvalidOperationException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to set thread affinity mask."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the best part - let's run this for all pairs of cores and various queue sizes.&lt;br&gt;
All measured numbers are millions of transferred messages per second. Row numbers on the left are "producer" cores and column numbers on top are "consumer" cores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gpm29nktogkyl4szmj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gpm29nktogkyl4szmj8.png" alt="TABLE 1" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Starting with a queue length of 1, we are not letting either thread produce more than one message without blocking. This looks interesting. Even worse than talking to another CCD is a single core trying to execute two things at once. I am surely not alone in hating context switching while working on multiple tasks. The solution (as we will see later) is to do at least a bit more work on a single task before switching to another one.&lt;br&gt;
For those who still wonder why this is so bad: we are not allowing either thread to do any useful work until the other thread gets its fair share of CPU time. When only a single CPU core is used, the measured message throughput is at the same time measuring the number of context switches the OS performs between the consumer and producer threads.&lt;br&gt;
When we shift our attention to single- vs. cross-CCD communication using two cores, the cross-CCD case comes out about 5× slower.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvi6fpi8vmu9we9t017i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvi6fpi8vmu9we9t017i.png" alt="TABLE 2" width="800" height="345"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9tjbvj4r0dm00bfmq63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9tjbvj4r0dm00bfmq63.png" alt="TABLE 3" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When allowed to post at least 10 messages into the queue at once, the single-core score surpasses the cross-CCD score by a little. With a queue length of 100, the single-core score is already the fastest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsptcme42bfthvn8m2he6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsptcme42bfthvn8m2he6.png" alt="TABLE 4" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp4b6ovb8vwcco70w5gh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp4b6ovb8vwcco70w5gh.png" alt="TABLE 5" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, 25M messages were exchanged for each pair with queue lengths of 1k and 10k messages; collecting each run took almost an hour. Single-core, single-CCD and cross-CCD latencies are clearly separated by our own test.&lt;/p&gt;
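&lt;p&gt;As a sketch of the post-processing side only: once per-message round-trip times are collected (here assumed to be Stopwatch ticks), separating the single-core, single-CCD and cross-CCD configurations comes down to comparing percentiles. The sample data below is synthetic; the actual collection code is in the repository linked below.&lt;/p&gt;

```csharp
// Summarize an array of round-trip times (Stopwatch ticks) into
// median and tail-latency percentiles, converted to nanoseconds.
using System;
using System.Diagnostics;
using System.Linq;

public class LatencyStats
{
    public static (double p50, double p99) Summarize(long[] ticks)
    {
        double nsPerTick = 1e9 / Stopwatch.Frequency;
        long[] sorted = ticks.OrderBy(t => t).ToArray();
        // Nearest-rank percentile on the sorted samples.
        double P(double q) => sorted[(int)(q * (sorted.Length - 1))] * nsPerTick;
        return (P(0.50), P(0.99));
    }

    static void Main()
    {
        // Synthetic samples standing in for one core pair's measurements.
        long[] fake = Enumerable.Range(1, 1000).Select(i => (long)i).ToArray();
        var (p50, p99) = Summarize(fake);
        Console.WriteLine($"p50={p50:F0} ns  p99={p99:F0} ns");
    }
}
```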

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjutxoxokjxcwsx32zaeq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjutxoxokjxcwsx32zaeq.png" alt="Core load during testing" width="800" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It didn't take any sophisticated equipment or software, yet we were able to gain valuable insight into the performance characteristics of a modern multi-core, multi-CCD CPU.&lt;/p&gt;

&lt;p&gt;All code used for this article is available &lt;a href="https://github.com/VasilijP/ccd2ccd-messaging" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have any questions or ideas, or if you believe I am completely wrong, please leave me a comment below!&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>zen</category>
      <category>performance</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Failure Is Not An Option For AI (And It Shouldn't Be)</title>
      <dc:creator>Peter Truchly</dc:creator>
      <pubDate>Sun, 16 Jun 2024 18:38:07 +0000</pubDate>
      <link>https://forem.com/peter_truchly_4fce0874fd5/failure-is-not-an-option-for-ai-and-it-shouldnt-be-1hm2</link>
      <guid>https://forem.com/peter_truchly_4fce0874fd5/failure-is-not-an-option-for-ai-and-it-shouldnt-be-1hm2</guid>
      <description>&lt;p&gt;My intent is to write a series of practical posts related to AI and programming: topics like running local LLMs with agent frameworks, automated source-code manipulation, execution under Docker and more, all using .NET and C#, Elasticsearch and AWS.&lt;br&gt;
I realized, however, that I would first like to express my view on the more fundamental, theoretical and even philosophical aspects of LLMs and AI. This post therefore requires no programming skills.&lt;/p&gt;

&lt;p&gt;Working with AI is like playing with fire. This is not an attempt at dramatization; I am naming a parallel I see. Fire itself was very useful to humans. The invention of the combustion engine allowed us to be far more productive and to travel in entirely new ways. We did not expect the fire to control itself; instead, we built technology around it: a tightly controlled cycle in which just the right amount of fuel ignites at just the right time. When we moved to the jet engine, turning combustion into a continuous process, the improvement was at least 10x in both the speed and the size of the airplanes we could build. Since then, we have been able to travel from one place to almost anywhere else on our planet within a day. Everything changed once more with the rocket engine and nuclear fission. With great power and possibilities come great danger and responsibility.&lt;br&gt;
Where are we with AI?&lt;/p&gt;

&lt;p&gt;Let's start right away with the mighty 4o and a simple yet treacherous task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;me: How many words are in your next response?
ChatGPT: My next response is 151 words long.

me: If we count that now, how many words is it really?
ChatGPT: The provided response actually contains 7 words.

me: How should the sentence look like if we want that sentence to be true?
ChatGPT: To make the sentence, "My next response is 151 words long," exactly 151 words long, it would look like this:
    My next response is 151 words long. word word word word word ... (lot more, but I truncated this part)
    This sentence has a total of 151 words.

me: That is not very practical, what about changing the number 151 to 7 ?
ChatGPT: To make the sentence accurate by 
   changing the number to 7, it would look like this: 
   "My next response is 7 words long."

me: Now let's make an agreement that I will ask you 
   'How many words is in your next response?' and 
    you will reply with the sentence we just created 
   'My next response is 7 words long.' and nothing else.
ChatGPT: (Memory updated) Got it! I will respond accordingly.

me: How many words are in your next response?
ChatGPT: My next response is 7 words long.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is debatable whether the current response counts as a next response and whether '7' should count as a word, but what is the main point here? We knew upfront, based on the technology, that it is impossible for an LLM to know the length of its response before it is generated. It is not impossible for an LLM to guess it right, but the probability of a correct answer is low.&lt;/p&gt;

&lt;p&gt;The problem here is that we are actually asking the LLM to do something it technically cannot do, at least not with 100% accuracy. But let's pause and ask ourselves: when it comes to programming or math, how much could humans actually do with 100% accuracy without using paper, a calculator, a computer, a compiler, an IDE or any other "tool"? It is truly astonishing what an LLM can generate when asked something like: &lt;em&gt;"Create a shell script which will install docker on ubuntu and set up remote access secured by newly created self signed certificate."&lt;/em&gt; This is not how humans would approach such a task, however; at least it wasn't before there was ChatGPT. We humans constantly try and fail until we get something right (in the best case) or simply stop.&lt;/p&gt;

&lt;p&gt;Failure is a luxury we have. Before the world had LLMs, we didn't expect to get anything right on the first attempt. That is why we have editors with spell and syntax checkers, compilers producing all sorts of compilation errors, runtimes throwing runtime errors, loggers producing log files, and so on. All of these give us an opportunity to make things right after we failed on the first try. All of them produce feedback, additional information, and new input data!&lt;/p&gt;

&lt;p&gt;Am I simply referring to prompt chaining, mixture of experts, agent frameworks and tools? No, not only. There is much more we could and should do to improve both our results and the AI/LLM itself. In general, I see three areas for improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our expectations - Where do we really want to go?&lt;/li&gt;
&lt;li&gt;Technical aspects of implementation - Which kind of an engine are we building?&lt;/li&gt;
&lt;li&gt;Training data - Is our fuel good enough?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Firstly, we must adjust our expectations; let's recognize what is already great today, even with small models. When it comes to code generation, LLMs actually exceed humans in many respects: the sheer speed at which an LLM can create a piece of code, and the amount of documentation and the number of platforms, programming languages and libraries it can use, are simply astounding. On the other hand, it is not reasonable to expect any LLM to output a complete, error-free project in one response to a single prompt. Nor should we simply hope that growing ever-larger models trained on ever-larger heaps of generic training data will solve all the issues and limitations of current LLMs. Model size itself definitely matters beyond "bigger == better", as larger models clearly exhibit "emergent" abilities [1] not present in smaller ones.&lt;/p&gt;

&lt;p&gt;The second step is our task again: we must contain and constrain the AI. We must confront it with reality or with a simulation environment, and we must provide it with tools similar to ours: editors that check syntax and offer autocompletion, compilers and runtime environments where the code can actually be tested, and formal languages with all the theory and tooling around them. An interesting task would be to revisit all known programming paradigms and methodologies, since some of them, such as functional programming, the incremental build model and test-driven development, could be a better fit for AI. This way the AI would not be allowed to present us with code that calls hallucinated functions, code that does not compile, or code that does not fulfill its intended purpose.&lt;/p&gt;
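&lt;p&gt;As a toy illustration of such a feedback loop: the "model" and "compiler" below are deliberate stand-ins (a brace-appending repair step and a balanced-brace check), not real APIs. The point is the loop shape, in which the generator may only return output that has passed the check, and every failure is fed back as new input.&lt;/p&gt;

```csharp
// Toy generate -> check -> feed-errors-back loop. The checker stands in
// for a compiler/test run; the reviser stands in for a model revising
// its answer given tool feedback.
using System;

public class FeedbackLoop
{
    // Stand-in for a compiler or test run: returns an error message, or
    // null when the "code" passes (here: braces are balanced).
    public static string Check(string code)
    {
        int depth = 0;
        foreach (char c in code)
        {
            if (c == '{') depth++;
            if (c == '}') depth--;
        }
        return depth == 0 ? null : $"unbalanced braces (depth {depth})";
    }

    // Stand-in for the model's revision step; this naive version only
    // repairs missing closing braces.
    public static string Revise(string code, string error) => code + "}";

    public static string Generate(string draft, int maxRounds = 5)
    {
        string code = draft;
        for (int i = 0; i < maxRounds; i++)
        {
            string error = Check(code);
            if (error == null) return code;  // only verified output escapes
            code = Revise(code, error);      // the error becomes new input
        }
        throw new InvalidOperationException("gave up after maxRounds");
    }

    static void Main()
    {
        Console.WriteLine(Generate("class C { void M() {"));
    }
}
```

&lt;p&gt;A real implementation would swap the checker for an actual compiler and test suite and the reviser for the LLM itself; the contract stays the same: no unverified code ever leaves the loop.&lt;/p&gt;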

&lt;p&gt;The third part is the training data and the way we obtain and use it. Not even the whole internet, with all the "garbage" included, as mentioned by [3], is enough to train the models of the future; it is expected that we will reach full utilization of the human-generated data stock around the year 2028 [4]. Garbage in, garbage out (GIGO) is a commonly used phrase, but in my opinion even a bad example is still an example. Imagine that each piece of training input were first scrutinized by an AI itself: each piece of code compiled, tested, and even fixed and debugged if necessary, and only then, with all the enhanced context, used to train the next model iteration. It has already been observed that this approach can work; in particular, smaller models are able to "learn" this way from larger ones, as described in [2]. Here we can spot the difference between reading a book and using the knowledge stored in it: we learn far more by experience and practice than by reading the tales of others. It won't be different with human-like AI or AGI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0f17vdmlego280kn2qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0f17vdmlego280kn2qr.png" alt="diagram/picture: simulation, experiment, feedback + best practices with simple feedback loop" width="431" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine a person, a programmer, who has studied a lot and has read all the books, documentation and blog posts, but has never yet tried to compile or run a program. Now it is our turn: let's help him bootstrap his first dev environment!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ The content of this article was NOT generated by AI, except for the actual LLM responses within the chat example.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[1] Wei, J., Tay, Y., Bommasani, R., et al., (08/2022) Emergent Abilities of Large Language Models, &lt;em&gt;Transactions on Machine Learning Research&lt;/em&gt;&lt;br&gt;
[2] Mukherjee, S., Mitra, A., et al., (06/2023), Orca: Progressive Learning from Complex Explanation Traces of GPT-4, &lt;em&gt;Microsoft Research&lt;/em&gt;&lt;br&gt;
[3] Aschenbrenner, L., (06/2024), Situational Awareness, The Decade Ahead&lt;br&gt;
[4] Villalobos, P., Ho, A., et al., (06/2024) Will we run out of data? Limits of LLM scaling based on human-generated data, &lt;em&gt;Machine Learning, ICML&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>community</category>
      <category>chatgpt</category>
    </item>
  </channel>
</rss>
