<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Veith Röthlingshöfer</title>
    <description>The latest articles on Forem by Veith Röthlingshöfer (@veith-octomind).</description>
    <link>https://forem.com/veith-octomind</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1238586%2Ffc989e41-1187-4916-954b-38cab37bc2f7.png</url>
      <title>Forem: Veith Röthlingshöfer</title>
      <link>https://forem.com/veith-octomind</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/veith-octomind"/>
    <language>en</language>
    <item>
      <title>Why agents DO NOT write most of our code - a reality check</title>
      <dc:creator>Veith Röthlingshöfer</dc:creator>
      <pubDate>Mon, 03 Nov 2025 09:38:49 +0000</pubDate>
      <link>https://forem.com/veith-octomind/why-agents-do-not-write-most-of-our-code-a-reality-check-87j</link>
      <guid>https://forem.com/veith-octomind/why-agents-do-not-write-most-of-our-code-a-reality-check-87j</guid>
      <description>&lt;p&gt;At Octomind, we build AI agents - but our code is still mostly written by humans. We love LLMs and use them everywhere we can, from our product to our internal workflows. But despite all the hype, we’re nowhere near “agents writing most of our code.”&lt;/p&gt;

&lt;p&gt;We have good reasons for not jumping on the bandwagon of companies like &lt;a href="https://youtu.be/zDmW5hJPsvQ?t=1100" rel="noopener noreferrer"&gt;Anthropic (generating 80%)&lt;/a&gt;, &lt;a href="https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-as-30percent-of-microsoft-code-is-written-by-ai.html" rel="noopener noreferrer"&gt;Microsoft (30%)&lt;/a&gt; or &lt;a href="https://www.forbes.com/sites/jackkelly/2024/11/01/ai-code-and-the-future-of-software-engineers/" rel="noopener noreferrer"&gt;Google (25%)&lt;/a&gt; just yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhtp5yj7kd868l4myn5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhtp5yj7kd868l4myn5m.png" alt="a linkedin screenshot" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One of many - source: &lt;a href="https://www.linkedin.com/posts/rhunterharris_your-best-developers-should-be-leading-not-activity-7379856503962497025-Wv9A/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A few vital things are “not there” yet. Here’s why it matters - and what it takes to actually close the gap.&lt;/p&gt;

&lt;h2&gt;Experimenting with coding agents in day-to-day coding&lt;/h2&gt;

&lt;p&gt;We’ve been using &lt;strong&gt;Cursor&lt;/strong&gt;, &lt;strong&gt;Claude Code&lt;/strong&gt;, and &lt;strong&gt;Windsurf&lt;/strong&gt; for months, but none of us can honestly say they’ve boosted our productivity in a meaningful way (say, 20% or more). Sure, tab completions are often solid, and I’ve had some success getting them to generate unit tests - especially when there are existing tests to copy from (like new routes).&lt;/p&gt;

&lt;p&gt;Still, that’s nowhere near the 80%+ efficiency gains others claim. So, driven by equal parts FOMO and curiosity, my colleague Fabio and I decided to spend the past week implementing a roadmap feature &lt;em&gt;entirely&lt;/em&gt; with AI.&lt;/p&gt;

&lt;p&gt;Before diving in, we combed through the documentation for our preferred tools to make sure we weren’t missing anything useful. We also updated our Cursor rules and &lt;code&gt;CLAUDE.md&lt;/code&gt; file to inject updated knowledge about our product and our dev workflow, enabled &lt;strong&gt;BugBot&lt;/strong&gt; for AI code reviews, and went to work.&lt;/p&gt;

&lt;h2&gt;The feature we tried to build (with AI)&lt;/h2&gt;

&lt;p&gt;At Octomind, we build an agent-powered end-to-end testing platform. Our tests aren’t tied to branches - they live centrally in our system, which doesn’t support branch-specific versions of test cases. That works fine until you start using branch deployments.&lt;/p&gt;

&lt;p&gt;Picture a SaaS app with three tests: login, create post, edit post. The app under test is developed with branch deployments for each pull request (PR). Now imagine a branch that changes the login flow - say, it adds 2FA. The existing login test (which only checks username + password) will now fail, blocking the pipeline for that PR.&lt;/p&gt;


&lt;p&gt;At the moment, you’ve got two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Remove the failing test so it doesn’t block unrelated PRs, fix it manually (or via AI) to handle the new flow, merge, then re-enable it.&lt;/li&gt;
&lt;li&gt;Update the test directly and merge your PR - but now every other dev’s pipeline breaks until you’re done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither is great. One blocks others; the other breaks trust in your merge.&lt;/p&gt;


&lt;p&gt;To combat this, we wanted to extend the concept of branches to our tests. When a branch is created, you can spawn a branch-specific copy of a test. That copy runs only for that branch and can be edited freely. When the branch merges, the copy becomes the new default.&lt;/p&gt;
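&lt;p&gt;A minimal sketch of that lifecycle, with hypothetical names (&lt;code&gt;forkForBranch&lt;/code&gt;, &lt;code&gt;promoteOnMerge&lt;/code&gt;) rather than our actual data model:&lt;/p&gt;

```typescript
// Hypothetical sketch of the branch-copy lifecycle described above.
// TestCase, forkForBranch and promoteOnMerge are illustrative names,
// not Octomind's actual data model.
interface TestCase {
  id: string;
  name: string;
  branch: string | null; // null = the default version
  steps: string[];
}

// Spawn a branch-specific copy that can be edited freely.
function forkForBranch(defaultTest: TestCase, branch: string): TestCase {
  return {
    ...defaultTest,
    id: `${defaultTest.id}@${branch}`,
    branch,
    steps: [...defaultTest.steps],
  };
}

// When the branch merges, its copies replace the matching defaults.
function promoteOnMerge(tests: TestCase[], branch: string): TestCase[] {
  const copies = tests.filter((t) => t.branch === branch);
  const replaced = new Set(copies.map((t) => t.name));
  return [
    ...tests.filter((t) => t.branch === null && !replaced.has(t.name)),
    ...copies.map((t) => ({ ...t, branch: null })),
  ];
}
```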

&lt;p&gt;We figured this feature should be doable in about a week with two developers.&lt;/p&gt;


&lt;h2&gt;First try: Running wild&lt;/h2&gt;

&lt;p&gt;As the first iteration, we let the agents roam. We did not expect this to work perfectly, but we wanted to see where it’s at.&lt;/p&gt;

&lt;p&gt;We’ve got a decent-sized monorepo, so “just dump everything into context” isn’t an option. We take testing seriously and have guardrails the AI can use to check its own output. &lt;/p&gt;

&lt;p&gt;So I wrote a detailed brief and attached the files it needed to the context. This wasn’t a ‘tiny prompt performs miracles’ situation - I iterated the prompt until it was as specific as possible. Within ~5 minutes, the agent produced a plan with 11 reasonable TODOs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb67j1iy76z5p1b5oc4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb67j1iy76z5p1b5oc4s.png" alt="cursor agent produced a coding plan" width="703" height="402"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cursor agent produced a coding plan&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We hit &lt;code&gt;run&lt;/code&gt; and that’s where it went sideways. The agent started cranking out lines, but stumbled on basics any dev would breeze through - like regenerating the Prisma client after a database schema change (yes, the Cursor rules spelled it out). &lt;/p&gt;

&lt;p&gt;I went back and forth with it several times for clarification. It reported success with the message: “The feature should now work correctly! The fork button should be functional, and you'll be able to see forked test cases using the new filter. 🎉”  while:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not checking off all the TODOs&lt;/li&gt;
&lt;li&gt;Producing nothing that worked on our dev server (which it had access to via computer use)&lt;/li&gt;
&lt;li&gt;Ignoring basic coding guidelines we explicitly listed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An incomplete list of misses, none of which would trip up a human dev:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built a React component for new buttons… and never wired it into existing components&lt;/li&gt;
&lt;li&gt;Skipped our standard logging library&lt;/li&gt;
&lt;li&gt;Used very inefficient database queries (it made an extra request for every ID that is joined anywhere)&lt;/li&gt;
&lt;li&gt;Ignored our naming and structure conventions&lt;/li&gt;
&lt;li&gt;Added two new external libs for trivial stuff we already have&lt;/li&gt;
&lt;/ul&gt;
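&lt;p&gt;To illustrate the query miss: the pattern below fires one round-trip per joined ID (the classic “N+1” problem), where a single batched lookup - e.g. Prisma’s &lt;code&gt;findMany&lt;/code&gt; with an &lt;code&gt;in&lt;/code&gt; filter - would do. The &lt;code&gt;db&lt;/code&gt; map is a stand-in for a real database client:&lt;/p&gt;

```typescript
// Illustration of the N+1 pattern versus a single batched lookup.
// `db`, `findById` and `findManyByIds` are stand-ins for a real database
// client; with Prisma the fix is one findMany({ where: { id: { in: ids } } }).
const db = new Map<number, string>([
  [1, "alice"],
  [2, "bob"],
  [3, "carol"],
]);
let roundTrips = 0;

// N+1 style: one round-trip per joined ID.
function findById(id: number): string | undefined {
  roundTrips += 1;
  return db.get(id);
}

// Batched style: one round-trip for the whole set of IDs.
function findManyByIds(ids: number[]): (string | undefined)[] {
  roundTrips += 1;
  return ids.map((id) => db.get(id));
}

const slow = [1, 2, 3].map(findById); // 3 round-trips
const fast = findManyByIds([1, 2, 3]); // 1 round-trip
```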

&lt;p&gt;and I am not even talking about bugs here. These are just things that immediately stick out and would not have happened to a developer. - &lt;/p&gt;

&lt;p&gt;Yes, we tried multiple rounds of “still doesn’t work - you forgot to actually use the new button” and similar nudges. The result: a 2,000-line PR that needs review and rework almost everywhere.&lt;/p&gt;


&lt;h2&gt;Take two: Smaller, incremental changes&lt;/h2&gt;

&lt;p&gt;I decided to start over. We never expected this to just work - these agents are supposedly better at smaller features anyway. So I stepped back and thought through how I’d actually build it myself, step by step. This kind of ideation is where &lt;em&gt;I genuinely do like using an LLM&lt;/em&gt; - not to code, but to bounce around approaches in planning mode.&lt;/p&gt;

&lt;p&gt;Then I had it make a plan again, this time for just the first piece: &lt;strong&gt;loading the correct test from the database given an execution URL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It produced another long, seemingly sensible plan. I let it run - expecting to step in more often this time. After burning through several contexts’ worth of chat, I had another pull request.&lt;/p&gt;

&lt;p&gt;About 1,200 lines of code, just for this one part. Does it work? Well… it typechecks. Beyond that, who knows.&lt;/p&gt;

&lt;p&gt;In the spirit of the experiment, I handed the PR to BugBot for review. It left four comments - one of them pointing out that transaction handling was broken. That’s not great for data consistency, so I fed the comment back into the chat.&lt;/p&gt;


&lt;p&gt;The AI’s response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Of course, massive oversight on my part, I apologize. Let me fix it: …”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Apology accepted, I guess. It did patch a few places, and when I asked if it saw any more issues, it replied:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommendation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix Transaction Consistency: Ensure all database calls within getFullTestCasesWithPrerequisites use the same transaction client. (and … 7 more things)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;All of this was in a single file. The only context it needed was the database schema - literally the thing it was just working on.&lt;/p&gt;

&lt;p&gt;It also has a habit of making life easy for itself: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg4ne67v845ezl2gj9fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg4ne67v845ezl2gj9fx.png" alt="a cursor coding agent response" width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’ll check for linter errors, but only by running &lt;code&gt;head -30&lt;/code&gt; and some regex filters, so it convinces itself everything’s fine.&lt;/p&gt;

&lt;p&gt;Confidently marking half-done work as finished, apologizing for “massive oversights,” fixing things only to break others (look up the German word &lt;a href="https://www.reddit.com/r/German/comments/6pp1nb/word_of_the_day_verschlimmbessern/" rel="noopener noreferrer"&gt;verschlimmbessern&lt;/a&gt;), and completely ignoring our existing designs and UX is not even the worst part.&lt;/p&gt;


&lt;h2&gt;The issues that really matter&lt;/h2&gt;

&lt;h3&gt;1. Loss of mental model&lt;/h3&gt;

&lt;p&gt;Let’s say the agent can now ship medium-complexity features with little help. And let’s even assume we’ve fixed the “wait 3 minutes, review 1000 lines of output” problem by turning developers into orchestrators of agents instead of coders. That’s the dream many LinkedIn posts are selling.&lt;/p&gt;

&lt;p&gt;Even then, a huge issue remains: &lt;strong&gt;I lose my mental model of the codebase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Right now, I know how changing one part affects another, where bugs tend to hide, and how the data model behaves. That intuition disappears when the AI keeps dropping thousand-line PRs that might even get auto-merged. When a teammate does that, I can trust they made thoughtful trade-offs, and I’ll pick up context while reviewing or building on top of it. With AI, that learning loop is gone.&lt;/p&gt;

&lt;p&gt;So when a tough bug or edge-case feature comes up - the kind the AI still can’t handle - it feels like I’m seeing the codebase for the first time. Maybe I skimmed a few reviews of features the AI handled by itself (assuming I didn’t delegate even that to auto-review tools like CodeRabbit), but that’s nowhere near the understanding that comes from actually interacting with the code yourself.&lt;/p&gt;

&lt;p&gt;Until I can trust the AI completely, I need to keep my own mental model alive. Otherwise, &lt;strong&gt;every time I need to do something myself feels like joining a new company&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;2. Absence of self-reflection&lt;/h3&gt;

&lt;p&gt;AI is currently bad at judging its own performance. Constant replies like these are only annoying if you let them get to you, but boy, is it hard not to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyadpkrfc61ms1omr5wz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyadpkrfc61ms1omr5wz.png" alt="a cursor coding agent response" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo0jn1lsehc7jacsciyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo0jn1lsehc7jacsciyh.png" alt="a cursor coding agent response" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I asked it to assess its own ability before implementing the feature, and it responded with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MODERATE CONFIDENCE&lt;/strong&gt; - I can implement this correctly, but with some caveats:&lt;br&gt;
&lt;strong&gt;Overall Assessment:&lt;/strong&gt; I can implement this task correctly, but it would require careful attention to detail, thorough testing, and potentially some clarification on the missing methods and business logic. The complexity is manageable but not trivial.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds like a human engineer’s self-assessment, because the model is parroting human training data. The problem is that it shouldn’t use training data written by humans (which, until recently, was nearly all of it) to judge its own ability - because it is not a human.&lt;/p&gt;

&lt;p&gt;And that’s the core issue: the model has no concept of its limits. You only find out whether it can do the task by letting it try. An intern can say, “I’ve never done this.” An LLM almost never will.&lt;/p&gt;

&lt;p&gt;Worse, on our smaller follow-up task it graded itself even higher:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;COMPLEXITY: HIGH&lt;/strong&gt; - This is a substantial refactoring&lt;br&gt;
&lt;strong&gt;MY ABILITY: HIGH CONFIDENCE&lt;/strong&gt; - I can definitely implement this correctly because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The plan is very detailed and specific about what needs to be changed&lt;/li&gt;
&lt;li&gt;I understand the current architecture and data flow&lt;/li&gt;
&lt;li&gt;The changes follow established patterns in the codebase&lt;/li&gt;
&lt;li&gt;The implementation steps are clearly outlined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The task is definitely implementable and I have high confidence I can complete it correctly - it's just a matter of following the detailed plan step by step and implementing all the missing pieces.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The good parts of coding agents&lt;/h2&gt;

&lt;p&gt;AI absolutely has a place in a developer’s toolbox. I use ChatGPT or Cursor’s Ask mode daily - to brainstorm, debug, or get unstuck on smaller problems. Tab completions? They’re right about 80% of the time, which is good enough to keep them on. I even let AI handle things like writing unit tests for clean interfaces or refactoring small snippets. Wrapping a loop in &lt;code&gt;Promise.allSettled&lt;/code&gt; is boring for me - but trivial and instant for AI. It’s also great at recreating well-known patterns from scratch - like traversing a tree structure.&lt;/p&gt;
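&lt;p&gt;For illustration, here’s the kind of wrap I mean - a hypothetical &lt;code&gt;fetchStatus&lt;/code&gt; task handed to &lt;code&gt;Promise.allSettled&lt;/code&gt; so one failure doesn’t abort the batch:&lt;/p&gt;

```typescript
// A minimal sketch of handing a loop over to Promise.allSettled, assuming a
// hypothetical async task fetchStatus; unlike Promise.all, one rejection
// doesn't abort the whole batch.
async function fetchStatus(id: number): Promise<number> {
  if (id % 2 === 0) throw new Error(`id ${id} failed`);
  return id * 10;
}

async function runBatch(ids: number[]): Promise<{ ok: number[]; failed: number[] }> {
  // allSettled resolves once every promise has either fulfilled or rejected.
  const results = await Promise.allSettled(ids.map(fetchStatus));
  const ok: number[] = [];
  const failed: number[] = [];
  results.forEach((r, i) => {
    if (r.status === "fulfilled") ok.push(r.value);
    else failed.push(ids[i]);
  });
  return { ok, failed };
}
```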

&lt;p&gt;And for non-technical users, automation powered by AI can be a massive unlock. That’s literally what we work on at Octomind: automating a technical task, but within well-defined boundaries using specialized agents. They don’t write the whole codebase; they handle narrow, observable parts of it where output constraints keep them in check.&lt;/p&gt;

&lt;p&gt;Other focused tools can deliver similar value. And sure, maybe one day AI will truly handle everything it’s being credited with today (whether that’s LLMs or something beyond them).&lt;/p&gt;

&lt;p&gt;But we’re not there yet - and more people are starting to admit it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Veith Röthlingshöfer&lt;/strong&gt;&lt;br&gt;
AI engineer at Octomind&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>Stop automating manual test cases</title>
      <dc:creator>Veith Röthlingshöfer</dc:creator>
      <pubDate>Thu, 16 Jan 2025 16:56:02 +0000</pubDate>
      <link>https://forem.com/veith-octomind/stop-automating-manual-test-cases-kjj</link>
      <guid>https://forem.com/veith-octomind/stop-automating-manual-test-cases-kjj</guid>
      <description>&lt;p&gt;Those that have been watching the testing landscape for a while might remember the craze that occurred when test automation started going mainstream. Will testers lose their jobs? What happens when we automate everything?&lt;/p&gt;

&lt;p&gt;Looking back, those concerns sound almost funny. Years of test automation have shown that despite significant speed improvements, companies still combine manual QAs with test automation engineers to help ship high-quality products.&lt;/p&gt;

&lt;p&gt;The recent &lt;strong&gt;'shifting left'&lt;/strong&gt; trend - pushing testing and quality processes to the earlier stages of development - has increased the focus on test automation. Many companies insist on transforming former manual QA teams into test automation teams and equipping everyone with automation skills. The goal is to automate everything that was previously done manually. While there’s a good argument for broadening the technical skills of QA teams, &lt;em&gt;there’s definitely more to test automation than simply creating scripts to automate test cases.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Automating test cases&lt;/h2&gt;

&lt;p&gt;Let me state it clearly. Automating manual test cases 1:1 is a bad idea.&lt;/p&gt;

&lt;p&gt;This idea usually stems from the desire to keep the same test coverage as when tests are performed manually. Simply put, the idea is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One test case = one test script&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this does not match the reality of what test automation really is. Manual checks are not only a series of steps but a series of qualitative assessments: the visual look of the application under test, exploratory side quests, questions back to developers, and little experiments. Any tester worth their salt will not mindlessly perform a series of test steps and report back only when they’re unable to perform the next one. That’s what makes manual testing more valuable than a test script. That said, there is much value in test automation that is worth pursuing.&lt;/p&gt;

&lt;p&gt;Test automation makes QA much faster and also more reliable. You can run automated tests at any time and more frequently. Since they are automated, you can also include the less important tests and this way gain more confidence. It’s also a lot cheaper and can run outside working hours.&lt;/p&gt;

&lt;p&gt;It requires a different approach than simply scripting scenarios though. &lt;/p&gt;

&lt;h2&gt;How to automate tests&lt;/h2&gt;

&lt;p&gt;Whenever a tester runs a manual check, they log in and have a broad goal of what needs to be done. There are no broad goals in test automation. A testing script needs a precise goal, a precise expected result, and a precise set of steps. Without those, an end-to-end test becomes overly complex, with way too many conditions altering the end result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5l9jwufvipq9ggq6rs8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5l9jwufvipq9ggq6rs8.png" alt="manual vs automated testing visual concept " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that test automation needs to be more streamlined. It’s typically a good practice to follow some kind of pattern, so that there’s a standard to be followed for the whole test automation team. One of the good ones is the AAA pattern, which follows a set structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F721pi5lb9y52mc2z53zs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F721pi5lb9y52mc2z53zs.png" alt="arrange - act - assert - 3 phase concept of test automation" width="134" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This structure divides a test automation script into three distinct parts, each fulfilling a different objective.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Arrange&lt;/code&gt; is a test preparation phase. It has the goal of making sure that the data is properly seeded, the application under test is in a proper state and the context is clearly set. Typically, the arrange phase can be something like a login, creating some test data or simply opening the proper subpage.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Act&lt;/code&gt; is the heart of the automation script. It’s a series of steps, closest to what the test case might be. This phase is responsible for getting the application from state A to state B. The series of steps implicitly asserts that the functionality of the application under test is unbroken and allows the user to use the app’s features. In a to-do app, this would be creating, editing or deleting a to-do item.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Assert&lt;/code&gt; is the most important step of a test. Without an assertion, the script is merely a series of steps. Assertions are an explicit confirmation that the functionality of the application under test works. In a to-do app, assertion might make sure that a to-do item is visible after it was created, no longer visible after it was deleted and so on.&lt;/p&gt;
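&lt;p&gt;Sketched against an in-memory to-do model (a stand-in for the application under test - a real e2e suite would drive the browser in each phase), the three phases look like this:&lt;/p&gt;

```typescript
// A runnable arrange-act-assert sketch against an in-memory to-do model.
// In a real e2e suite each phase would drive the browser; TodoStore here is
// a hypothetical stand-in for the application under test.
class TodoStore {
  private items: string[] = [];
  add(item: string): void {
    this.items.push(item);
  }
  remove(item: string): void {
    this.items = this.items.filter((i) => i !== item);
  }
  has(item: string): boolean {
    return this.items.includes(item);
  }
}

function testDeleteTodo(): void {
  // Arrange: seed the exact state the test depends on
  const store = new TodoStore();
  store.add("buy milk");

  // Act: drive the application from state A to state B
  store.remove("buy milk");

  // Assert: explicitly confirm the functionality works
  if (store.has("buy milk")) throw new Error("deleted item is still visible");
}

testDeleteTodo();
```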

&lt;h2&gt;Manual vs. automated&lt;/h2&gt;

&lt;p&gt;You can probably see some resemblance to the AAA pattern in manual testing. Even in manual testing there’s preparation, exploration and assessment. But as we stated earlier, that does not mean that simply automating a test case is going to work well.&lt;/p&gt;

&lt;p&gt;Let’s demonstrate this in an example. The following is a cookie message that appears on a page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2zbp8h8zvsrli1pdoxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2zbp8h8zvsrli1pdoxn.png" alt="cookie banner conceptual visual" width="800" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a message that can appear seemingly randomly on a page. If testing this message is not within the scope, then the best course of action is to simply click “accept all” when the cookie consent message appears and move on.&lt;/p&gt;

&lt;p&gt;In test automation this poses a challenge. The message might be covering a portion of the page and we need to get it out of the way when it appears. Test automation scripts usually run in a clean browser, which means that we will see this message more often than when testing manually. This can be handled with a condition in the test script. In pseudo code it might look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07rn74wxju0ig4b0vpg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07rn74wxju0ig4b0vpg5.png" alt="test script condition for a cookie test" width="500" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are shaking your head at this, I have to tell you that I have seen this approach many times. This little helper introduces a problem into the test automation suite. The script closes the cookie message anytime it runs, but what if the cookie message never appears, even when it should? And how do we test a case when we want the message to appear?&lt;/p&gt;

&lt;p&gt;The key to making a good test automation decision is to have decent technical knowledge. Digging a little bit deeper into the cookie message functionality reveals that when we click the “accept all” button, a setting (in the form of a cookie) is saved into the browser storage. This ensures that we don’t see the message on the same page over and over again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5l58z4fhwbwpe6qz3j7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5l58z4fhwbwpe6qz3j7.png" alt="cookie banner visual concept 2" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is now a functionality we can test! Even better, it’s a functionality that we can control. Since most test automation frameworks always open the browser with storage cleaned up, we can control when the cookie message appears by choosing when to inject the consent cookie into the browser.&lt;/p&gt;
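&lt;p&gt;In pure logic, the control we gain looks like this - a sketch with an illustrative cookie name; in a real suite you would inject the cookie through your framework (e.g. Playwright’s &lt;code&gt;context.addCookies&lt;/code&gt;):&lt;/p&gt;

```typescript
// Pure-logic sketch of taking control of the consent cookie. In a real
// suite you would inject the cookie through the framework (e.g. Playwright's
// context.addCookies()); the cookie name and shape here are illustrative.
interface Cookie {
  name: string;
  value: string;
}

// The banner shows only while no consent cookie is stored.
function bannerVisible(cookies: Cookie[]): boolean {
  return !cookies.some((c) => c.name === "cookie_consent");
}

// Test that wants to see the banner: start from a clean browser.
const cleanJar: Cookie[] = [];
// Test that doesn't: inject consent up front instead of clicking it away.
const seededJar: Cookie[] = [{ name: "cookie_consent", value: "all" }];
```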

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa91r48ckbhkpygveu9ia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa91r48ckbhkpygveu9ia.png" alt="conceptual graph showing the concept of 3 phases of test automation" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Making good test automation decisions&lt;/h2&gt;

&lt;p&gt;There’s a very popular idea in the world of software development known as DRY - don’t repeat yourself. It’s a simple principle stating that whenever you have a piece of code that needs to be used at multiple places, it should not be repeated, but rather abstracted. Many testers apply this principle to their test automation code.&lt;/p&gt;

&lt;p&gt;Let’s again demonstrate this in a simple example.&lt;/p&gt;

&lt;p&gt;Imagine that there are 500 end-to-end tests written for an application that requires users to log in. For the sake of simplicity we’ll assume that every one of these 500 tests requires the user to be logged in.&lt;/p&gt;

&lt;p&gt;Applying the DRY principle to our test code means that we are going to create a login function that will get called at the beginning of each test as the “arrange” section.&lt;/p&gt;

&lt;p&gt;But even with the code abstracted, executing the login sequence in all 500 tests means that logging in alone takes 25 minutes of the whole test run (given that the login sequence takes 3 seconds: 500 × 3 s = 1,500 s). For some tests this creates an imbalance in how long each part of the test takes. After all, the reason we write tests is much more connected to the act and assert phases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2krw1y434j1ptbm7qay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2krw1y434j1ptbm7qay.png" alt="3 phase test automation stages with dominant " width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The good news is that this draws a clear picture of what we need to focus on if we want to optimize.&lt;/p&gt;

&lt;p&gt;In this case, we can use the same principle as we did with our cookies. It’s pretty much the same idea as when checking the “remember me” box when logging in to a page. But instead of using it to automatically log in to your favourite social media, you can simulate the same behavior in your tests.&lt;/p&gt;

&lt;p&gt;In principle, this approach has the goal of creating relevant contexts and then re-using them. Instead of starting each test from scratch, we want to reuse contexts in as many tests as we can. Login is a great candidate for this, because login does not change from test to test and it’s also rarely affected by other test phases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vnp5xlaautlo44zuze4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vnp5xlaautlo44zuze4.png" alt="visual representation of test automation optimization, 3 phase concept" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern test automation tools such as &lt;strong&gt;Playwright&lt;/strong&gt; and &lt;strong&gt;Cypress&lt;/strong&gt; already offer built-in support for caching login sessions. This approach is not yet widely adopted, but it is the best way of dealing with applications that require login. &lt;a href="https://www.octomind.dev/" rel="noopener noreferrer"&gt;Octomind&lt;/a&gt; provides an option to set up a shared authentication state as well as native support for one-time passwords.&lt;/p&gt;
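&lt;p&gt;In Playwright, for example, this can be done by logging in once in a global setup and persisting the session to disk; every test then starts from the saved state. A minimal sketch, assuming a hypothetical login page, selectors, and environment variables (this is not Octomind’s implementation):&lt;/p&gt;

```typescript
// global-setup.ts -- runs once before the whole suite.
// The URL, selectors, and credentials below are placeholders.
import { chromium, type FullConfig } from '@playwright/test';

export default async function globalSetup(config: FullConfig) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/login');
  await page.fill('#email', process.env.E2E_USER ?? '');
  await page.fill('#password', process.env.E2E_PASSWORD ?? '');
  await page.click('button[type="submit"]');
  // Persist cookies and local storage so tests can start already logged in
  await page.context().storageState({ path: 'auth.json' });
  await browser.close();
}
```

&lt;p&gt;Registering this file as &lt;code&gt;globalSetup&lt;/code&gt; and setting &lt;code&gt;use: { storageState: 'auth.json' }&lt;/code&gt; in &lt;code&gt;playwright.config.ts&lt;/code&gt; replays the cached session instead of the login sequence in every test.&lt;/p&gt;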

&lt;p&gt;This not only optimizes test performance, but also reduces the number of login attempts. That helps tremendously when testing applications that use rate limiting or captchas to protect against brute-force attacks, which can be a significant hurdle in test automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving test actions
&lt;/h2&gt;

&lt;p&gt;Another area of improvement is the “act” section of the test. While it’s not the obvious first candidate, there’s a lot of potential here for making tests faster and therefore making test automation worth it.&lt;/p&gt;

&lt;p&gt;There’s a lot that happens in the “act” phase when doing e2e testing. It’s a good practice to ask whether parts of this phase are being reused across tests. Let’s say you are testing a to-do app. There’s a good chance that you need to create an initial todo item (or more) to meet your testing goal. This can potentially bloat your “act” phase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmm2e5o1xrg1mrwfsvlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmm2e5o1xrg1mrwfsvlv.png" alt="3 phases of test automation of to-do app with 'to do creation' included in the 'act' phase" width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Following the DRY principle, testers usually abstract the creation of todo items into a function of its own. And this is a good thing to do, because the action can now be reused in multiple tests.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[conceptual graph showing the repeated “create todos” step across to-do app tests]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But if you stop and think about it, isn’t creating todo items the part of our test that is responsible for arranging it?&lt;/p&gt;

&lt;p&gt;It definitely is!&lt;/p&gt;

&lt;p&gt;This means that if we want to draw our graph correctly, it should look more like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx9hef5xbxno63c82ti6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx9hef5xbxno63c82ti6.png" alt="3 phases of test automation of to-do app with 'to do creation' included in the 'arrange' phase" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It now seems that we have made the arrange part of our test too big again. But this is another situation where we can apply the same principle as with login. Once we identify the parts that set up our tests, we can abstract them, then reuse and optimize them across different tests.&lt;/p&gt;
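&lt;p&gt;One way to pull such setup out of the UI flow is to arrange the data through the app’s API instead. A minimal Playwright fixture sketch, assuming a hypothetical &lt;code&gt;/api/todos&lt;/code&gt; endpoint:&lt;/p&gt;

```typescript
// todo-fixtures.ts -- seed todos via the backend instead of clicking
// through the UI. The '/api/todos' endpoint is a made-up example and
// assumes a baseURL is configured in playwright.config.ts.
import { test as base } from '@playwright/test';

export const test = base.extend<{ seededTodos: string[] }>({
  seededTodos: async ({ request }, use) => {
    const titles = ['buy milk', 'water plants'];
    for (const title of titles) {
      await request.post('/api/todos', { data: { title } });
    }
    await use(titles); // tests receive the seeded titles to assert against
  },
});
```

&lt;p&gt;A test can then declare &lt;code&gt;async ({ page, seededTodos }) =&amp;gt; …&lt;/code&gt; and start directly with the “act” phase.&lt;/p&gt;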

&lt;h2&gt;
  
  
  Test chaining and dependency structure
&lt;/h2&gt;

&lt;p&gt;Chaining tests using dependencies is a best practice for structuring test automation efficiently. In an automated approach, each test case represents a small task in the user flow you want to cover. Testing an entire user flow then means executing a sequence of test cases chained together.&lt;/p&gt;

&lt;p&gt;Every test case is executed only once, which reduces test runtime significantly. The approach follows the DRY principle, allowing for easier test maintenance, and the separation minimizes the number of test code adjustments needed when the code in your app changes. This holds for both hand-written tests in &lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; or &lt;a href="https://www.cypress.io/" rel="noopener noreferrer"&gt;Cypress&lt;/a&gt; and tests autogenerated by &lt;a href="https://octomind.dev/docs/" rel="noopener noreferrer"&gt;Octomind&lt;/a&gt;.&lt;/p&gt;
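&lt;p&gt;Octomind chains individual test cases; in plain Playwright, the closest built-in mechanism is project dependencies in the config. A minimal sketch (the project and file names are placeholders):&lt;/p&gt;

```typescript
// playwright.config.ts -- the 'setup' project runs first (e.g. logging in
// and saving shared state); the dependent 'e2e' project then reuses it.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'setup', testMatch: /.*\.setup\.ts/ },
    {
      name: 'e2e',
      dependencies: ['setup'], // guarantees 'setup' ran before 'e2e'
      use: { storageState: 'auth.json' },
    },
  ],
});
```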

&lt;p&gt;A dependency structure also gives you and your collaborators a better overview as your test suite grows, since over time it gets harder to understand what each test does or how much of your app is actually covered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kxyh40zac762vqksqcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kxyh40zac762vqksqcz.png" alt="dependency tree test arrangement in Octomind testing tool, screenshot" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test cases vs. test scripts
&lt;/h2&gt;

&lt;p&gt;This brings us to the final difference between manual testing with test cases and test script automation. Test cases are usually put together following patterns of user behavior, then grouped so that testing covers a given set of features and functionalities. Test scripts, while they should mimic user behavior, do not need to follow the same grouping; they can instead be grouped by the setup each test needs.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Testing is highly analytical work and requires good knowledge of the system under test. Test automation requires all of that, plus additional analysis of what needs to be broken down in order to run tests optimally. Often, the optimization process is like a mathematical equation in which you can subtract redundant parts. A manual approach with test cases is full of such parts, and the art of effective test automation lies in identifying them.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>webdev</category>
      <category>automation</category>
    </item>
    <item>
      <title>Prompt engineering your boss</title>
      <dc:creator>Veith Röthlingshöfer</dc:creator>
      <pubDate>Thu, 21 Dec 2023 17:29:55 +0000</pubDate>
      <link>https://forem.com/veith-octomind/prompt-engineering-your-boss-39a4</link>
      <guid>https://forem.com/veith-octomind/prompt-engineering-your-boss-39a4</guid>
      <description>&lt;p&gt;Our designer started playing around with &lt;strong&gt;Midjourney&lt;/strong&gt; to generate caricature stickers of our team in Octomind’s app trademark colors during one of his breaks. These were intended to be a fun addition to our Slackmojis, FigJam stickers, and &lt;a href="https://graphite.dev/docs/key-features#customizable-meme-library" rel="noopener noreferrer"&gt;Graphite’s meme collection&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;We usually lean into the useful side of generative AI when working on our &lt;a href="https://app.octomind.dev/setup/url" rel="noopener noreferrer"&gt;AI-powered UI testing tool&lt;/a&gt;. But hey, it's the festive season, so why not have some fun with it and share it with you guys.  &lt;/p&gt;

&lt;p&gt;(DISCLAIMER: At Octomind, we pride ourselves on having a friendly work environment where we get along pretty informally. This might not be the ideal Christmas present for your boss!)&lt;/p&gt;

&lt;p&gt;Here are a few examples to give you an idea of what I’m talking about:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faig1s9dtxpjsfhb6zcbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faig1s9dtxpjsfhb6zcbd.png" alt="sticker example" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Our designer would upload a reference image and provide small prompts like &lt;code&gt;&amp;lt;image link&amp;gt; &amp;lt;core emotion&amp;gt;, cartoon sticker style, mint green and dark lavender&lt;/code&gt;. He would then iterate through numerous attempts until he arrived at something that looked about right. Some of the iterations were downright terrifying, but that's all part of the process.&lt;/p&gt;

&lt;p&gt;I, for one, went through several iterations before arriving at a sticker that somewhat(?) resembles me. It created gems like these:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcf27z33dh0wq3ql1kp7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcf27z33dh0wq3ql1kp7k.png" alt="AI generated imagery" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Side note:&lt;/strong&gt;&lt;/em&gt; Midjourney has a penchant for generating plants and other random items in the background. I suppose it's because the embedding distance from "lavender" to "plant" is quite short.&lt;/p&gt;

&lt;p&gt;All in all, it’s been a lot of fun, so we thought our boss, the co-founder &amp;amp; CTO, might enjoy it too. &lt;/p&gt;

&lt;p&gt;What do you say, Daniel R.? Are you up for some caricature stickers of yourself?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt9ob5m2hizoqzawzxis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt9ob5m2hizoqzawzxis.jpg" alt="photo of a man" width="250" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;He turned out to be a tough nut to crack. Whatever Midjourney generated, it never resembled him at all. Daniel became known as ‘the Unmidjourneyable’.  &lt;/p&gt;

&lt;p&gt;After playing around with it for a bit on my train ride home from our Christmas party, I found a workflow that produced decent results. It took several tools and little tricks, which I’ll share with you here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Midjourney vs. Dall-E 3
&lt;/h2&gt;

&lt;p&gt;We use Midjourney to generate octopus imagery within our brand style, and the typical Midjourney look usually gets us closer to visual consistency. &lt;strong&gt;OpenAI’s Dall-E 3&lt;/strong&gt; is, however, a lot better at understanding complex prompts than Midjourney is. So I knew that I would probably want to start there.&lt;/p&gt;

&lt;p&gt;Simply combining the image and prompt in Dall-E wasn't producing the results I wanted. The model seemed to struggle with the requested "mint green and dark lavender color palette" more than Midjourney did.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuvww2hcxdmkf1muqqep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuvww2hcxdmkf1muqqep.png" alt="AI generated image" width="358" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also had to be a bit careful with my wording here. GPT doesn’t want you generating images of other people and refuses such prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Style transfer with Dall-E
&lt;/h2&gt;

&lt;p&gt;First, I needed a way to generate something in a style similar to what we want. We already have a few example stickers where Midjourney worked well (shout out to my colleague Fabian’s face for being very prompt-able!), and I already knew a technique from a private project.&lt;/p&gt;

&lt;p&gt;You can ask Dall-E to analyze an image and generate a detailed description of the &lt;strong&gt;art style&lt;/strong&gt;. You can then use that description as a guideline for the art style in a second image generation step. This technique also lets you keep the style relatively consistent across new chats and image generations!&lt;/p&gt;

&lt;p&gt;The prompt I used was:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42nluevawij43rtlcqqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42nluevawij43rtlcqqm.png" alt="generated sticker example" width="292" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Me:&lt;/strong&gt; Describe the style of this image. Do not focus on the actual content, only on the style, colors and other artistic elements. Name some art styles and artists that fit this image. Be as detailed as possible.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Response:&lt;/strong&gt; The image uploaded features a caricature-style illustration with exaggerated facial features commonly found in modern caricature art. The subject's head is larger in comparison to the body, emphasizing the facial expression, which in this case is one of anger or annoyance. The use of bold, dark outlines to define the shapes is reminiscent of comic book art, particularly the styles found in graphic novels or mature-themed comics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The colors are vivid but not overly bright, suggesting a more subdued palette that enhances the serious tone of the expression. The use of shading and highlights on the face gives the image a three-dimensional feel, adding to the impact of the expression.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This style of illustration might be influenced by artists like Mort Drucker or Tom Richmond, known for their work in MAD Magazine, where such exaggerated features and expressions are a hallmark. However, it also contains elements seen in the works of street artists who often employ bold lines and vivid colors to create impactful imagery.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In terms of art styles, it aligns with Pop Art in its boldness and accessibility, as well as with Expressionism in its focus on representing emotional experience rather than physical reality. The style does not fit directly into the classic fine art categories but has a contemporary edge that is often seen in digital art, graphic design, and illustration.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Next, read the response carefully and adjust it as desired. Dall-E sometimes sneaks in phrases you do not want to have in your style description. In this case, we want to use it as a style guide later, so I replaced &lt;code&gt;The image uploaded features […]&lt;/code&gt; with &lt;code&gt;The style should resemble […]&lt;/code&gt;. Additionally, I didn’t want Daniel’s sticker version to be angry, so I replaced &lt;code&gt;[...] in this case is one of anger or annoyance&lt;/code&gt; with &lt;code&gt;[...] in this case is one of hype and excitement&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I added some context to the following prompt template to generate our sticker:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;You are an expert in creating amazing caricature sticker designs based on a textual description of a person and a desired style. Generate a sticker based on the description: &amp;lt;add description&amp;gt; And the style: &amp;lt;add style&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I will refer to this as “the template” from now on. Next, I had to make sure that the sticker looks similar to our co-founder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Playing police
&lt;/h2&gt;

&lt;p&gt;The only way to interact with either image generation model is through a textual description of what you want to see. Prompts like “make it look like the photo” don’t work that well, as seen above, and I am not good enough at describing a person’s looks to trust my own description to be accurate and reproducible.&lt;/p&gt;

&lt;p&gt;I thought: “Who is better at drawing someone based on a description than a police sketch artist?”&lt;/p&gt;

&lt;p&gt;Using the following prompt, I fed Daniel’s image to &lt;strong&gt;ChatGPT&lt;/strong&gt; to get a description:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Me:&lt;/strong&gt; You are an expert police sketch artist. You know everything about describing people to recreate a face from a textual description. Generate the perfect description for this man. Go over each aspect that makes him unique, describing as many individual physical features as possible. Use terms that set each feature in relation to how that feature looks on other people. Do not focus on the background. Make sure you could recreate this man based on the description you provide.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Response:&lt;/strong&gt; The individual in the image is a male appearing to be in his mid-30s to early 40s. He has a fair complexion with a warm undertone, and his face shape can be classified as oval, with a slightly elongated chin that gives a hint of a heart-shaped contour. His hair is short and styled in a casual, slightly tousled manner, exhibiting a peppered ash brown color, indicative of the onset of graying primarily at the temples.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;His forehead is broad and visibly unlined, leading to well-arched, medium-thickness eyebrows that are neatly groomed, slightly lighter than his hair color, suggesting they may lighten further with age. His eyes are almond-shaped, with a noticeable palpebral fold, and the irises are a cool hue, possibly blue or grey, framed by average-length lashes that do not appear particularly thick.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The nose is straight with a narrow bridge and a rounded, average-sized nasal tip, showing no signs of deviation or prominent bumps. The nostrils are symmetrical and of moderate size. His cheeks are relatively flat, with a subtle hint of cheekbones just below the mid-eye level. The man has a gentle smile, which slightly elevates the cheeks without revealing the teeth. His lips are of medium thickness with the upper lip slightly thinner than the lower, and the lip color is a healthy pink, in harmony with his overall complexion.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;His ears are of average size, proportionate to his face, with the lobes attached and no prominent helix or antihelix features. His jawline is defined, though not sharply angular, leading to a well-proportioned chin that complements his face without being overly prominent.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;He wears no visible jewelry or piercings and his attire suggests a practical outdoors orientation - he's wearing a dark olive green hooded jacket with the brand "Columbia" visible on the right side of the chest in white lettering, indicating a preference for functional, perhaps outdoor or athletic clothing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The overall impression is of a man who takes care in his appearance without it being overly groomed or styled, presenting a natural and approachable demeanor.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then, I used this description and the template to generate a first sticker version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne27r4nwygz7tlz27agi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne27r4nwygz7tlz27agi.png" alt="AI generated image" width="327" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is already closer than anything we managed to do with the picture and a short prompt, but still not really recognizable. I’d need a feedback loop!&lt;/p&gt;

&lt;p&gt;Continuing in the chat where I generated the first description, I asked it to analyze the generated image, the photo, and the description, and to update the description to clear up any misconceptions. But GPT-4 got confused about which image was which and simply generated a new description of the generated image.&lt;/p&gt;

&lt;p&gt;I needed a new trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Playing professor
&lt;/h2&gt;

&lt;p&gt;Actually, I’ll pull not only one, but two tricks out of my hat.&lt;/p&gt;

&lt;p&gt;It was very obvious that GPT was confusing which image I was referencing at what point. I was referring to them as “the first” and “the second”, or “the photo” and “the illustration”, but that wasn’t working, especially over the course of several interactions.&lt;/p&gt;

&lt;p&gt;The first hack was to combine the photo and the illustration into one image using the open-source image editing tool &lt;strong&gt;GIMP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rkznrqbnwmjr72m9uyb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rkznrqbnwmjr72m9uyb.jpg" alt="original and its AI generated illustration combined in one picture" width="491" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, I could refer to them as “the photo on the left” and “the illustration on the right”, which really seemed to reduce the confusion.&lt;/p&gt;

&lt;p&gt;The second trick is to adjust the context of who's describing the image and generating the prompt. A police sketch artist may be great at drawing a person based on a description, but is not necessarily good at generating that description in the first place. So I turned it into a very critical professor instead, which produced a way more detailed result:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Me:&lt;/strong&gt; You are an expert police sketch artist. You are currently giving a lecture on how to describe people so that a police sketch artist can accurately reproduce their face. To teach people, you are letting them perform the following task: "Write a description of a photo of a man that I have given you. Another police sketch artist will draw the man based on your description. I will criticize and improve your description after we have seen the resulting illustration". You are now criticizing the description one of your students generated. The description he generated is: .&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Response:&lt;/strong&gt; The description provided by the student is quite detailed and paints a clear picture of the individual in question. However, to refine it for increased accuracy in a police sketch, we can focus on a few key areas:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Facial Structure:&lt;/strong&gt; The description mentions an oval face shape with a hint of a heart-shaped contour due to an elongated chin. To improve, we could specify the ratio of forehead to lower face to better capture the balance of facial features.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Eyes:&lt;/strong&gt; While 'almond-shaped' is a good start, we could provide the relative distance between the eyes and the exact shape and orientation of the eyelids.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Nose:&lt;/strong&gt; Instead of 'straight with a narrow bridge,' we might describe the exact length of the nose in relation to the width of the face and the shape of the nasal tip more precisely.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Mouth and Smile:&lt;/strong&gt; The description of the smile is good, but the shape of the mouth when smiling can vastly change the perception of the face. Is the smile wide, does it extend beyond the eye line, etc.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Hair:&lt;/strong&gt; More specifics on the distribution of graying and the texture of the hair would be helpful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Complexion:&lt;/strong&gt; We can clarify 'fair with a warm undertone' by giving examples or comparing with common complexion types.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Clothing:&lt;/strong&gt; The description is adequate, but the position of the jacket zipper, the hood, and how the clothing sits on the body could help.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Here's an improved description: ... a slightly better description ...&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is generally a good idea to let LLMs break down the task into several steps. In this response, it did provide some very good criticism, but didn’t actually apply it in the improved description. The result using this description looked like this one below. The image was generated in the same chat as the previous version, without supplying the style instructions again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0m16ohwl5zfddacz1e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0m16ohwl5zfddacz1e6.png" alt="AI generated image" width="307" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I asked it to apply the criticism and generate a more detailed description.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Response:&lt;/strong&gt; The subject is a Caucasian male estimated to be in his late 30s. His complexion is fair with a neutral undertone, leaning slightly towards a peachy warmth. The face shape is an even blend of oval with a subtle hint of heart-shaped, primarily due to the slightly wider forehead and a gently tapered chin which is softly rounded at the tip, avoiding sharpness.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;His hair is cut short, with a texture that suggests a thickness in the strands. The color is a rich ash brown with a prominent sprinkling of gray, more concentrated around the temples and blending into the natural color towards the crown. There is a subtle side parting on his left, and the hair follows a natural growth pattern with a slight elevation at the front, giving a casual, tousled appearance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The forehead is broad, occupying a third of the facial length, smooth, with no visible furrows or frown lines. His eyebrows are natural in shape with a soft arch, groomed but not sculpted, slightly lighter than the hair on his head, hinting at a sandy brown shade. They are of medium thickness, neither bushy nor sparse.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Eyes are of a size that balances with the other facial features, set at a moderate distance from each other. The shape is a classic almond, with upper eyelids having a prominent but not heavy palpebral fold. The eyes are reminiscent of the color of the summer sky, a clear blue with a grayish tint. The lashes are of average density and length, framing the eyes in a way that is noticeable but not striking.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The nose is well-proportioned to his face, straight, with a bridge that is of moderate width. The tip is rounded and slightly bulbous, which gives character without dominating his facial features. The nostrils are symmetrical and proportionate to the size of the nose.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;His cheeks are smooth, with a very subtle definition of cheekbones aligned with the center of the eyes. They do not protrude but give structure to his face. The cheeks transition into a gentle smile that lifts the area modestly, causing a small crinkle at the outer corners of the eyes, which adds a friendly warmth to his demeanor. His lips are naturally contoured, the upper lip being slightly thinner than the full, more pronounced lower lip. Both have a healthy blush pink tone, and the smile does not stretch broadly across the face but is enough to show a hint of dimples.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The ears are in proportion with the facial size, with the tops aligning with the brow line and the bottoms with the base of the nose. The lobes are attached, and there are no distinctive markings or piercings.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;His jawline is defined without being overly angular, leading to a chin that complements his oval face without becoming a focal point. There is a slight indentation below the lower lip, providing a subtle division between the chin and the rest of the face.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The attire, a dark olive green hooded jacket with a visible 'Columbia' logo on the right chest, suggests an affinity for outdoor wear. The jacket is worn in a relaxed manner, zipped to mid-chest, with the hood resting flat against the back. The fabric appears to be a durable, water-resistant material suitable for outdoor activities.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Overall, the man has a relaxed, congenial appearance, with well-maintained but not overly styled features, conveying an impression of practicality and an easygoing nature.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, this reads like something usable! Another good thing about this process is that you can iterate it until you are happy (someone should automate that &lt;em&gt;cough, cough&lt;/em&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  The final push
&lt;/h2&gt;

&lt;p&gt;Using this last description, I got the following image from DALL-E:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dugphxf8tqo92kj5g6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dugphxf8tqo92kj5g6z.png" alt="AI generated image" width="307" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks relatively recognizable, but is not as exaggerated as the original sticker, and it also lacks the mint green and dark lavender color palette. And as it turns out, Daniel is Barney Stinson’s long lost half-brother! &lt;/p&gt;

&lt;p&gt;But we already know that Midjourney is quite good at molding images into caricatures and following a color scheme. And by now, the complexity of generating the sticker has been greatly reduced.&lt;/p&gt;

&lt;p&gt;At the beginning, Midjourney received a photo and had to abstract it into a sticker: turn it into a caricature, change the displayed emotions, convert it to an illustration, and switch to an unrealistic color palette. Now, it only needs to exaggerate the emotions and switch the color palette.&lt;/p&gt;

&lt;p&gt;And it did a great job! Feeding this image into Midjourney resulted in what I was going for.&lt;/p&gt;

&lt;p&gt;I used the prompt &lt;code&gt;&amp;lt;image url&amp;gt; exaggerated hype and excitement, mint green and dark lavender color palette, sticker design, caricature --iw 1.75&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I played with the image weight parameter --iw a bit and found that 1.75 strikes a balance between sticking to the source material (--iw 2) and allowing the changes I described in the text prompt (--iw 0).&lt;/p&gt;

&lt;p&gt;I also found that it helps not only to specify “mint green and dark lavender”, but to explicitly add the words “color palette”, which significantly reduces the number of random plants in the background.&lt;/p&gt;

&lt;p&gt;The only thing left to do is to cut out the sticker from the background with GIMP. Even if the background is not uniformly colored, it’s pretty easy thanks to the white outline of the person. &lt;/p&gt;
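&lt;p&gt;Conceptually, this cut-out is just a flood fill started from the image corners: everything connected to the border gets removed, and the fill stops at the white outline because those pixels differ too much from the background color. Here is a minimal sketch of that idea on a toy grid of RGB tuples (not my actual GIMP workflow):&lt;/p&gt;

```python
from collections import deque

def flood_fill_background(pixels, fill, thresh=40):
    """Flood-fill from every corner, replacing background pixels with
    `fill`. `pixels` is a grid (list of rows) of RGB tuples. The fill
    stops wherever a pixel differs from the corner's seed color by more
    than `thresh` per channel - e.g. at the white outline around the
    person - so the sticker itself is left untouched."""
    h, w = len(pixels), len(pixels[0])
    for sy, sx in [(0, 0), (0, w - 1), (h - 1, 0), (h - 1, w - 1)]:
        bg = pixels[sy][sx]
        if bg == fill:          # this corner was already filled
            continue
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            p = pixels[y][x]
            # Stop at already-filled pixels and at anything that is not
            # background-colored (the white outline, the sticker).
            if p == fill or any(abs(a - b) > thresh for a, b in zip(p, bg)):
                continue
            pixels[y][x] = fill
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w:
                    queue.append((ny, nx))
```

&lt;p&gt;With Pillow, &lt;code&gt;PIL.ImageDraw.floodfill&lt;/code&gt; (which also takes a &lt;code&gt;thresh&lt;/code&gt; argument) gives the same effect on an RGBA image when you fill with a fully transparent color.&lt;/p&gt;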

&lt;p&gt;Voilà!&lt;/p&gt;

&lt;p&gt;I added a little quote for more pizzazz. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mqar4d3wjpz0n8f3d13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mqar4d3wjpz0n8f3d13.png" alt="final result AI generated sticker" width="324" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt engineering your family
&lt;/h2&gt;

&lt;p&gt;With the holiday season approaching fast, I couldn’t help but think this might make a unique gift. When I shared the recipe with my team, one of my colleagues immediately started prompting stickers of her family as a Christmas gift.&lt;/p&gt;

&lt;p&gt;Why not share this with people who supported Octomind in 2023?&lt;/p&gt;

&lt;p&gt;I wrote this post to share with Octomind friends, users, and fans. This is our special genAI way of saying: thank you.&lt;/p&gt;

&lt;p&gt;Veith Röthlingshöfer&lt;br&gt;
ML engineer at Octomind&lt;/p&gt;

</description>
      <category>genai</category>
      <category>promptengineering</category>
    </item>
  </channel>
</rss>
