<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Olga Braginskaya</title>
    <description>The latest articles on Forem by Olga Braginskaya (@olgabraginskaya).</description>
    <link>https://forem.com/olgabraginskaya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1050562%2Fe109cfc9-05e8-418b-baf0-7300b94453d5.jpg</url>
      <title>Forem: Olga Braginskaya</title>
      <link>https://forem.com/olgabraginskaya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/olgabraginskaya"/>
    <language>en</language>
    <item>
      <title>Know your tokens. Own your costs.</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Wed, 25 Mar 2026 10:18:18 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/know-your-tokens-own-your-costs-8k2</link>
      <guid>https://forem.com/olgabraginskaya/know-your-tokens-own-your-costs-8k2</guid>
      <description>&lt;p&gt;A few weeks ago someone on Twitter (or it's X.com now) was mansplaining to me that LLM tokens are basically random, that nobody really knows how they are calculated and that there is no way to log or control them. Which was unfortunate for him because at my previous job I was doing observability on token usage for a POC agent, tracking input and output tokens per model through the OpenAI API.&lt;/p&gt;

&lt;p&gt;I did not win that argument because I just banned the guy, but it got me thinking. Not because he was wrong, that part was obvious, but because the confusion seemed real. Tokens sound abstract, the pricing pages list them but never really show what they are, and if you have not worked with the APIs directly it is easy to assume there is some black box magic involved.&lt;/p&gt;

&lt;p&gt;There is not. Tokens are deterministic, countable and loggable. Some providers, like OpenAI and Google, even let you see exactly how your text gets split and count tokens before sending a request, and every API response tells you exactly how many tokens went in and came out. Let's see how it actually works.&lt;/p&gt;

&lt;p&gt;One note before we start: this post focuses on text tokens only. Image, audio and video inputs have their own tokenization rules and pricing, but that is a separate topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a token
&lt;/h3&gt;

&lt;p&gt;Before we get to any code, let's understand what we are actually talking about.&lt;/p&gt;

&lt;p&gt;When you send text to an LLM, the model does not read words or characters. It reads tokens, which are chunks of text that the model learned to treat as single units. A token can be a whole word like "hello", a piece of a word like "ing" or "pre", a single character like "." or even a space. The model has a fixed vocabulary of these chunks, usually between 50,000 and 200,000 of them, and every piece of text you send gets split into a sequence of tokens from that vocabulary.&lt;/p&gt;

&lt;p&gt;This was not always how it worked. Early NLP models tokenized by words, which sounds intuitive but breaks down fast because every misspelling, new word or compound term becomes an unknown token. Then there were character-level models that read one letter at a time, which solved the unknown word problem but made sequences extremely long and expensive. Modern LLMs use something in between called Byte Pair Encoding (BPE). BPE starts with individual characters and then iteratively merges the most frequent pairs into new tokens until the vocabulary reaches the target size. So common words like "the" become a single token, while rare words get split into smaller pieces that the model has seen before.&lt;/p&gt;
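&lt;p&gt;The merge loop at the heart of BPE is small enough to sketch. This is a toy version of the training step only - real tokenizers like &lt;code&gt;tiktoken&lt;/code&gt; work on bytes, apply pre-tokenization rules and ship a frozen vocabulary - but it shows how frequent pairs become single tokens:&lt;/p&gt;

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Start from individual characters, then repeatedly merge
    # the most frequent adjacent pair across the whole corpus.
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

corpus = ["the", "the", "the", "then", "there", "press"]
merges = bpe_merges(corpus, 3)
print(merges)
```

&lt;p&gt;On a tiny corpus where "the" dominates, the first merges are exactly what you would expect: t+h, then th+e, so "the" ends up as a single token after two rounds.&lt;/p&gt;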

&lt;p&gt;The practical result is that common English text tokenizes to roughly 1 token per 4 characters, but that ratio changes a lot depending on the language, the domain and the specific tokenizer.&lt;/p&gt;

&lt;p&gt;You can see all of this happening with &lt;code&gt;tiktoken&lt;/code&gt;, OpenAI's open source tokenizer library. As the prompt for this article I chose the classic interview question 'What happens when you type a URL into a browser and press enter?'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you the token IDs and the actual text each one represents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[4827, 13367, 1261, 481, 1490, 261, 9206, 1511, 261, 10327, 326, 4989, 5747, 30]
['What', ' happens', ' when', ' you', ' type', ' a', ' URL', ' into', ' a', ' browser', ' and', ' press', ' enter', '?']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's try the same question in Russian &lt;code&gt;text = "Что происходит, когда вы вводите URL в браузер и нажимаете Enter?"&lt;/code&gt; and see how the splits change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[63048, 63017, 11, 21029, 3341, 108579, 6989, 9206, 743, 120026, 1025, 816, 90565, 2271, 82786, 12240, 30]
['Что', ' происходит', ',', ' когда', ' вы', ' ввод', 'ите', ' URL', ' в', ' брауз', 'ер', ' и', ' наж', 'им', 'аете', ' Enter', '?']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The English version produced 14 tokens while the same question in Russian took 17, and you can see why: words like "происходит" and "нажимаете" got split into multiple pieces because the tokenizer's vocabulary has fewer Russian subwords to work with. This is also why I usually work with LLMs only in English.&lt;/p&gt;

&lt;p&gt;One more thing worth knowing: if you do not want to install anything locally, OpenAI also has an online tokenizer at platform.openai.com/tokenizer, where you can paste text and see the splits visually. It is useful for quick checks, but for anything repeatable you want the library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx3lznjh6xh74eeznkv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx3lznjh6xh74eeznkv9.png" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Same text, different bill
&lt;/h3&gt;

&lt;p&gt;Different models use different tokenizers with different vocabulary sizes, which is why the same text produces different token counts depending on which model you use. Older models like GPT-4 use a tokenizer called &lt;code&gt;cl100k_base&lt;/code&gt; with roughly 100,000 tokens in its vocabulary, while newer models like GPT-4o, GPT-5 and the reasoning family (o1, o3, o4-mini) all use &lt;code&gt;o200k_base&lt;/code&gt; which has about 200,000. Bigger vocabulary means more common patterns get merged into single tokens, which means fewer tokens for the same text, which means you pay less per request.&lt;/p&gt;

&lt;p&gt;You can see this directly with tiktoken:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpt-4: 14 tokens
[3923, 8741, 994, 499, 955, 264, 5665, 1139, 264, 7074, 323, 3577, 3810, 30]
['What', ' happens', ' when', ' you', ' type', ' a', ' URL', ' into', ' a', ' browser', ' and', ' press', ' enter', '?']

gpt-5: 14 tokens
[4827, 13367, 1261, 481, 1490, 261, 9206, 1511, 261, 10327, 326, 4989, 5747, 30]
['What', ' happens', ' when', ' you', ' type', ' a', ' URL', ' into', ' a', ' browser', ' and', ' press', ' enter', '?']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case both produce 14 tokens and the splits are identical - the text is simple common English so both vocabularies handle it the same way. The difference shows up in the token IDs, which are completely different numbers because they are two separate vocabulary tables. Where it starts to matter is longer text, technical jargon, code or non-English languages, where the larger vocabulary has a better chance of fitting things into fewer tokens.&lt;/p&gt;

&lt;p&gt;But that is just OpenAI versus OpenAI. Claude, Gemini and others each have their own tokenizers built on their own training data and none of them are compatible with tiktoken. You cannot run Claude's tokenizer locally the way you can with OpenAI's.&lt;/p&gt;

&lt;p&gt;What you can do is count tokens before making a request. Anthropic has a dedicated &lt;a href="https://platform.claude.com/docs/en/build-with-claude/token-counting?ref=datobra.com" rel="noopener noreferrer"&gt;&lt;code&gt;/v1/messages/count_tokens&lt;/code&gt;&lt;/a&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Sonnet 4.6, Haiku and Opus returned 21 tokens for the same input, which makes sense because models from the same provider typically share the same tokenizer - the difference between Sonnet and Opus is in capability and price, not in how they split text.&lt;/p&gt;

&lt;p&gt;Gemini shows the same pattern within its own family. Using Google's &lt;code&gt;count_tokens&lt;/code&gt; API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both models returned 15 tokens - but that is 15 versus Claude's 21 for the exact same text. Different provider, different tokenizer, different count. For more details, see &lt;a href="https://ai.google.dev/gemini-api/docs/tokens?ref=datobra.com" rel="noopener noreferrer"&gt;Google's token counting guide&lt;/a&gt; and &lt;a href="https://platform.claude.com/docs/en/build-with-claude/token-counting?ref=datobra.com" rel="noopener noreferrer"&gt;token counting in the Claude docs&lt;/a&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What comes back: tokens in API responses
&lt;/h3&gt;

&lt;p&gt;Counting tokens before a request is useful for cost estimation, but in production what matters is what comes back. Every API response includes the actual token usage for that call, broken down into input and output tokens. Some models also report reasoning tokens separately. Let's make the same request to all three providers and see what we get.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude's response:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What Happens When You Type a URL and Press Enter&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a classic technical interview question. Here's a thorough breakdown:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;URL Parsing The browser breaks down the URL into components: Protocol (https://), Domain (example.com), Path (/page), Query params (?id=1)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;DNS Resolution The browser needs to find the IP address for the domain. Check order: Browser cache, OS cache / hosts file, Router cache, ISP's DNS resolver, Recursive query through Root → TLD → Authoritative nameservers. Result: An IP address like 93.184.216.34&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;TCP Connection A TCP handshake occurs (SYN → SYN-ACK → ACK). For HTTPS, a TLS handshake follows: Negotiates encryption protocol, Server sends certificate, Browser verifies it, Encryption keys are exchanged&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;HTTP Request The browser sends a request: GET /page HTTP/1.1, Host: example.com, Headers: (cookies, user-agent, etc.)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Server Processing Server receives request, Routing, business logic, database queries, Builds a response&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;HTTP Response HTTP/1.1 200 OK, Content-Type: text/html, [HTML body]&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Browser Rendering Parse HTML → build DOM tree, Parse CSS → build CSSOM tree, Combine into Render Tree, Layout - calculate element positions/sizes, Paint - draw pixels to screen, Compositing - layer management. Along the way, additional requests fire for JS, CSS, images, fonts, etc.&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;The whole process typically happens in milliseconds to a few seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This entire answer cost 21 input tokens and 602 output tokens. The 21 input tokens are our prompt, the 602 output tokens are everything Claude generated above, and that is what you pay for on every API call. Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens (see &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing?ref=datobra.com" rel="noopener noreferrer"&gt;Anthropic's pricing page&lt;/a&gt; for all models), so this call works out to $0.000063 for input and $0.00903 for output - roughly $0.009 total, less than a cent. That sounds like nothing, but multiply it by thousands of requests per day and those fractions add up fast, which is exactly why tracking tokens matters.&lt;/p&gt;
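&lt;p&gt;The arithmetic is worth wrapping in a tiny helper so the cost lands in your logs next to the usage numbers. A minimal sketch - the prices are hard-coded from the pricing page, so treat them as a snapshot, not a source of truth:&lt;/p&gt;

```python
def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    # Prices are quoted per million tokens, so scale down by 1,000,000.
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Claude Sonnet 4.6 as of this writing: $3 / M input, $15 / M output
sonnet_cost = call_cost(21, 602, 3.0, 15.0)
print(f"${sonnet_cost:.6f}")  # → $0.009093
```

&lt;p&gt;Swap in the per-model prices and the same helper covers any provider, since they all bill input and output tokens separately.&lt;/p&gt;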

&lt;p&gt;If you switch to a smaller model like &lt;code&gt;claude-haiku-4-5&lt;/code&gt;, the same prompt returns 21 input tokens (same tokenizer, same count) but only 325 output tokens instead of 602. The input side stays the same because all Claude models share the same tokenizer, but the output side depends on how verbose the model decides to be and smaller models tend to give shorter answers. At Haiku's pricing of $1 per million input tokens and $5 per million output tokens, this call costs roughly $0.0016, almost six times cheaper than the same question on Sonnet.&lt;/p&gt;

&lt;p&gt;And this is just the base cost - the full picture is more nuanced. When you enable extended thinking, Claude generates thinking tokens on top of the visible output and those are billed at the same output rate. There are also ways to reduce what you pay, like prompt caching which lets you reuse previously processed input tokens at a fraction of the price. We will get into all of that in the next chapter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI returned 20 input tokens and 688 output tokens with &lt;code&gt;reasoning_tokens: 0&lt;/code&gt;. GPT-5.4 is a reasoning model, but the default reasoning effort is &lt;code&gt;none&lt;/code&gt;, so it answered without thinking and the reasoning token count stayed at zero. With reasoning turned on, that number stops being zero and becomes a significant part of your bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The usage metadata in the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GenerateContentResponseUsageMetadata(
  candidates_token_count=1387,
  prompt_token_count=15,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=&amp;lt;MediaModality.TEXT: 'TEXT'&amp;gt;,
      token_count=15
    ),
  ],
  thoughts_token_count=1203,
  total_token_count=2605
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini returned 15 input tokens and 1,387 output tokens. At $0.30 per million input tokens and $2.50 per million output tokens for Gemini 2.5 Flash (see &lt;a href="https://ai.google.dev/gemini-api/docs/pricing?ref=datobra.com#gemini-2.5-flash" rel="noopener noreferrer"&gt;Google's pricing page&lt;/a&gt;), that works out to roughly $0.0035 - significantly cheaper than Claude for this particular call, but only if you count the visible tokens. Notice something else in the response: &lt;code&gt;thoughts_token_count: 1203&lt;/code&gt;. Those are thinking tokens that Gemini generated separately from the visible output, and since they are billed at the output rate the real cost is closer to $0.0065. The total was 2,605 tokens, not 1,402.&lt;/p&gt;
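&lt;p&gt;As a sanity check, here is that arithmetic in a few lines of Python. The rates are the Gemini 2.5 Flash list prices quoted above; treat them as a snapshot and verify against the pricing page before reusing them:&lt;/p&gt;

```python
# Cost of the Gemini 2.5 Flash call above, from the usage_metadata numbers.
# Rates are list prices at the time of writing - check the pricing page.
INPUT_RATE = 0.30 / 1_000_000   # $ per input token
OUTPUT_RATE = 2.50 / 1_000_000  # $ per output token; thinking tokens bill at this rate too

prompt_tokens = 15
candidates_tokens = 1387   # visible output
thoughts_tokens = 1203     # hidden thinking

visible_cost = prompt_tokens * INPUT_RATE + candidates_tokens * OUTPUT_RATE
full_cost = prompt_tokens * INPUT_RATE + (candidates_tokens + thoughts_tokens) * OUTPUT_RATE

print(f"visible tokens only: ${visible_cost:.4f}")
print(f"including thinking:  ${full_cost:.4f}")
```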

&lt;h3&gt;
  
  
  The hidden tokens
&lt;/h3&gt;

&lt;p&gt;We already saw hints of this in the previous section - Gemini's &lt;code&gt;thoughts_token_count: 1203&lt;/code&gt; and OpenAI's &lt;code&gt;reasoning_tokens: 0&lt;/code&gt;. These are tokens the model generates internally before producing the visible answer and they are a real part of your cost.&lt;/p&gt;

&lt;p&gt;To understand what reasoning tokens actually are, think about what happens when you ask a model a hard question. Before writing the answer, the model generates a chain of internal steps. It might restate the problem, list possible approaches, work through each one, check for mistakes and settle on a final strategy. All of that is text: real tokens generated one after another, exactly like the output you see. The difference is that the reasoning content itself is not included in the response text - you only see the final answer. But the token count is reported in the usage metadata, those tokens still used compute and they are still billed as output tokens, because that is exactly what they are: output that the model produced but did not include in the visible answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI: reasoning effort&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI reasoning models like GPT-5.4 support a &lt;code&gt;reasoning&lt;/code&gt; parameter with effort levels: &lt;code&gt;none&lt;/code&gt; (default), &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;xhigh&lt;/code&gt;. When set to &lt;code&gt;none&lt;/code&gt;, the model responds without thinking and &lt;code&gt;reasoning_tokens&lt;/code&gt; stays at zero, which is what we saw in our earlier call. Let's ask the same question with reasoning turned on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;effort&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;effort&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;effort&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | input=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | output=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | reasoning=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reasoning_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | total=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effort=none | input=20 | output=1102 | reasoning=0 | total=1122
effort=low | input=20 | output=862 | reasoning=11 | total=882
effort=medium | input=20 | output=1695 | reasoning=107 | total=1715
effort=high | input=20 | output=2205 | reasoning=430 | total=2225
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reasoning tokens go from 0 at &lt;code&gt;none&lt;/code&gt; to 11 at &lt;code&gt;low&lt;/code&gt;, 107 at &lt;code&gt;medium&lt;/code&gt; and 430 at &lt;code&gt;high&lt;/code&gt;. Those reasoning tokens are not a separate charge, they are counted inside the output tokens and billed at the output rate of $15 per million tokens. At GPT-5.4's pricing (&lt;a href="https://developers.openai.com/api/docs/pricing?ref=datobra.com" rel="noopener noreferrer"&gt;OpenAI's pricing page&lt;/a&gt;) the &lt;code&gt;none&lt;/code&gt; call costs $0.0166, &lt;code&gt;low&lt;/code&gt; is actually cheaper at $0.013 because the model gave a shorter answer, &lt;code&gt;medium&lt;/code&gt; jumps to $0.0255 and &lt;code&gt;high&lt;/code&gt; hits $0.0331. Same prompt, same model, but the cost doubled from &lt;code&gt;none&lt;/code&gt; to &lt;code&gt;high&lt;/code&gt; because the model spent 430 tokens thinking and produced a longer answer on top of that.&lt;/p&gt;
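&lt;p&gt;Those per-effort costs are easy to reproduce. One caveat: the $15 per million output rate is from the pricing page, but the input rate used below ($1.25 per million) is an assumption that happens to match the quoted totals - verify it against OpenAI's pricing page before reusing this:&lt;/p&gt;

```python
# Back-of-envelope cost per effort level for the GPT-5.4 runs above.
# OUTPUT_RATE is from the text; INPUT_RATE is an assumption chosen to
# match the quoted totals - check OpenAI's pricing page.
INPUT_RATE = 1.25 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000  # reasoning tokens are billed at this rate

runs = {  # effort -> (input_tokens, output_tokens including reasoning)
    "none": (20, 1102),
    "low": (20, 862),
    "medium": (20, 1695),
    "high": (20, 2205),
}

for effort, (inp, out) in runs.items():
    cost = inp * INPUT_RATE + out * OUTPUT_RATE
    print(f"effort={effort:6s} cost=${cost:.4f}")
```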

&lt;p&gt;&lt;strong&gt;Claude: extended thinking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anthropic's equivalent is extended thinking. On Sonnet 4.6 and Opus 4.6 the recommended approach is adaptive thinking with an effort parameter, similar to OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;effort&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;output_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;effort&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;effort&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | input=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | output=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effort=low | input=21 | output=404
effort=medium | input=21 | output=557
effort=high | input=21 | output=603

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is similar to OpenAI: higher effort means more output tokens. The thinking tokens are included in the output count and billed at the same $15 per million output rate. At &lt;code&gt;low&lt;/code&gt; the total cost is about $0.006, at &lt;code&gt;high&lt;/code&gt; it is $0.009. The jump is smaller than what we saw with OpenAI because for a simple question like ours Claude's adaptive thinking does not generate much internal reasoning at any level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini: thinking budget and thinking level&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini 2.5 Flash uses &lt;code&gt;thinking_budget&lt;/code&gt;, a raw token number from 0 to 24,576. It works, but it is harder to compare across providers because you are setting a ceiling in tokens rather than picking an effort level. Gemini 3 models introduced &lt;code&gt;thinking_level&lt;/code&gt;, which works more like OpenAI's effort parameter with named levels: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;. Let's use &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; for a like-for-like comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when you type a URL into a browser and press enter?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thinking_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;level=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | input=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | output=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | thinking=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | total=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;level=minimal | input=15 | output=893 | thinking=None | total=908
level=low | input=15 | output=1024 | thinking=576 | total=1615
level=medium | input=15 | output=1085 | thinking=590 | total=1690
level=high | input=15 | output=975 | thinking=403 | total=1393
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At &lt;code&gt;minimal&lt;/code&gt; there are no thinking tokens at all and the model produces 893 output tokens. Once you switch to &lt;code&gt;low&lt;/code&gt; the thinking kicks in with 576 tokens and &lt;code&gt;medium&lt;/code&gt; is similar at 590. Interestingly &lt;code&gt;high&lt;/code&gt; used fewer thinking tokens (403) than &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;medium&lt;/code&gt; in this case, which shows that the levels are ceilings, not guarantees: the model decides how much thinking it actually needs and a simple question like ours does not require deep reasoning regardless of the level you set. The cost difference is still real though. At Gemini 3 Flash pricing of $0.50 per million input tokens and $3 per million output tokens (&lt;a href="https://ai.google.dev/gemini-api/docs/pricing?ref=datobra.com" rel="noopener noreferrer"&gt;Google's pricing page&lt;/a&gt;), the &lt;code&gt;minimal&lt;/code&gt; call with 908 total tokens costs about $0.0027, while &lt;code&gt;medium&lt;/code&gt; with 1,690 total tokens costs $0.005. Thinking tokens are billed as output tokens at the same rate.&lt;/p&gt;

&lt;p&gt;The takeaway from all of this is that your cost per API call has three levers, not one. Changing the provider changes how your text gets tokenized and what you pay per token. Changing the model within the same provider changes the per-token rate. And changing the reasoning effort can multiply your output tokens several times over without touching your prompt. A single request can cost $0.003 on Gemini 3 Flash at minimal thinking or $0.033 on GPT-5.4 at high effort - a 10x difference for the same question. Knowing which levers exist is the first step; the next is logging them so you can actually see what your application is spending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cached tokens
&lt;/h3&gt;

&lt;p&gt;We saw &lt;code&gt;cache_read_input_tokens: 0&lt;/code&gt; in Claude's response and &lt;code&gt;cached_tokens: 0&lt;/code&gt; in OpenAI's response earlier. Those fields exist because all three providers support some form of prompt caching and when it kicks in, it can significantly reduce your input token costs.&lt;/p&gt;

&lt;p&gt;If you send multiple requests that share the same prefix (a long system prompt, a document, a conversation history that keeps growing), the provider can cache the processed version of that shared part and reuse it on subsequent calls instead of processing it from scratch. You pay full price the first time, then a fraction of the price on every follow-up request that hits the cache.&lt;/p&gt;

&lt;p&gt;How it shows up in each provider's response:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI reports &lt;code&gt;cached_tokens&lt;/code&gt; inside &lt;code&gt;input_tokens_details&lt;/code&gt;. Cached input tokens are billed at 50% of the regular input rate. Caching happens automatically, you do not need to opt in.&lt;/li&gt;
&lt;li&gt;Claude reports &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; and &lt;code&gt;cache_read_input_tokens&lt;/code&gt; in the usage object. Cache reads are billed at 10% of the regular input rate, which is a massive discount. Claude requires you to add a &lt;code&gt;cache_control&lt;/code&gt; field to your request to enable it.&lt;/li&gt;
&lt;li&gt;Gemini reports cached tokens through its context caching API. Cached tokens are billed at 75% less than the standard input rate. Gemini also supports automatic caching on repeated prefixes without explicit configuration.&lt;/li&gt;
&lt;/ul&gt;
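&lt;p&gt;Since Claude is the one that makes you opt in, here is a sketch of what that request looks like. The payload structure follows Anthropic's prompt caching docs as I understand them, and the system prompt is a placeholder - check the current docs for minimum cacheable lengths and cache TTLs:&lt;/p&gt;

```python
# Sketch: opting into Claude prompt caching with cache_control. You mark the
# end of the stable prefix (here a long system prompt); follow-up requests
# that share the prefix are billed at the discounted cache-read rate.
# The prompt text is a placeholder - real prompts must meet the model's
# minimum cacheable length.
long_system_prompt = "You are a support assistant for ExampleCo. " * 100

request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}

# With the anthropic SDK this is client.messages.create(**request).
# First call: usage.cache_creation_input_tokens > 0.
# Repeat calls with the same prefix: usage.cache_read_input_tokens > 0.
```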

&lt;p&gt;The practical impact depends on your usage pattern. If every request is unique with no shared prefix, caching does nothing. But if you are building a chatbot with a long system prompt, doing RAG with the same document across multiple queries, or running an agent that makes repeated calls with growing context, caching can cut your input costs by 50-90% depending on the provider. That is why it matters to log cached token counts separately: if your monitoring shows cache hit rates dropping, your costs are going up even if your traffic stays flat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging and monitoring your token usage
&lt;/h3&gt;

&lt;p&gt;At this point we know how to count tokens, how to read them from API responses and how different models and reasoning levels affect the cost. The missing piece is actually tracking this in production so you can answer questions like "how much did we spend on Claude last week" or "which endpoint is burning through tokens fastest."&lt;/p&gt;

&lt;p&gt;The core idea is simple: every time you make an API call, log the usage metadata somewhere. The minimum you want to capture is the timestamp, provider, model name, input tokens, output tokens and reasoning/thinking tokens if applicable. If you are running multiple features or agents, add a tag or label so you can break down usage per use case.&lt;/p&gt;
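&lt;p&gt;A minimal version of that logger is just a function that flattens the usage object into one row. The field names and the &lt;code&gt;feature&lt;/code&gt; tag below are our own conventions, not anything the providers define:&lt;/p&gt;

```python
# Minimal usage-logging helper: one flat record per API call with the fields
# listed above. Write the returned dict to whatever store you use.
from datetime import datetime, timezone

def usage_record(provider, model, input_tokens, output_tokens,
                 reasoning_tokens=0, cached_tokens=0, feature=None):
    """Build one loggable row from a provider's usage metadata."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "provider": provider,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,        # includes reasoning/thinking tokens
        "reasoning_tokens": reasoning_tokens,  # broken out for visibility
        "cached_tokens": cached_tokens,
        "feature": feature,                    # which agent/endpoint made the call
    }

# e.g. after an OpenAI Responses API call with usage object r:
#   usage_record("openai", "gpt-5.4", r.input_tokens, r.output_tokens,
#                r.output_tokens_details.reasoning_tokens, feature="support-bot")
row = usage_record("openai", "gpt-5.4", 20, 2205, 430, feature="demo")
print(row)
```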

&lt;p&gt;Where you store this depends on your scale. For a side project or POC a SQLite database or even a CSV file is enough. For a production service you probably want PostgreSQL or a data warehouse like BigQuery or Snowflake where your data team can query it alongside other operational metrics. Some teams just push token usage into their existing logging pipeline, whether that is Datadog, CloudWatch or whatever they already use for application metrics. The storage choice does not matter much as long as the data gets captured consistently on every call.&lt;/p&gt;

&lt;p&gt;Do not calculate costs in your application code. Providers change their prices and if you hardcode rates into your logger you need to redeploy every time that happens. Instead log the raw token counts and the model name, then maintain a pricing table in your database or BI tool that maps each model to its current input and output rates. When prices change you update one table and all your historical reports recalculate automatically. Just make sure your pricing table accounts for the different token types separately: input tokens, output tokens, reasoning/thinking tokens (billed at the output rate) and cached tokens (billed at a discounted rate) all have different prices.&lt;/p&gt;
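&lt;p&gt;Here is that pattern sketched with SQLite: raw counts in one table, per-model rates in another, cost computed at query time. The rates below are illustrative placeholders, not real prices:&lt;/p&gt;

```python
# Pricing-table sketch: join raw token counts against a rates table so that
# a price change means updating one row, not redeploying the application.
# Rates here are placeholders for illustration only.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE usage_log (model TEXT, input_tokens INT, output_tokens INT, cached_tokens INT);
CREATE TABLE pricing (model TEXT, input_per_m REAL, output_per_m REAL, cached_per_m REAL);
""")
db.executemany("INSERT INTO usage_log VALUES (?,?,?,?)", [
    ("gpt-5.4", 20, 2205, 0),
    ("gemini-2.5-flash", 15, 2590, 0),
])
db.executemany("INSERT INTO pricing VALUES (?,?,?,?)", [
    ("gpt-5.4", 1.25, 15.0, 0.625),          # placeholder rates, $ per million
    ("gemini-2.5-flash", 0.30, 2.50, 0.075),
])

rows = db.execute("""
SELECT u.model,
       SUM((u.input_tokens - u.cached_tokens) * p.input_per_m
           + u.cached_tokens * p.cached_per_m
           + u.output_tokens * p.output_per_m) / 1e6 AS cost_usd
FROM usage_log u JOIN pricing p USING (model)
GROUP BY u.model
""").fetchall()
for model, cost_usd in rows:
    print(f"{model}: ${cost_usd:.4f}")
```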

&lt;p&gt;For visualization you have a range of options. The simplest is a Jupyter notebook with pandas where you query your logs, group by model or by day and plot the costs. One step up is a dashboard in your BI tool like Metabase, Superset, Looker or Tableau connected to whatever database you chose. If your team already runs Grafana for infrastructure monitoring, adding a token usage dashboard there makes sense because the people who care about API costs are often the same people watching latency and error rates. Some teams build a simple internal page with Streamlit or Retool that shows daily spend per model and alerts when costs spike. There are also third-party platforms built specifically for LLM observability like Helicone, LangSmith and Portkey that sit as a proxy around your API calls and capture token usage, latency and costs automatically.&lt;/p&gt;

&lt;p&gt;The important thing is that you log something. The exact tool does not matter. What matters is that when someone asks "how much does this feature cost us in API calls" you have a real number instead of a guess.&lt;/p&gt;

&lt;h3&gt;
  
  
  So yeah, you can count tokens
&lt;/h3&gt;

&lt;p&gt;Tokens are not random and they are not hard to track. Every provider returns them in the API response, you just need to log them somewhere and build a pricing table in your BI. If someone on Twitter tells you otherwise, now you have the receipts.&lt;/p&gt;

&lt;p&gt;Subscribe on &lt;a href="https://datobra.com" rel="noopener noreferrer"&gt;datobra.com&lt;/a&gt; to not miss new posts. Updates: &lt;a href="https://x.com/ohthatdatagirl" rel="noopener noreferrer"&gt;@ohthatdatagirl&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>SQL Interviews in the Age of LLMs: Patterns Over Queries</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Mon, 23 Mar 2026 12:23:17 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/sql-interviews-in-the-age-of-llms-patterns-over-queries-13n1</link>
      <guid>https://forem.com/olgabraginskaya/sql-interviews-in-the-age-of-llms-patterns-over-queries-13n1</guid>
      <description>&lt;p&gt;At this point most of us don't really write SQL from scratch anymore. We describe what we need, tweak a prompt, maybe adjust a few lines and move on, because the query will get written anyway and the job still gets done.&lt;/p&gt;

&lt;p&gt;Interviews, however, seem to have chosen stability over evolution. You are handed a schema, a problem statement and a blank editor and the expectation is that you will reconstruct a correct query on the spot, calmly and without external help, as if this remains the default way SQL is produced in real projects rather than something we mostly stopped doing years ago. It feels slightly time-shifted, but it is also the format that continues to decide who passes and who does not.&lt;/p&gt;

&lt;p&gt;Complaining about it does not change much, so the more practical question is how to prepare without turning it into an exercise in memorizing fifty unrelated queries that only make sense in isolation. The reassuring part is that most of these problems are not random at all. They fall into a small number of recurring patterns and once you start recognizing them, the task shifts from remembering syntax to recognizing structure: grouping, picking one row per entity, comparing sequential records, checking whether something exists, stitching start and stop events, finding streaks, detecting overlapping ranges.&lt;/p&gt;

&lt;p&gt;Even if you do not remember the exact syntax, you usually know the direction, and in a world where syntax is the easiest thing to generate, direction is far more valuable in a blank editor than perfect recall of keywords.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "SQL patterns" actually mean
&lt;/h2&gt;

&lt;p&gt;When I talk about SQL patterns, I do not mean memorizing functions or collecting a mental list of keywords. A pattern is not &lt;code&gt;ROW_NUMBER&lt;/code&gt; or &lt;code&gt;CASE&lt;/code&gt; or &lt;code&gt;EXISTS&lt;/code&gt;. It is a recurring way of looking at a problem and recognizing its shape before you even think about the exact syntax.&lt;/p&gt;

&lt;p&gt;When you read a task carefully, certain signals tend to appear. Words like "latest", "per user", "previous", "missing", "at least N" are rarely accidental. They usually point to a specific class of solution, even if the schema and business context change every time.&lt;/p&gt;

&lt;p&gt;"Latest per user" is almost never just aggregation; it usually means ranking within a group and selecting one row. "Previous value" is not about grouping at all; it is about sequential comparison and ordered data. "Users without orders" is not really about joining tables for the sake of it; it is about checking whether something exists or does not exist.&lt;/p&gt;

&lt;p&gt;That is what I mean by patterns. Most SQL interview problems fall into a relatively small set of these structures. Real problems often combine two of them, like aggregation plus existence or ranking plus conditional logic, but once you can see the individual shapes, the combinations stop being scary.&lt;/p&gt;

&lt;p&gt;One more thing before we start. Throughout this article I use CTEs (the &lt;code&gt;WITH ... AS&lt;/code&gt; syntax) to break queries into readable steps. If you have not used them before, a CTE is just a named subquery that you define at the top and reference below. It does not change what the query does, it just makes it easier to read. Think of it as giving a name to an intermediate result so you do not have to nest everything inside everything else.&lt;/p&gt;
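&lt;p&gt;If CTEs are new to you, here is the same toy query written both ways, run against a throwaway SQLite table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INT, amount INT);
    INSERT INTO orders VALUES (1, 10), (1, 20), (2, 5);
""")

# Nested version: the intermediate result is buried inside the FROM clause.
nested = conn.execute("""
    SELECT user_id
    FROM (SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id) t
    WHERE total > 10
""").fetchall()

# CTE version: identical result, but the intermediate step has a name.
with_cte = conn.execute("""
    WITH totals AS (
        SELECT user_id, SUM(amount) AS total
        FROM orders
        GROUP BY user_id
    )
    SELECT user_id FROM totals WHERE total > 10
""").fetchall()

print(nested, with_cte)  # both return user 1 (total 30)
```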

&lt;p&gt;Let's look at the patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Aggregation
&lt;/h2&gt;

&lt;p&gt;Aggregation is probably the most common pattern you will see and also the one that looks misleadingly simple. At its core it is about collapsing multiple rows into a single result per entity and making a decision at that level.&lt;/p&gt;

&lt;p&gt;The signals are usually straightforward. Phrases like "for each user," "per department," "per product" or anything that sounds like "how many" or "how much" are almost always pointing in this direction. If the problem can be rephrased as "for each X, calculate Y" you are very likely dealing with aggregation.&lt;/p&gt;

&lt;p&gt;What changes from task to task is the story. Sometimes you are counting reports, sometimes orders, sometimes transactions, sometimes computing averages or sums. What does not change is the approach: define the grouping key, compute an aggregate and possibly filter based on that aggregate. That means &lt;code&gt;GROUP BY&lt;/code&gt;, one of the aggregate functions (&lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, etc.), and &lt;code&gt;HAVING&lt;/code&gt; when the filter applies to the result of the aggregation rather than to individual rows.&lt;/p&gt;

&lt;p&gt;Let's take a classic example from the LeetCode Top SQL 50 list, &lt;a href="https://leetcode.com/problems/managers-with-at-least-5-direct-reports/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Managers with at Least 5 Direct Reports&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It asks you to find managers who have at least five people reporting to them. The signal here is "at least 5 direct reports," which is a count with a threshold, so aggregation with a &lt;code&gt;HAVING&lt;/code&gt; clause. The grouping key is &lt;code&gt;managerId&lt;/code&gt;, because you want to count how many employees share the same manager.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;managerId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Employee&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;managerId&lt;/span&gt;
    &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Employee&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;managerId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The subquery groups employees by their manager, counts each group, keeps only those with five or more, and the outer join brings back the manager's name. Group, count, filter, look up the detail you need.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: trying to return columns that are not part of the group, using &lt;code&gt;WHERE&lt;/code&gt; when you need &lt;code&gt;HAVING&lt;/code&gt; (remember, &lt;code&gt;WHERE&lt;/code&gt; filters rows before grouping, &lt;code&gt;HAVING&lt;/code&gt; filters after) and misidentifying what the grouping key actually is, which usually means you misread what "per entity" means in the problem.&lt;/p&gt;
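&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; versus &lt;code&gt;HAVING&lt;/code&gt; distinction is easy to see on a tiny table (SQLite here, but the evaluation order is standard SQL):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INT, managerId INT);
    INSERT INTO employee VALUES (1, NULL), (2, 1), (3, 1), (4, 1);
""")

# WHERE runs first, on raw rows; HAVING runs after, on the grouped result.
having = conn.execute("""
    SELECT managerId, COUNT(*) AS c
    FROM employee
    WHERE managerId IS NOT NULL   -- row-level filter, before grouping
    GROUP BY managerId
    HAVING c >= 3                 -- group-level filter, after aggregation
""").fetchall()
print(having)  # [(1, 3)]
```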

&lt;p&gt;If you want more practice with the same pattern, try &lt;a href="https://leetcode.com/problems/find-followers-count/?ref=datobra.com" rel="noopener noreferrer"&gt;Find Followers Count&lt;/a&gt; and &lt;a href="https://leetcode.com/problems/number-of-unique-subjects-taught-by-each-teacher/?ref=datobra.com" rel="noopener noreferrer"&gt;Number of Unique Subjects Taught by Each Teacher&lt;/a&gt;, both are pure single table aggregation with no distractions. For a trickier variation, &lt;a href="https://leetcode.com/problems/customers-who-bought-all-products/?ref=datobra.com" rel="noopener noreferrer"&gt;Customers Who Bought All Products&lt;/a&gt; uses the same GROUP BY + HAVING structure but compares against a subquery to check that a customer has every product, not just a fixed threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Conditional Aggregation
&lt;/h2&gt;

&lt;p&gt;Conditional aggregation is what happens when plain aggregation is not enough because you need to split the result by some condition without splitting the query itself. Instead of writing separate queries for each category and combining them, you compute multiple metrics in one pass. By the way, this is also how you pivot rows into columns in databases that do not have a dedicated &lt;code&gt;PIVOT&lt;/code&gt; keyword.&lt;/p&gt;

&lt;p&gt;The signals are tasks that ask for several counts or sums side by side, usually broken down by status, type or category. "Count how many approved and how many rejected," "total revenue from domestic vs international," "number of completed orders and number of cancelled orders per user." Whenever you see a problem that wants multiple metrics from the same table grouped the same way but filtered differently, you are looking at conditional aggregation.&lt;/p&gt;

&lt;p&gt;The approach is wrapping &lt;code&gt;CASE WHEN&lt;/code&gt; inside your aggregate functions. Instead of filtering rows out, you tell each aggregate which rows to care about. &lt;code&gt;SUM(CASE WHEN state = 'approved' THEN 1 ELSE 0 END)&lt;/code&gt; counts only approved rows, while &lt;code&gt;COUNT(*)&lt;/code&gt; still counts everything. Same grouping, same pass, different conditions.&lt;/p&gt;

&lt;p&gt;A good example from the LeetCode Top SQL 50 list is &lt;a href="https://leetcode.com/problems/monthly-transactions-i/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Monthly Transactions I&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to report for each month and country the total number of transactions, the number of approved ones, the total amount and the approved amount. The signal is that you need both "all transactions" and "only approved" metrics in the same result. That is two different filters on the same grouping, which is conditional aggregation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;DATE_FORMAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trans_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trans_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'approved'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;approved_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trans_total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'approved'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;approved_total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Transactions&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You group by month and country, then use unconditional aggregates for the totals and conditional ones for the approved subset. One query, one scan, all four metrics at once.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: writing separate queries or subqueries for each metric when a single &lt;code&gt;CASE WHEN&lt;/code&gt; inside the aggregate would do, forgetting the &lt;code&gt;ELSE 0&lt;/code&gt; (which can introduce NULLs that quietly break your sums) and confusing this pattern with &lt;code&gt;WHERE&lt;/code&gt; filtering, which would remove the rows you still need for the total counts.&lt;/p&gt;
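&lt;p&gt;The missing &lt;code&gt;ELSE 0&lt;/code&gt; pitfall is easy to reproduce (SQLite here; the behavior is the same in MySQL and PostgreSQL, since aggregating over only NULLs yields NULL):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tx (state TEXT, amount INT);
    INSERT INTO tx VALUES ('declined', 10), ('declined', 20);
""")

# With no approved rows, the CASE without ELSE produces only NULLs,
# and SUM of all NULLs is NULL rather than zero.
row = conn.execute("""
    SELECT
        SUM(CASE WHEN state = 'approved' THEN amount END)        AS no_else,
        SUM(CASE WHEN state = 'approved' THEN amount ELSE 0 END) AS with_else
    FROM tx
""").fetchone()
print(row)  # (None, 0)
```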

&lt;p&gt;If you want more practice, try &lt;a href="https://leetcode.com/problems/queries-quality-and-percentage/?ref=datobra.com" rel="noopener noreferrer"&gt;Queries Quality and Percentage&lt;/a&gt; and &lt;a href="https://leetcode.com/problems/count-salary-categories/?ref=datobra.com" rel="noopener noreferrer"&gt;Count Salary Categories&lt;/a&gt;, both use conditional logic inside aggregates on a single table.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Top N Per Group (Ranking)
&lt;/h2&gt;

&lt;p&gt;This is the pattern people confuse with aggregation most often and the difference matters. Aggregation collapses a group into one number. Ranking keeps the actual rows but picks which ones you want from each group.&lt;/p&gt;

&lt;p&gt;The signals are phrases like "latest," "highest," "most recent," "top 3," combined with a per entity qualifier like "per user," "per department," "per category." The key tell is that the problem wants you to return full rows, not just a count or a sum. If someone asks for "the highest salary per department," they do not want the number, they want the employee. That is ranking, not aggregation.&lt;/p&gt;

&lt;p&gt;The approach is window functions: &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt; or &lt;code&gt;DENSE_RANK&lt;/code&gt; partitioned by the group and ordered by whatever defines "top." You compute the rank in a subquery or CTE, then filter on it in the outer query. The choice between the three functions depends on how you want to handle ties. &lt;code&gt;ROW_NUMBER&lt;/code&gt; gives exactly one row per position regardless of ties. &lt;code&gt;RANK&lt;/code&gt; leaves gaps after ties (1, 1, 3). &lt;code&gt;DENSE_RANK&lt;/code&gt; does not leave gaps (1, 1, 2), which is what you usually want when the problem says "top 3" and means "top 3 distinct values."&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/department-top-three-salaries/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Department Top Three Salaries&lt;/a&gt; from the LeetCode Top SQL 50 list.&lt;/p&gt;

&lt;p&gt;The task asks you to find employees who earn one of the top three unique salaries in their department. The signal is "top three" + "in each department" and the problem wants employee names and salaries back, not just numbers. That is ranking. Since "top three" here means three distinct salary levels, you need &lt;code&gt;DENSE_RANK&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;RankedSalaries&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;departmentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;departmentId&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;salary_rank&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Employee&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;RankedSalaries&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Department&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;departmentId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CTE ranks every employee within their department by salary descending, then the outer query keeps only those with rank 3 or less and joins to get the department name.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: using &lt;code&gt;ROW_NUMBER&lt;/code&gt; when the problem expects ties to be preserved (two people with the same salary should both appear), using &lt;code&gt;RANK&lt;/code&gt; instead of &lt;code&gt;DENSE_RANK&lt;/code&gt; when the problem says "top 3" and means three distinct values rather than three positional ranks, and forgetting that the window function has to go in a subquery or CTE because you cannot reference it in the &lt;code&gt;WHERE&lt;/code&gt; clause of the same query that computes it.&lt;/p&gt;
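&lt;p&gt;The tie behavior of the three ranking functions is easiest to see side by side (SQLite 3.25+ here; the numbering is the same in MySQL 8 and PostgreSQL):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
conn.executescript("""
    CREATE TABLE salaries (salary INT);
    INSERT INTO salaries VALUES (300), (300), (200);
""")

rows = conn.execute("""
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,    -- 1, 2, 3
           RANK()       OVER (ORDER BY salary DESC) AS rnk,   -- 1, 1, 3 (gap)
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk   -- 1, 1, 2 (no gap)
    FROM salaries
    ORDER BY rn
""").fetchall()
print(rows)  # [(300, 1, 1, 1), (300, 2, 1, 1), (200, 3, 3, 2)]
```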

&lt;p&gt;If you want more practice, try &lt;a href="https://leetcode.com/problems/product-sales-analysis-iii/?ref=datobra.com" rel="noopener noreferrer"&gt;Product Sales Analysis III&lt;/a&gt;, which asks for the first year of sales per product. It can be solved with a simple &lt;code&gt;MIN&lt;/code&gt; subquery or with &lt;code&gt;RANK() OVER (PARTITION BY product_id ORDER BY year)&lt;/code&gt;, which makes it a good problem to compare both approaches. For something harder &lt;a href="https://datalemur.com/questions/sql-bloomberg-stock-min-max-1?ref=datobra.com" rel="noopener noreferrer"&gt;FAANG Stock Min-Max&lt;/a&gt; combines ranking with aggregation: you first compute monthly prices, then rank twice to find the highest and lowest per ticker.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Sequential Analysis (LAG / LEAD)
&lt;/h2&gt;

&lt;p&gt;This pattern shows up whenever the problem asks you to compare a row with its neighbor. Not a different group, not an aggregate, but the row right before or right after in some ordered sequence.&lt;/p&gt;

&lt;p&gt;The signals are words like "previous," "next," "change," "difference," "consecutive" or anything that implies ordered comparison. Time based data is the most common context: "compared to yesterday," "change from last month," "three consecutive days." But it also applies to any ordered sequence, like consecutive IDs or ranked entries.&lt;/p&gt;

&lt;p&gt;The approach is &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;LEAD&lt;/code&gt; window functions. &lt;code&gt;LAG&lt;/code&gt; gives you access to the previous row's value, &lt;code&gt;LEAD&lt;/code&gt; gives you the next one. You define the order with &lt;code&gt;ORDER BY&lt;/code&gt; inside the window and then you can compare, subtract or check conditions between the current row and its neighbor. The result usually goes into a subquery or CTE because you need to compute the shifted value first and then filter on it.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/consecutive-numbers/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Consecutive Numbers&lt;/a&gt; from the LeetCode Top SQL 50 list.&lt;/p&gt;

&lt;p&gt;The task asks you to find all numbers that appear at least three times consecutively. The signal is "consecutive," which means you need to look at neighbors in sequence, not count occurrences globally. This is not aggregation. You need to check that a row's value equals both its previous and its next value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;cte&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;next&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Logs&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ConsecutiveNums&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cte&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;next&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CTE adds two columns to each row: the next value and the previous value. The outer query keeps only rows where all three match, meaning the number appears at least three times in a row. &lt;code&gt;DISTINCT&lt;/code&gt; handles cases where a number appears consecutively more than three times.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: forgetting that &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;LEAD&lt;/code&gt; return NULL for the first and last rows respectively (which can break comparisons if you do not account for it), ordering by the wrong column (the order has to reflect the actual sequence, not just any column) and trying to solve sequential problems with self joins when a window function would be simpler and more readable.&lt;/p&gt;
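&lt;p&gt;The edge-row NULLs are worth seeing once, along with the optional third argument of &lt;code&gt;LAG&lt;/code&gt; that supplies a default instead (SQLite 3.25+ here; MySQL 8 and PostgreSQL accept the same arguments):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
conn.executescript("""
    CREATE TABLE logs (id INT, num INT);
    INSERT INTO logs VALUES (1, 7), (2, 7), (3, 9);
""")

rows = conn.execute("""
    SELECT id,
           LAG(num)        OVER (ORDER BY id) AS prev,      -- NULL on the first row
           LAG(num, 1, -1) OVER (ORDER BY id) AS prev_dflt  -- third arg replaces the NULL
    FROM logs
    ORDER BY id
""").fetchall()
print(rows)  # [(1, None, -1), (2, 7, 7), (3, 7, 7)]
```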

&lt;p&gt;If you want more practice, try &lt;a href="https://leetcode.com/problems/rising-temperature/?ref=datobra.com" rel="noopener noreferrer"&gt;Rising Temperature&lt;/a&gt;, which asks you to find days where the temperature was higher than the previous day. It is a simpler version of the same pattern, just two consecutive rows instead of three and worth trying both with &lt;code&gt;LAG&lt;/code&gt; and with a self join to see which reads better. For something harder &lt;a href="https://datalemur.com/questions/repeated-payments?ref=datobra.com" rel="noopener noreferrer"&gt;Repeated Payments&lt;/a&gt; uses &lt;code&gt;LAG&lt;/code&gt; partitioned by three columns to detect duplicate credit card charges within a 10 minute window.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Event Pairing
&lt;/h2&gt;

&lt;p&gt;This is my favorite pattern to ask about in interviews because it tests whether someone can recognize that rows in a table are not always independent records. Sometimes two rows are two halves of the same event and the real information only appears when you stitch them together.&lt;/p&gt;

&lt;p&gt;The signals are tables where each entity has multiple rows with a status or event type column: "start" and "stop," "open" and "close," "login" and "logout." The problem then asks for a duration, a gap or a total time. Whenever you see a table that logs state changes and the question asks about the time between them, you are looking at event pairing. This uses the same tool as sequential analysis (pattern 4) but solves a fundamentally different kind of problem: instead of comparing neighbors, you are combining them into one record.&lt;/p&gt;

&lt;p&gt;The approach is &lt;code&gt;LEAD&lt;/code&gt; (or &lt;code&gt;LAG&lt;/code&gt;): for each "start" row, grab the next row's timestamp to find the matching stop. Partition by the entity, order by time and you have your pairs.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/average-time-of-process-per-machine/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Average Time of Process per Machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to compute the average processing time per machine, where each process has a "start" and "end" row. The signal is a table with &lt;code&gt;activity_type&lt;/code&gt; being either "start" or "end," and the question asking for time between them. Two rows, one event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;paired&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;activity_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;process_id&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Activity&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;processing_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;paired&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;activity_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'start'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CTE uses &lt;code&gt;LEAD&lt;/code&gt; to attach the next timestamp to each row within the same machine and process. The outer query filters for "start" rows only, so &lt;code&gt;end_time&lt;/code&gt; is the matching stop, then averages the difference per machine.&lt;/p&gt;

&lt;p&gt;This problem can also be solved with conditional aggregation (&lt;code&gt;MAX(CASE WHEN 'start' ...)&lt;/code&gt; and &lt;code&gt;MAX(CASE WHEN 'end' ...)&lt;/code&gt; grouped by process), which works well when start and stop share an explicit key. But &lt;code&gt;LEAD&lt;/code&gt; is the more general tool, especially when rows just alternate in order and there is no shared key tying them together.&lt;/p&gt;
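&lt;p&gt;For comparison, a conditional aggregation sketch of the same problem (same &lt;code&gt;Activity&lt;/code&gt; table and columns as above) could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Collapse each (machine, process) pair's two rows into one,
-- then average the per-process durations per machine.
SELECT
    machine_id,
    ROUND(AVG(end_time - start_time), 3) AS processing_time
FROM (
    SELECT
        machine_id,
        process_id,
        MAX(CASE WHEN activity_type = 'start' THEN timestamp END) AS start_time,
        MAX(CASE WHEN activity_type = 'end' THEN timestamp END) AS end_time
    FROM Activity
    GROUP BY machine_id, process_id
) per_process
GROUP BY machine_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;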

&lt;p&gt;Common mistakes to watch for: assuming that start and stop always alternate perfectly (real data has gaps, missing events and duplicates), forgetting to filter for only "start" rows after using &lt;code&gt;LEAD&lt;/code&gt; (otherwise you also pair stop with the next start) and using the wrong partition, which pairs events from different entities together.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Gaps and Islands
&lt;/h2&gt;

&lt;p&gt;This pattern is about finding streaks in data, consecutive runs of rows that share some property, where the grouping is not explicit anywhere in the table. Nobody marked where a streak starts or ends. You have to discover it from the sequence itself.&lt;/p&gt;

&lt;p&gt;The signals are words like "consecutive," "streak," "in a row," "continuous," "uninterrupted." The problem gives you ordered data and asks you to find groups of rows that form an unbroken chain: consecutive days of activity, consecutive years of filing, consecutive months of subscription. The difference from sequential analysis (chapter 4) is that you are not just comparing neighbors, you are identifying entire groups of consecutive rows and measuring or filtering them.&lt;/p&gt;

&lt;p&gt;The approach is a classic technique that looks like a trick the first time you see it but becomes second nature quickly. The idea is: if you have a sequence of consecutive values (like dates or years) and you subtract a &lt;code&gt;ROW_NUMBER&lt;/code&gt; from each value, the result is the same for all rows in the same consecutive run and different when there is a gap. That constant becomes your group identifier.&lt;/p&gt;

&lt;p&gt;Think of it this way. If a user made purchases on days 5, 6, 7, 10, 11, the ROW_NUMBER within that user would be 1, 2, 3, 4, 5. Subtract ROW_NUMBER from the day: 5−1=4, 6−2=4, 7−3=4, 10−4=6, 11−5=6. The first streak gets group 4, the second gets group 6. The actual numbers do not matter, what matters is that consecutive rows produce the same group value.&lt;/p&gt;
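&lt;p&gt;A minimal way to see the trick in action, on hypothetical literal data (PostgreSQL syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Days 5, 6, 7 form one island (grp = 4); days 10, 11 another (grp = 6).
WITH days(day) AS (VALUES (5), (6), (7), (10), (11))
SELECT
    day,
    day - ROW_NUMBER() OVER (ORDER BY day) AS grp
FROM days;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;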

&lt;p&gt;A good example is &lt;a href="https://datalemur.com/questions/amazon-shopping-spree?ref=datobra.com" rel="noopener noreferrer"&gt;User Shopping Sprees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to find users who made purchases on 3 or more consecutive days. The signal is "3 or more consecutive days," which is exactly island detection: find streaks of consecutive dates and check their length.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;daily_purchases&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transaction_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;purchase_date&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;islands&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;purchase_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;purchase_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;purchase_date&lt;/span&gt;
        &lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;grp&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_purchases&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;islands&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grp&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first CTE deduplicates to one row per user per day, since a user might make multiple purchases on the same day and you care about distinct days. The second CTE applies the classic trick: &lt;code&gt;purchase_date - ROW_NUMBER()&lt;/code&gt; produces the same value for consecutive dates, giving each streak its own group identifier. The outer query groups by user and island, keeps only streaks of 3 or more days and returns the distinct user IDs.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: forgetting to deduplicate before applying ROW_NUMBER (duplicate dates break the consecutive subtraction trick), not partitioning ROW_NUMBER by the right entity (which merges streaks from different users into one sequence).&lt;/p&gt;

&lt;p&gt;If you want more practice, try &lt;a href="https://datalemur.com/questions/consecutive-filing-years?ref=datobra.com" rel="noopener noreferrer"&gt;Consecutive Filing Years&lt;/a&gt; on DataLemur, which is the same pattern with years instead of dates and adds a product filter on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Deduplication (Picking One Row)
&lt;/h2&gt;

&lt;p&gt;This pattern overlaps with ranking (chapter 3) but the intent is different. With ranking you are selecting the top entries from a group. With deduplication you are cleaning up data that should not have multiple rows in the first place.&lt;/p&gt;

&lt;p&gt;The signals are words like "duplicate," "remove," "keep only one," "unique per user" or "latest record." Sometimes the problem explicitly says delete. Sometimes it asks you to return a result as if duplicates did not exist. Either way the core task is the same: define what makes two rows "the same," decide which one survives, and get rid of the rest.&lt;/p&gt;

&lt;p&gt;The approach is &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt;. You partition by whatever defines a duplicate group, order by whatever decides which row wins, and then either keep &lt;code&gt;rn = 1&lt;/code&gt; in a SELECT or delete everything that is not &lt;code&gt;rn = 1&lt;/code&gt;. The same mechanic works for both reading and cleaning.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/delete-duplicate-emails/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Delete Duplicate Emails&lt;/a&gt; from the LeetCode Top SQL 50 list.&lt;/p&gt;

&lt;p&gt;The task asks you to delete all duplicate emails, keeping only the row with the smallest ID for each email. The signal is right in the title: "delete duplicate." The grouping key is email, and the tie breaker is ID.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
                &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inner query partitions by email, orders by ID, and assigns a row number. The middle query keeps only &lt;code&gt;rn = 1&lt;/code&gt;, which is the smallest ID per email. The outer DELETE removes everything else. If this were a SELECT problem instead of a DELETE, you would just use the inner two layers and return the result directly.&lt;/p&gt;
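&lt;p&gt;That SELECT form would simply be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Same ranking, but returning the surviving rows instead of deleting the rest.
SELECT id, email
FROM (
    SELECT
        id,
        email,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM Person
) ranked
WHERE rn = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;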

&lt;p&gt;This can also be solved with a self join (&lt;code&gt;DELETE p FROM Person p, Person q WHERE p.email = q.email AND p.id &amp;gt; q.id&lt;/code&gt;) which is shorter but less general. The &lt;code&gt;ROW_NUMBER&lt;/code&gt; approach scales better because if the problem changes from "smallest ID" to "most recent timestamp" you just change the ORDER BY, and if it changes from DELETE to SELECT you just drop the outer layer.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: reaching for &lt;code&gt;DISTINCT&lt;/code&gt; when the problem needs you to actually choose which row to keep (DISTINCT deduplicates the output but gives you no control over which row's data you get), not defining the ordering properly so you keep a random row instead of the one the problem specifies and forgetting about tie breaking when two rows are identical on the ordering column as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Existence / Anti-Existence
&lt;/h2&gt;

&lt;p&gt;This pattern is about checking whether something exists or does not exist in a related table, without actually needing any data from that table. You are not joining to pull columns. You are joining to answer a yes or no question.&lt;/p&gt;

&lt;p&gt;The signals are phrases like "users without orders," "never purchased," "did not make," "at least one" or "has no matching." Whenever the problem is asking you to filter one table based on the presence or absence of rows in another table, you are looking at existence logic.&lt;/p&gt;

&lt;p&gt;The approach has two common forms. The first is &lt;code&gt;EXISTS&lt;/code&gt; / &lt;code&gt;NOT EXISTS&lt;/code&gt; with a correlated subquery. The second is &lt;code&gt;LEFT JOIN&lt;/code&gt; + &lt;code&gt;IS NULL&lt;/code&gt;, where you join to the related table and then filter for rows where the join found nothing. Both do the same thing. &lt;code&gt;LEFT JOIN&lt;/code&gt; is often more intuitive for people who think in terms of joins, while &lt;code&gt;EXISTS&lt;/code&gt; can be more readable when the condition is complex.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/customer-who-visited-but-did-not-make-any-transactions/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Customer Who Visited but Did Not Make Any Transactions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to find customers who visited but did not make any transaction during that visit and count how many times that happened. The signal is "visited but did not make any transactions," which is classic anti-existence: you want visits where no matching transaction exists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;count_no_trans&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Visits&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Transactions&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visit_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visit_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;transaction_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LEFT JOIN keeps all visits regardless of whether a transaction exists. The &lt;code&gt;WHERE transaction_id IS NULL&lt;/code&gt; filters down to only the visits with no match. Then you group by customer and count. The same result could also be written with &lt;code&gt;NOT EXISTS&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;count_no_trans&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Visits&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Transactions&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visit_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visit_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are valid. Pick whichever reads more naturally to you in the moment.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: using &lt;code&gt;INNER JOIN&lt;/code&gt; when you need anti-existence (which drops exactly the rows you are looking for), using &lt;code&gt;NOT IN&lt;/code&gt; with a subquery that can return NULLs (if any value in the subquery result is NULL, &lt;code&gt;NOT IN&lt;/code&gt; returns nothing, which is one of the most subtle bugs in SQL) and confusing this pattern with a regular join when the problem does not actually need any columns from the second table.&lt;/p&gt;
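&lt;p&gt;The &lt;code&gt;NOT IN&lt;/code&gt; trap is worth seeing once. A hypothetical rewrite of this query with &lt;code&gt;NOT IN&lt;/code&gt; breaks silently if the subquery can produce a NULL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- If any t.visit_id is NULL, "visit_id NOT IN (...)" evaluates to
-- UNKNOWN for every row, so the query returns nothing at all.
SELECT v.customer_id
FROM Visits v
WHERE v.visit_id NOT IN (
    SELECT t.visit_id FROM Transactions t
);
-- Safer: filter NULLs inside the subquery
-- (WHERE t.visit_id IS NOT NULL) or use NOT EXISTS instead.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;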

&lt;p&gt;If you want more practice, try &lt;a href="https://leetcode.com/problems/employees-whose-manager-left-the-company/?ref=datobra.com" rel="noopener noreferrer"&gt;Employees Whose Manager Left the Company&lt;/a&gt;, where you need to find employees whose manager ID does not exist in the employees table anymore. For something harder &lt;a href="https://datalemur.com/questions/reactivated-users?ref=datobra.com" rel="noopener noreferrer"&gt;Reactivated Users&lt;/a&gt; applies the same anti-existence logic to time series: find users who logged in this month but did not log in the previous month.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Self-Join / Pairwise Comparison
&lt;/h2&gt;

&lt;p&gt;This pattern comes up when all the data lives in one table and you need to compare rows from that table against each other. There is no second table to join to. The relationship is inside the dataset itself.&lt;/p&gt;

&lt;p&gt;The signals are problems where entities in a table reference other entities in the same table: employees and their managers, friends and friend requests, records that need to be compared with other records from the same source. Anything where the problem says "find pairs," "compare with other rows in the same table" or where a column like &lt;code&gt;manager_id&lt;/code&gt; or &lt;code&gt;reports_to&lt;/code&gt; points back to the same table's primary key.&lt;/p&gt;

&lt;p&gt;The approach is joining the table to itself with aliases. You treat one copy as the "main" row and the other as the "related" row and the join condition defines the relationship between them. The key is being precise about that condition, because a sloppy self join can easily produce duplicate pairs or cartesian explosions.&lt;/p&gt;

&lt;p&gt;This might look similar to sequential analysis (chapter 4) since both compare rows within the same table. The difference is the type of relationship. LAG/LEAD works when rows are neighbors in an ordered sequence. Self join works when rows are connected by a key or condition that has nothing to do with ordering. You cannot solve "find each employee's manager" with LAG because there is no ordering where the next row is your manager. The connection is hierarchical, not positional, and a self join is the only way to follow it.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/the-number-of-employees-which-report-to-each-employee/?ref=datobra.com" rel="noopener noreferrer"&gt;The Number of Employees Which Report to Each Employee&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to find, for each manager, how many employees report to them and the average age of those employees. The signal is &lt;code&gt;reports_to&lt;/code&gt; pointing back to &lt;code&gt;employee_id&lt;/code&gt; in the same table. One table, two roles: the employee and their manager.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reports_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;average_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Employees&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reports_to&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One copy of the table (&lt;code&gt;e&lt;/code&gt;) represents employees, the other (&lt;code&gt;m&lt;/code&gt;) represents managers. The join connects each employee to their manager through the &lt;code&gt;reports_to&lt;/code&gt; foreign key. From there it is just aggregation: count the reports, average their age, group by manager.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: forgetting to alias the table differently on each side (which makes the query ambiguous), using the wrong join direction so you get employees without reports instead of managers with reports and not accounting for rows that match themselves when the join condition allows it, which can inflate counts or create false pairs.&lt;/p&gt;

&lt;p&gt;If you want more practice, try &lt;a href="https://leetcode.com/problems/rising-temperature/?ref=datobra.com" rel="noopener noreferrer"&gt;Rising Temperature&lt;/a&gt;, which joins the Weather table to itself to compare each day's temperature with the previous day. It shows self join used for sequential comparison rather than hierarchical relationships.&lt;/p&gt;
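&lt;p&gt;A sketch of that self join, assuming the LeetCode &lt;code&gt;Weather&lt;/code&gt; schema (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;recordDate&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;) and PostgreSQL date arithmetic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pair each day with the previous calendar day and keep the warmer ones.
SELECT w1.id
FROM Weather w1
JOIN Weather w2
    ON w1.recordDate = w2.recordDate + INTERVAL '1 day'
WHERE w1.temperature &amp;gt; w2.temperature;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;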

&lt;h2&gt;
  
  
  10. Cartesian Expansion (CROSS JOIN)
&lt;/h2&gt;

&lt;p&gt;This pattern shows up when the data you need does not exist yet. The table has records of what happened, but the problem wants you to report on every possible combination, including the ones where nothing happened. You need to generate the full matrix first and then fill in the actuals.&lt;/p&gt;

&lt;p&gt;The signals are problems that expect rows in the output for combinations that have no data. "Show attendance for every student in every subject" means you need a row even if a student never took that subject. "Revenue per product per month" means you need a row even for months with zero sales. Whenever the output should include missing combinations with zeroes or NULLs, you are looking at a CROSS JOIN.&lt;/p&gt;

&lt;p&gt;The approach is to generate all possible combinations first with &lt;code&gt;CROSS JOIN&lt;/code&gt;, then &lt;code&gt;LEFT JOIN&lt;/code&gt; to the actual data to fill in what exists. The CROSS JOIN builds the skeleton, the LEFT JOIN adds the flesh, and anything that did not match stays as zero or NULL.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/students-and-examinations/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Students and Examinations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to report how many times each student attended each exam. The catch is that every student should appear with every subject, even if they never took it. The signal is three tables (Students, Subjects, Examinations) where the output needs every student × subject pair, not just the ones with exam records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;attended_exams&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Students&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Subjects&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Examinations&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_name&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CROSS JOIN between Students and Subjects creates every possible student-subject pair. The LEFT JOIN to Examinations attaches actual attendance records where they exist. &lt;code&gt;COUNT(e.subject_name)&lt;/code&gt; counts only the non-NULL matches, so students who never took a subject get zero. Without the CROSS JOIN, those zero rows would simply be missing from the output.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: not recognizing that you need a CROSS JOIN in the first place (many people try to solve this with just LEFT JOIN and wonder why combinations with no data are missing), exploding row counts by cross joining large tables without understanding how many combinations you are generating and using &lt;code&gt;COUNT(*)&lt;/code&gt; instead of &lt;code&gt;COUNT(column_from_left_joined_table)&lt;/code&gt;, which counts the NULL rows as 1 instead of 0.&lt;/p&gt;

&lt;p&gt;If you want more practice, try &lt;a href="https://datalemur.com/questions/pizzas-topping-cost?ref=datobra.com" rel="noopener noreferrer"&gt;3-Topping Pizzas&lt;/a&gt; on DataLemur, which cross joins a table to itself three times to generate all possible 3-topping combinations and uses &lt;code&gt;&amp;lt;&lt;/code&gt; comparisons to eliminate duplicates and enforce alphabetical order in one step.&lt;/p&gt;
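&lt;p&gt;The shape of that triple self join, assuming DataLemur's &lt;code&gt;pizza_toppings&lt;/code&gt; table (&lt;code&gt;topping_name&lt;/code&gt;, &lt;code&gt;ingredient_cost&lt;/code&gt;), is roughly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The &amp;lt; conditions keep each combination once, already alphabetized.
SELECT
    CONCAT(t1.topping_name, ',', t2.topping_name, ',', t3.topping_name) AS pizza,
    t1.ingredient_cost + t2.ingredient_cost + t3.ingredient_cost AS total_cost
FROM pizza_toppings t1
JOIN pizza_toppings t2 ON t1.topping_name &amp;lt; t2.topping_name
JOIN pizza_toppings t3 ON t2.topping_name &amp;lt; t3.topping_name
ORDER BY total_cost DESC, pizza;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;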

&lt;h2&gt;
  
  
  11. Interval Overlap
&lt;/h2&gt;

&lt;p&gt;This pattern comes up when you have rows that represent ranges, usually time ranges, and you need to find where they intersect. It is surprisingly common in real interviews, especially for companies that deal with scheduling, bookings or resource allocation.&lt;/p&gt;

&lt;p&gt;The signals are problems that mention "overlapping," "conflicting," "at the same time," "double booked" or "concurrent." Whenever entities have a start and an end, and the question asks whether any of them collide, you are looking at interval overlap.&lt;/p&gt;

&lt;p&gt;The approach relies on one condition that is worth memorizing: two intervals [a_start, a_end] and [b_start, b_end] overlap when &lt;code&gt;a_start &amp;lt; b_end AND b_start &amp;lt; a_end&lt;/code&gt;. It is easier to think about it the other way: two intervals do not overlap when one ends before the other starts. The overlap condition is simply the negation of that.&lt;/p&gt;

&lt;p&gt;Here is a problem to illustrate. Imagine a &lt;code&gt;meeting_rooms&lt;/code&gt; table where each row is a booking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meeting_rooms
+------------+---------+------------------+------------------+
| booking_id | room_id | start_time       | end_time         |
+------------+---------+------------------+------------------+
| 1          | A       | 2024-03-01 09:00 | 2024-03-01 10:00 |
| 2          | A       | 2024-03-01 09:30 | 2024-03-01 10:30 |
| 3          | A       | 2024-03-01 11:00 | 2024-03-01 12:00 |
| 4          | B       | 2024-03-01 09:00 | 2024-03-01 10:30 |
| 5          | B       | 2024-03-01 10:00 | 2024-03-01 11:00 |
+------------+---------+------------------+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The task: find all pairs of bookings that conflict, meaning they are in the same room and their time ranges overlap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;booking_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;booking_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;booking_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;booking_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;meeting_rooms&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;meeting_rooms&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;booking_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;booking_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The self join pairs every two bookings in the same room. &lt;code&gt;a.booking_id &amp;lt; b.booking_id&lt;/code&gt; ensures you get each conflicting pair once instead of twice. The last two conditions are the overlap check: A starts before B ends, and B starts before A ends.&lt;/p&gt;

&lt;p&gt;For room A, bookings 1 and 2 overlap (9:00–10:00 and 9:30–10:30). Booking 3 does not conflict with either because it starts at 11:00. For room B, bookings 4 and 5 overlap (9:00–10:30 and 10:00–11:00).&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: getting the overlap condition wrong (people often check &lt;code&gt;a_start BETWEEN b_start AND b_end&lt;/code&gt;, which misses cases where A fully contains B), forgetting &lt;code&gt;a.booking_id &amp;lt; b.booking_id&lt;/code&gt; and getting duplicate pairs or rows matching themselves, and using &lt;code&gt;&amp;lt;=&lt;/code&gt; instead of &lt;code&gt;&amp;lt;&lt;/code&gt; when the business rule says that a meeting ending at 10:00 and another starting at 10:00 do not conflict.&lt;/p&gt;
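
&lt;p&gt;To see why the BETWEEN version is incomplete, take a hypothetical pair where booking A (09:00–12:00) fully contains booking B (10:00–11:00). This quick check (MySQL-style, comparing times as strings for brevity) shows the two predicates disagreeing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- A = [09:00, 12:00] fully contains B = [10:00, 11:00]
SELECT
    '09:00' BETWEEN '10:00' AND '11:00'          AS between_check, -- 0, pair missed
    '09:00' &amp;lt; '11:00' AND '10:00' &amp;lt; '12:00' AS full_check;    -- 1, pair found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;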

&lt;p&gt;If you want to practice this pattern on real problems, &lt;a href="https://datalemur.com/questions/concurrent-user-sessions?ref=datobra.com" rel="noopener noreferrer"&gt;User Concurrent Sessions&lt;/a&gt; and &lt;a href="https://leetcode.com/problems/merge-overlapping-events-in-the-same-hall/?ref=datobra.com" rel="noopener noreferrer"&gt;Merge Overlapping Events in the Same Hall&lt;/a&gt; are both good interval overlap exercises, though both require a paid subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Running Totals / Cumulative Metrics
&lt;/h2&gt;

&lt;p&gt;This pattern is about accumulating values as you move through an ordered sequence. You are not grouping rows into buckets like aggregation and you are not comparing with neighbors like LAG/LEAD. You are keeping a running value that grows with each row.&lt;/p&gt;

&lt;p&gt;The signals are words like "running total," "cumulative," "so far," "up to this point," "as of each date." Sometimes the problem does not use those words directly but describes a threshold that gets checked row by row, which is a running total in disguise.&lt;/p&gt;

&lt;p&gt;The approach is &lt;code&gt;SUM() OVER (ORDER BY ...)&lt;/code&gt;. The ORDER BY inside the window defines the sequence and the SUM accumulates as it moves through it. Add &lt;code&gt;PARTITION BY&lt;/code&gt; if the accumulation resets per group.&lt;/p&gt;

&lt;p&gt;A good example is &lt;a href="https://leetcode.com/problems/last-person-to-fit-in-the-bus/description/?envType=study-plan-v2&amp;amp;envId=top-sql-50&amp;amp;ref=datobra.com" rel="noopener noreferrer"&gt;Last Person to Fit in the Bus&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The task asks you to find the last person who can board a bus with a 1000 kg weight limit, where people board in order of their &lt;code&gt;turn&lt;/code&gt; column. The problem does not say "running total" but that is exactly what it is: you accumulate weight person by person and stop when you hit the limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;cumulative&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;person_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_weight&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;person_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cumulative&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;total_weight&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_weight&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CTE computes a running total of weight ordered by turn. The outer query keeps only rows where the total is still within the limit and picks the last one. No GROUP BY anywhere, because this is not aggregation. Each row retains its identity and gets a cumulative value attached to it.&lt;/p&gt;

&lt;p&gt;Common mistakes to watch for: omitting the ORDER BY inside the window function (which makes the accumulation order undefined and the results unpredictable), confusing this with GROUP BY (which collapses rows, while a window function keeps them) and partitioning when you should not or vice versa, which either resets the running total too often or never resets it when it should.&lt;/p&gt;

&lt;p&gt;If you want more practice, try &lt;a href="https://leetcode.com/problems/restaurant-growth/?ref=datobra.com" rel="noopener noreferrer"&gt;Restaurant Growth&lt;/a&gt;, which asks for a rolling 7 day average of restaurant spending. It uses the same &lt;code&gt;SUM() OVER (ORDER BY ...)&lt;/code&gt; mechanic but with a &lt;code&gt;RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW&lt;/code&gt; frame, adding a layer of complexity on top of the basic running total.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You do not need to memorize twelve patterns. You need to solve enough problems that when you read a new one, something clicks and you think "I have seen this shape before." That is the whole point. The blank editor is not asking you to recall syntax. It is asking whether you know where you are going. If you do, the rest is just typing.&lt;/p&gt;

&lt;p&gt;There are other patterns you will encounter, but most of them are combinations of these twelve. "Find employees earning above their department average" sounds like its own thing, but it is really just a window function (&lt;code&gt;AVG() OVER (PARTITION BY department)&lt;/code&gt;) attached to each row and then filtered, which is running totals logic applied to a different aggregate. Pivoting rows into columns is just conditional aggregation with &lt;code&gt;CASE WHEN&lt;/code&gt; inside &lt;code&gt;SUM&lt;/code&gt; or &lt;code&gt;COUNT&lt;/code&gt;. Once you have the building blocks, the combinations assemble themselves.&lt;/p&gt;
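
&lt;p&gt;For example, the department average version could look like this (assuming a hypothetical &lt;code&gt;employees&lt;/code&gt; table). The window value is attached to every row and the outer query filters on it, since window functions cannot appear directly in WHERE:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary
FROM (
    SELECT
        name,
        salary,
        AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
) t
WHERE salary &amp;gt; dept_avg;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;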

&lt;p&gt;For quick reference, here is the cheat sheet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Core tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation&lt;/td&gt;
&lt;td&gt;"for each," "how many," "per user"&lt;/td&gt;
&lt;td&gt;GROUP BY + HAVING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conditional aggregation&lt;/td&gt;
&lt;td&gt;multiple metrics, status breakdown&lt;/td&gt;
&lt;td&gt;CASE WHEN inside SUM/COUNT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top N per group&lt;/td&gt;
&lt;td&gt;"latest," "highest," "top 3 per"&lt;/td&gt;
&lt;td&gt;DENSE_RANK() OVER (PARTITION BY)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequential analysis&lt;/td&gt;
&lt;td&gt;"previous," "next," "consecutive"&lt;/td&gt;
&lt;td&gt;LAG / LEAD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event pairing&lt;/td&gt;
&lt;td&gt;"start/stop," "duration," "session"&lt;/td&gt;
&lt;td&gt;LEAD partitioned by entity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gaps and islands&lt;/td&gt;
&lt;td&gt;"streak," "consecutive days," "in a row"&lt;/td&gt;
&lt;td&gt;ROW_NUMBER subtraction trick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deduplication&lt;/td&gt;
&lt;td&gt;"duplicate," "keep one," "unique per"&lt;/td&gt;
&lt;td&gt;ROW_NUMBER() WHERE rn = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existence / anti-existence&lt;/td&gt;
&lt;td&gt;"without," "never," "has no"&lt;/td&gt;
&lt;td&gt;LEFT JOIN IS NULL / NOT EXISTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-join&lt;/td&gt;
&lt;td&gt;"same table reference," "reports to"&lt;/td&gt;
&lt;td&gt;JOIN table to itself with aliases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cartesian expansion&lt;/td&gt;
&lt;td&gt;"all combinations," "missing pairs"&lt;/td&gt;
&lt;td&gt;CROSS JOIN + LEFT JOIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interval overlap&lt;/td&gt;
&lt;td&gt;"conflicting," "double booked"&lt;/td&gt;
&lt;td&gt;a_start &amp;lt; b_end AND b_start &amp;lt; a_end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Running totals&lt;/td&gt;
&lt;td&gt;"cumulative," "so far," "up to"&lt;/td&gt;
&lt;td&gt;SUM() OVER (ORDER BY)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Subscribe on &lt;a href="https://datobra.com" rel="noopener noreferrer"&gt;datobra.com&lt;/a&gt; to not miss new posts. Updates: &lt;a href="https://x.com/ohthatdatagirl" rel="noopener noreferrer"&gt;@ohthatdatagirl&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>interview</category>
      <category>career</category>
    </item>
    <item>
      <title>AI Gave Us Lemons. We Picked Limoncello</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Sat, 21 Feb 2026 09:57:27 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/ai-gave-us-lemons-we-picked-limoncello-5en4</link>
      <guid>https://forem.com/olgabraginskaya/ai-gave-us-lemons-we-picked-limoncello-5en4</guid>
      <description>&lt;p&gt;Here's the thing. These days it's become genuinely hard to maintain professional motivation in software engineering. The level of bullshit is off the charts, everyone lies from companies to ChatGPT. AI rewrote this, AI replaced those people, AI this AI that blah blah blah. Today you're a lead engineer, people listen to you, tomorrow you're on the street under a bridge. Why? Well, random picked you. So what's the point of continuing to do engineering? Maybe I should not have listened to my mom and become a surgeon after all, but we are where we are.&lt;/p&gt;

&lt;p&gt;AI really amplified this problem. Before you could at least stand tall on your skills, now that everyone is convinced three Claude Code agents can write any code, it's become very hard to prove you're something more than a useless bunch of cells, a layer between a manager and the real deal, the LLM. On one hand sure, we can separate work and life, you know, touch grass, do pilates, drink water. On the other hand the professional part of us is a big sore spot. Who am I? Another engineer who'll get tossed in the garbage? A cog in the machine? Even if you understand this is a transitional period between one stage of the industry and another, it's all incredibly demotivating. You lose all will to try.&lt;/p&gt;

&lt;p&gt;Even before LLMs I noticed that it's very important to have some unkillable part of yourself that lying employers or a bad market can't take away from you, something that represents your professional self, a little piece of yourself in something. You know that dream about a cool pet project, or a blog, or teaching on the side? Why does that dream exist? Because a person needs some kind of foundation so that the next layoff or the next Super Cool Amazing Technology doesn't rip them out of the ground, roots and all. Yeah, you can trample my garden and break my roof, but I have a basement full of canned food, I will actually survive. Yes, I'm a big fan of post-apocalyptic fiction.&lt;/p&gt;

&lt;p&gt;And I have to say, despite everything, AI somehow improved access to building that basement. With all the quick solutions, cloud stuff and various deployment options, even a person who isn't super familiar with every stage of the process from idea to deployment can do it outside of work hours, build their own basement with canned food. On top of that it's an important part of professional fulfillment, setting yourself challenges like that, doing things you never get to try at work, seeing the full development cycle, plus trying to do your own marketing and sales. It's very sobering. It's hard to sell you bullshit and hard to sell yourself short after something like that.&lt;/p&gt;

&lt;p&gt;In this article I want to share how we, a data engineer (me) and a full stack engineer (my brother, because who else would agree to this), built a cool (but of course completely unprofitable) project over three months, from idea to an actually running website with a product and what came out of it for us.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Started
&lt;/h2&gt;

&lt;p&gt;Every developer at some point has that conversation with themselves where they go "I should really build something of my own" and then immediately open Netflix instead. We were no different, except this time there was a challenge involved and for some reason challenges work on engineers the way laser pointers work on cats.&lt;/p&gt;

&lt;p&gt;We came across a &lt;a href="https://dev.to/olgabraginskaya/wykra-web-you-know-real-time-analysis-20i3?ref=datobra.com"&gt;challenge&lt;/a&gt; from Bright Data and n8n on Dev.to. If you're not familiar with &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;, the short version is: they give you access to the kind of data that big corporations created, collected, profited from and then hid behind walls from the very people who generated it in the first place. I'm a big fan of making things accessible and an even bigger fan of hating monopolies, so when I saw this my immediate reaction was "cool, let's do something with that."&lt;/p&gt;

&lt;p&gt;The idea behind our &lt;a href="https://wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;Wykra&lt;/a&gt; is pretty simple to explain to a normal human being. Say you're a small brand or a marketer and you need to find creators on Instagram or TikTok who actually match what you're looking for. Like, vegan food bloggers in Portugal with 10 to 50 thousand followers. Right now your options are either scrolling for three hours or paying a platform that charges you like it's solving world hunger. We thought, okay, we can probably build something that does this, how hard can it be, famous last words obviously.&lt;/p&gt;

&lt;p&gt;We submitted our challenge entry and then something unexpected happened: people actually liked it. Turns out I'm pretty good at selling an idea. The response was big enough that we looked at each other and said okay, this deserves more than a one-off post. That's how the whole &lt;a href="https://dev.to/olgabraginskaya/build-in-public-day-zero-end?ref=datobra.com"&gt;build in public&lt;/a&gt; thing started. Because apparently the logical next step after "people liked our thing" is "let's commit to writing about it every week while working full time jobs, what could go wrong."&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Looks Like From the Inside
&lt;/h2&gt;

&lt;p&gt;Here's something that happens when you build a side project as an engineer with a full time job: you suddenly discover that engineering is maybe 30% of the actual work.&lt;/p&gt;

&lt;p&gt;We became our own product managers. What should this thing actually do? Who is it for? What's the MVP? These sound like obvious questions until you're the ones who have to answer them and there's no product person to argue with. We became our own QA. Which means we broke things, found the bugs, got angry at whoever wrote this code, remembered it was us and fixed them. We became our own designers. Please take a moment to appreciate the &lt;a href="https://wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;purple color&lt;/a&gt; &lt;code&gt;#422e63&lt;/code&gt; because I picked it and I'm unreasonably proud of it. We became our own marketers and let me tell you, this is where the real education happens. You can build the most elegant system in the world and then sit there watching your beautiful product get zero clicks because you wrote the landing page copy like engineers. Or, and this is my favorite example, you can launch your influencer discovery tool with GitHub login as the only authentication option. Because nothing says "welcome, marketing professionals" like asking them to log in with a developer account they don't have.&lt;/p&gt;

&lt;p&gt;We also became our own researchers. Understanding the influencer marketing space, figuring out what people actually need versus what we assumed they need, reading about how competitors do things, learning that most of them are also kind of winging it. Very comforting honestly.&lt;/p&gt;

&lt;p&gt;And then there's the build in public part. Week one: you write a post, it gets decent traction, people are interested, you hit &lt;a href="https://dev.to/devteam/top-7-featured-dev-posts-of-the-week-1g8e?ref=datobra.com"&gt;top of the week&lt;/a&gt; on Dev.to, you feel like a startup founder giving a TED talk. Week two: still going strong, numbers look good, your motivation could power a small city. Week five: your cat is sick, you're tired, work was hell this week and you need to write something coherent about your progress but you haven't made any progress because life happened. Week seven: you skip an update and feel guilty about it like you missed a deadline at work except nobody is paying you for this. We kept a build in public series going for about eleven posts. The first two did well, after that the readership slowly settled into what I can only describe as a small dedicated group of people who clearly knew what they signed up for.&lt;/p&gt;

&lt;p&gt;Somewhere around month two we almost quit. When your team is two people and one of them is your brother, every disagreement about architecture or priorities or "should we even keep doing this" carries about ten times more weight than it would with a coworker you can forget about after standup. There was a week where we barely spoke about the project and I genuinely thought that was it, we're done, this was a cute experiment and now it's over. The thing that pulled us back was embarrassingly simple: we'd already told people we were building this. The posts were out there, people were reading them. Quitting quietly wasn't really an option anymore, and quitting publicly felt worse than just fixing whatever was broken and moving on.&lt;/p&gt;

&lt;p&gt;The honest truth about build in public is that it requires either a wealthy uncle funding your free time or a very comfortable stock plan from a big tech company. For the rest of us it's a constant negotiation between wanting to share and wanting to sleep. But even with all that the discipline of having to explain what you're doing every week forces you to actually think about what you're doing. And that part is genuinely valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Practical Part&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At some point you stop being a fun side project and start being a small business that spends actual money. You need a Google Workspace because you want to look like a real company when you send emails. You need API access to actually scrape social media platforms, which costs money because nobody lets you do that for free and for good reason. You need LLM calls for analysis, which means paying for models through something like OpenRouter. You need infrastructure to run all of this, which means a hosting platform. You need monitoring because things will break at 3am and finding out from an angry user is significantly worse than finding out from an alert on your phone.&lt;/p&gt;

&lt;p&gt;I'm not going to drop exact numbers here, but I will say this: if you're smart about how you use LLMs, pick the right model for the right task instead of throwing GPT PRO at everything, cache aggressively and keep your architecture simple, the total cost of running something like this is surprisingly manageable. We're talking "a couple of nice dinners" territory, not "second mortgage" territory. The time cost is harder to quantify. Three months of evenings and weekends, some more intense than others. The average is probably somewhere around ten to fifteen hours a week if you spread it out, which sounds fine until you remember that those hours come after your actual job.&lt;/p&gt;

&lt;p&gt;But say you're reading this and thinking okay, I actually want to try. The problem is that the gap between "I want to build something" and actually building it is enormous and most people live in that gap forever, thinking about it in the shower and then doing absolutely nothing about it. So let me walk you through how we got past it, because looking back it's less mysterious than it felt at the time.&lt;/p&gt;

&lt;p&gt;Coming up with an idea is the part everyone overthinks. People sit around waiting for some brilliant original concept that will disrupt an industry and that waiting period conveniently never ends. Look at what annoys you, look at what annoys people around you, look at challenges and hackathons happening online. We literally found ours because Bright Data posted a challenge and we went "sure, why not." Your idea doesn't need to be revolutionary, it needs to be something you can explain in two sentences and something you care enough about to still want to work on at 10pm after eight hours of your actual job.&lt;/p&gt;

&lt;p&gt;Starting is the part that feels impossible and is actually the simplest thing in the world. And here, ironically enough given everything I said in the introduction about AI making our lives miserable, is where LLMs actually become incredibly useful, but probably not in the way you think. Forget about asking them to write your code. Instead, open a chat and say "I want to build a calorie calculator but I have no idea where to start, be my coach." Tell it what you know, what you don't know, what scares you about the process and let it walk you through it step by step. Ask it to break down the project into pieces small enough that each one feels doable on a Tuesday evening. Ask it what to build first and what to ignore for now. The same technology that's causing all this professional existential dread turns out to be the best free project coach you've ever had and the universe clearly has a sense of humor about these things.&lt;/p&gt;

&lt;p&gt;Keeping going is where it gets genuinely hard because the initial excitement wears off somewhere around week three and suddenly you're staring at your codebase on a Friday evening thinking about all the other things you could be doing with your life. If public accountability isn't your style, find a friend, a Discord server, a coworker, literally anyone who will periodically ask "so how's that project going" and make you feel just uncomfortable enough to continue.&lt;/p&gt;

&lt;p&gt;Spending money is the moment where your brain starts negotiating with you. You need a domain, hosting, API access, some LLM credits and the voice in your head goes "wait, we're spending real money on something that might never earn a single dollar back, is this wise?" Push through that voice. The amounts involved are genuinely small if you're smart about it. Platforms like Render or Railway or Fly.io will host your thing for the price of two coffees a month. OpenRouter gives you access to LLMs without requiring a second mortgage. Cloudflare will sell you a domain for less than lunch. And honestly, a huge thank you to the companies that offer free tiers, and to the open-source models out there, because for people for whom money is the thing that blocks them from even starting, this matters more than those companies probably realize. I'm personally a big fan of Neon for databases and Streamlit for quick prototyping, both of which let you get surprisingly far without paying anything at all. You have absolutely spent more money on things that brought you less satisfaction, I can almost guarantee it.&lt;/p&gt;

&lt;p&gt;Deploying used to be its own special circle of hell but in 2026 you can push your code to GitHub and have a live website with a real URL in minutes. Vercel, Railway, Render, pick whichever one you like, connect your repository, hit deploy and watch it happen. If you've never done this before it genuinely feels like magic the first time and the important thing is to do it early and do it ugly, because a running ugly thing that real humans can actually visit is infinitely more real than a beautiful polished masterpiece sitting on your localhost that nobody will ever see.&lt;/p&gt;

&lt;p&gt;And that's it really. Idea, start, keep going, spend a little money, deploy. It sounds like a lot when you list it out but each individual step is something you can figure out in an evening and before you know it you have a real thing running on a real URL that you can show to real people. The whole process is less about talent or genius and more about stubbornness and refusing to stop when your brain is begging you to go do something easier. Which, if you think about it, is basically what engineering has always been.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limoncello
&lt;/h2&gt;

&lt;p&gt;So after three months of building, writing, debugging, designing, marketing, spending money and occasionally questioning our sanity, what did we actually get? A running product that nobody uses and the knowledge that we can do the whole thing from start to finish. Some of it well, some of it in a way we should probably never examine too closely, but all of it done.&lt;/p&gt;

&lt;p&gt;We collected eight stars on our GitHub &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. Eight. I'm going to put that on my resume.&lt;/p&gt;

&lt;p&gt;We also discovered that this resonates with people way more than we expected. The frustration, the desire to own something professionally, the need for that basement with cans. A lot of people feel this and a lot of them want to do something about it but get stuck in that gap between wanting and doing. The fact that we went from "this would be cool" to a real running website was apparently inspiring to some folks, which is both flattering and a little sad because it really shouldn't be that rare.&lt;/p&gt;

&lt;p&gt;As for what's next, right now I'm packing a suitcase to fly to Japan for two weeks and air out my brain and after that we'll see. Our Instagram search is still not great, we still don't search other social networks or Google Maps, I'm still unhappy with the level of analytics we provide and I still need to learn the subtle art of attracting users without attracting the attention of a psychiatrist. We also have the option of launching on Product Hunt and Y Combinator has a round in April, so who knows.&lt;/p&gt;

&lt;p&gt;But here's the part that matters more than any roadmap. The reality that made everything worse also handed us the tools to build something of our own. The same AI that threatens to replace us helps us prototype faster. The same cloud infrastructure that big companies use to run their empires is available to two people working after dinner. The same internet that's full of doom and gloom about engineering careers is also full of people who want to see you build something and will cheer you on while you do it.&lt;/p&gt;

&lt;p&gt;You are going to get squeezed by this industry, that part is probably unavoidable. You're going to get handed lemons. The standard advice is to make lemonade, smile, be grateful, pivot, adapt. We chose limoncello instead, because it takes longer, it's more work, nobody asked for it and in the end you have something with a bit more kick to it. It will cost you your evenings, some of your money and a lot of stubbornness. The result might be completely unprofitable. But you'll end up with a basement full of cans and when the next storm comes, and it will, you'll know you can survive it.&lt;/p&gt;

&lt;p&gt;If you want to dig around in our basement: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;github.com/wykra-io/wykra-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you want to see what else I write about when I'm not making limoncello: &lt;a href="https://datobra.com/?ref=datobra.com" rel="noopener noreferrer"&gt;datobra.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 10. Making It Less Fragile</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Tue, 10 Feb 2026 17:16:53 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-10-making-it-less-fragile-5eec</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-10-making-it-less-fragile-5eec</guid>
      <description>&lt;p&gt;Build in public has a funny way of changing over time. In the beginning you write about ideas, architecture, big plans and bold assumptions. A few weeks later you mostly write about why something broke, how it broke and what you had to fix so it doesn’t break the same way again tomorrow. This week was very much the second kind.&lt;/p&gt;

&lt;p&gt;Nothing fundamentally new appeared on the surface though internally a lot of small, slightly annoying, but very necessary things got cleaned up. The kind of work that doesn’t look impressive in a demo, but immediately shows up the moment a real user does something slightly unexpected.&lt;/p&gt;

&lt;p&gt;One of the first things we realized is that long-running searches need a way to be stopped, even if everything is technically working as designed. When a search can take ten or twenty minutes, there’s always a moment where you understand you asked the wrong question or just don’t care anymore and want to move on. Until now the system had no real concept of cancellation, tasks would just run to completion because that’s what they were built to do. This week we added proper stop support, so ongoing work can actually be cancelled all the way down, including the &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; side.&lt;/p&gt;
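&lt;p&gt;The mechanics are the usual cooperative-cancellation idea: long work checks a shared flag between expensive steps instead of being killed mid-flight. A minimal sketch of the pattern, with illustrative names rather than Wykra’s actual code:&lt;/p&gt;

```typescript
// Minimal sketch of cooperative cancellation for a long-running task;
// names (CancellationToken, runSearch) are illustrative, not Wykra's API.
class CancellationToken {
  private cancelled = false;
  cancel() {
    this.cancelled = true;
  }
  throwIfCancelled() {
    // A long pipeline calls this between expensive steps, so "stop"
    // takes effect at the next checkpoint instead of after 20 minutes.
    if (this.cancelled) throw new Error("Task cancelled");
  }
}

async function runSearch(token: CancellationToken, steps: Function[]) {
  const completed: number[] = [];
  for (let i = 0; steps.length > i; i++) {
    token.throwIfCancelled(); // checkpoint before each expensive step
    await steps[i]();
    completed.push(i);
  }
  return completed;
}
```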

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefkqpyy73x9jyz1luoch.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefkqpyy73x9jyz1luoch.gif" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also spent some time fixing chat context handling. This wasn’t a single dramatic bug, more like a collection of small papercuts that made the system feel slightly confused in longer conversations. Things like answering a perfectly valid question, but with the wrong mental model of what the user was asking for. Not broken enough to scream about it, but broken enough to slowly erode trust. Those kinds of issues are hard to spot early and very obvious once you do.&lt;/p&gt;

&lt;p&gt;Another realization came from the admin side. Different searches have very different costs and not everyone should be able to dial everything up to “expensive mode” just because they can. We added an Effort selector for Instagram searches, but only exposed it in the admin panel. Regular users stay on the cheaper, safer defaults. It’s one of those decisions that feels slightly boring until you start paying real bills.&lt;/p&gt;

&lt;p&gt;Authentication was another area where reality hit. Up until now GitHub login was enough for us, mostly because engineers will happily click “Continue with GitHub” without thinking twice. But marketers, which are very much part of the target audience here, do not all have GitHub accounts. This week we added email + password login and wired up confirmation emails via Postmark. That part is currently waiting for manual approval on their side, which is a good reminder that not everything in the stack is instantly automatable, no matter how modern it claims to be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1859xgt5s79325lujdsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1859xgt5s79325lujdsm.png" width="800" height="748"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While we were at it, we also added “Continue with Google”, because at some point you just accept that this button is table stakes. Not exciting, but necessary if you want people to actually get past the login screen.&lt;/p&gt;

&lt;p&gt;We also discovered that the Telegram app wasn’t actually working. Not catastrophically broken, just broken enough. That’s fixed now and the Telegram flow is back where it should be.&lt;/p&gt;

&lt;p&gt;Overall this was a very unglamorous week. The app got a bit more predictable, a bit cheaper to run and slightly less embarrassing when something goes wrong. At this stage, that feels like real progress.&lt;/p&gt;

&lt;p&gt;If you want to see how all of this is wired together, the code is still here:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you’ve ever gone through this “everything works except it doesn’t” phase in your own projects you probably know exactly what this week felt like.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 9. The Shape of Wykra</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Sat, 31 Jan 2026 14:05:31 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-9-the-shape-of-wykra-hd7</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-9-the-shape-of-wykra-hd7</guid>
      <description>&lt;p&gt;Build in public is an interesting experiment overall. You get new readers, some of them even stick around, you start getting invited into different communities, you end up with a proud count of EIGHT stars on your repository and at some point you inevitably find yourself trying to fit into some LLM-related program just to get free credits and avoid burning through your own money too fast. I honestly think everyone should try something like this at least once, if only to understand how it actually feels from the inside.&lt;/p&gt;

&lt;p&gt;At the same time there are obvious downsides. Writing updates every single week while having a full-time job requires a level of commitment that is harder to sustain than it sounds, because real life has a habit of getting in the way: a sick cat, a work emergency, getting sick yourself or just being too tired to produce something coherent. After a while it starts to feel uncomfortably close to a second job and I’ve had to admit that I’m probably not as committed to blogging as I initially thought I was. Honestly, keeping a build-in-public series going for more than a couple of months requires either a wealthy uncle or a very solid stock plan from a big company.&lt;/p&gt;

&lt;p&gt;The work itself didn’t stop. Things kept moving, the system kept evolving and at some point it made sense to pause and do a proper recap of what we’ve actually been building. Yes, we skipped three weekly updates, but looking at the current state of the project, I’d say the result turned out pretty well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Wykra Does, In One Paragraph
&lt;/h2&gt;

&lt;p&gt;Before getting into the details it’s worth briefly recalling how this started. Wykra began as a small side project built mostly for fun as part of a &lt;a href="https://dev.to/olgabraginskaya/wykra-web-you-know-real-time-analysis-20i3?ref=datobra.com"&gt;challenge&lt;/a&gt;, without any serious expectations or long-term plans, and somewhere along the way turned into whatever this is now. What it actually does: you tell it something like "vegan cooking creators in Portugal with 10k–50k followers on Instagram/TikTok" and it goes hunting across Instagram and TikTok, scrapes whatever profiles it finds, throws them at a language model for analysis and gives you back a ranked list with scores and short explanations of why each profile ended up there. You can also just give it a specific username if you already have someone in mind and want to figure out whether they're actually worth reaching out to.&lt;/p&gt;

&lt;p&gt;Since the original challenge post this has turned into a nine-post series on Dev.to and before moving on it's worth taking a quick look at how those posts actually performed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nlxt7ontkokcg49ygbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nlxt7ontkokcg49ygbp.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see the first two posts did pretty well and after that the numbers slowly went down. At this point the audience mostly consists of people who clearly know what they signed up for and decided to stay anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Users Actually See
&lt;/h2&gt;

&lt;p&gt;At this point it makes more sense to stop talking and just show what this actually looks like now.&lt;/p&gt;

&lt;p&gt;The first thing you hit is the landing page at &lt;a href="https://wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;wykra.io&lt;/a&gt;, which tries to explain what this thing is in about five seconds. I'm genuinely more proud of the fact that we have a landing page at all than of the fact that we even have a domain for email. Also please take a moment to appreciate this very nice purple color, &lt;code&gt;#422e63&lt;/code&gt;, because honestly it's pretty great.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhkdbxmxfzagg70dhb6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhkdbxmxfzagg70dhb6g.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also have a logo that Google Nano Banana generated for us, it’s basically connected profiles drawn as a graph, which is exactly what this thing is about.&lt;/p&gt;

&lt;p&gt;After that you can sign up via GitHub because we still need some way to know who's using this and prevent someone from scraping a million dollars' worth of data in one go. Once you're in, you end up in a chat interface that keeps the full history and very openly tells you that searches can take a while, up to 20 minutes in some cases. Sadly there's no universe where this kind of discovery runs in five or six seconds. That's just how it works when you're chaining together web search, scraping and LLM calls.&lt;/p&gt;

&lt;p&gt;Eventually you get back a list of profiles the system thinks are relevant, along with a score for each one and a short explanation of why it made the cut.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcl7y78lg7warpg8ayg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcl7y78lg7warpg8ayg3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also ask for an analysis of a specific profile if you want to sanity-check whether someone is actually any good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym0glgtdto5zm9jssx9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym0glgtdto5zm9jssx9d.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you do that you get a quick read on what the account is actually about: the basic stats, a short summary written in human words and a few signals around niche, engagement and overall quality. It's not trying to pass final judgment on anyone, it just saves you from opening a dozen tabs and scrolling for twenty minutes to figure out whether a profile looks legit.&lt;/p&gt;

&lt;p&gt;You can also use the whole thing directly in Telegram if the web version isn't your style. Same interface, same flows, just inside Telegram instead of a browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  And Now, the Nerd Stuff
&lt;/h2&gt;

&lt;p&gt;For anyone who cares about how this is actually put together, here’s the short version of the stack.&lt;/p&gt;

&lt;p&gt;The backend is built with NestJS and TypeScript with PostgreSQL as the main database and Redis handling caching and job queues. For scraping Instagram and TikTok we use &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;, which takes care of the messy part of fetching profile data without us having to fight platforms directly. All LLM calls go through &lt;a href="https://www.langchain.com/?ref=datobra.com" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; and &lt;a href="https://openrouter.ai/?ref=datobra.com" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;, which lets us switch between different models without rewriting half the code every time we change our mind. Right now &lt;a href="https://ai.google.dev/gemini-api/docs/models?ref=datobra.com" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt; is the main workhorse and &lt;a href="https://platform.openai.com/docs/models?ref=datobra.com" rel="noopener noreferrer"&gt;GPT&lt;/a&gt; with a web plugin handles discovery, but the whole point is that this setup stays flexible. Metrics are collected with &lt;a href="https://prometheus.io/?ref=datobra.com" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, visualized in &lt;a href="https://grafana.com/?ref=datobra.com" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; and anything that breaks loudly enough ends up in Sentry.&lt;/p&gt;

&lt;p&gt;The frontend is React 18 with TypeScript, built with Vite and deliberately boring when it comes to state management. Just hooks, no extra libraries. It also plugs into &lt;a href="https://core.telegram.org/bots/webapps?ref=datobra.com" rel="noopener noreferrer"&gt;Telegram's Web App SDK&lt;/a&gt;, which is why the same interface works both in the browser and inside Telegram without us maintaining two separate apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  For People Who Like Diagrams
&lt;/h2&gt;

&lt;p&gt;If you're the kind of person who prefers one picture over five paragraphs of explanation, this part is for you. Below is a rough diagram of how Wykra is wired up right now. It's not meant to be pretty or final, just a way to see where things live and how data moves through the system.&lt;/p&gt;

&lt;p&gt;If you trace a single request from top to bottom, you're basically watching what happens when someone types a message in the app: the API accepts it, long-running work gets pushed into queues, processors do their thing, external services get called, results get stored and errors get yelled about.&lt;/p&gt;

&lt;p&gt;All LLM calls go through OpenRouter with Gemini 2.5 Flash doing most of the day-to-day work like profile analysis, context extraction and chat routing and GPT-5.2 with the web plugin used specifically for discovering Instagram profile URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All LLM calls → OpenRouter API
    ├─ Gemini 2.5 Flash (primary workhorse)
    │ ├─ Profile analysis
    │ ├─ Context extraction
    │ └─ Chat routing
    │
    └─ GPT-5.2 with web plugin
      └─ Instagram URL discovery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
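&lt;p&gt;That routing boils down to a tiny dispatch on task type. A sketch of the idea, where the model IDs are illustrative OpenRouter-style names and the task labels are mine rather than Wykra’s actual enum:&lt;/p&gt;

```typescript
// Sketch of the model dispatch described above; model IDs are illustrative
// OpenRouter-style names, and the task labels are not Wykra's actual code.
function pickModel(task: string): string {
  switch (task) {
    case "profile-analysis":
    case "context-extraction":
    case "chat-routing":
      return "google/gemini-2.5-flash"; // primary day-to-day workhorse
    case "instagram-url-discovery":
      return "openai/gpt-5.2"; // web-plugin discovery
    default:
      throw new Error(`Unknown task: ${task}`);
  }
}
```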

&lt;h2&gt;
  
  
  The Search Flow
&lt;/h2&gt;

&lt;p&gt;Searching for creators on Instagram is a bit of a dance, because &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; can scrape profiles but doesn't let you search Instagram directly. So we first ask GPT with web search to find relevant profile URLs and only then scrape and analyze those profiles.&lt;/p&gt;
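&lt;p&gt;That two-step dance can be sketched as a small pipeline. The helper parameters here are stand-ins: in the real system discovery is a GPT web-search call and scraping goes through Bright Data:&lt;/p&gt;

```typescript
// Sketch of the two-step Instagram flow: Bright Data can scrape but not
// search, so an LLM with web search finds candidate URLs first. The helper
// parameters are stand-ins for the real GPT / Bright Data / analysis calls.
async function searchInstagram(
  query: string,
  discoverUrls: Function, // GPT + web search: query -> profile URLs
  scrapeProfile: Function, // Bright Data: URL -> raw profile data
  analyzeProfile: Function // LLM: raw profile -> { url, score, reason }
) {
  // Step 1: ask the LLM with web search for candidate profile URLs.
  const urls: string[] = await discoverUrls(query);

  // Step 2: scrape and analyze each candidate.
  const results: any[] = [];
  for (const url of urls) {
    const raw = await scrapeProfile(url);
    results.push(await analyzeProfile(raw));
  }

  // Highest-scoring profiles first, matching the ranked list users see.
  results.sort((a, b) => b.score - a.score);
  return results;
}
```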


&lt;p&gt;For TikTok things are simpler because &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; actually supports searching there directly. So we skip the whole "ask GPT to find URLs" step and just tell Bright Data what to look for.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, How's It Going?
&lt;/h2&gt;

&lt;p&gt;Honestly? Search doesn't work perfectly yet. Some results are great, some are questionable and there are edge cases where the system does something a bit strange. That's expected when you're stitching together web discovery, scraping and LLM analysis into one pipeline. Right now we're working on making the results more relevant and making the whole thing cheaper to run, because discovering creators should not feel like lighting money on fire.&lt;/p&gt;

&lt;p&gt;But that's work for next week.&lt;/p&gt;

&lt;p&gt;For now, if you want to dig into the code, everything lives here: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you've made it all the way to the end and have thoughts, questions, or strong opinions about how this is built, feel free to share them. That's kind of the point of doing this in public.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 8. We Finally Deployed This Thing</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Thu, 08 Jan 2026 19:53:22 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-8-we-finally-deployed-this-thing-fb1</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-8-we-finally-deployed-this-thing-fb1</guid>
      <description>&lt;p&gt;Last week technically never happened.&lt;/p&gt;

&lt;p&gt;We didn’t skip a post, didn’t disappear into Christmas and New Year food comas and definitely didn’t spend a suspicious amount of time eating baked goods instead of shipping software. Let’s assume we simply compressed time and released everything at once.&lt;/p&gt;

&lt;p&gt;Because this week we finally deployed Wykra!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9c262m4cag0xbm08uxq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9c262m4cag0xbm08uxq.gif" width="410" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s deployed in exactly the state you’d expect at this stage. The UI is minimal, testing is uneven and some limits are deliberately strict. We spent more time than planned just getting everything wired together, but the system is now live, reachable and doing real work, which was the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s live now
&lt;/h3&gt;

&lt;p&gt;The web UI (also used by the Telegram mini app):&lt;br&gt;&lt;br&gt;
&lt;a href="https://app.wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;https://app.wykra.io/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Telegram Mini App:&lt;br&gt;&lt;br&gt;
&lt;a href="https://t.me/wykra_bot?ref=datobra.com" rel="noopener noreferrer"&gt;https://t.me/wykra_bot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Authentication in the web UI is done via GitHub, while the Telegram mini app uses Telegram’s Web App data validation flow as described in the official documentation: &lt;a href="https://core.telegram.org/bots/webapps?ref=datobra.com#validating-data-received-via-the-mini-app" rel="noopener noreferrer"&gt;https://core.telegram.org/bots/webapps#validating-data-received-via-the-mini-app&lt;/a&gt;. At the moment the API is protected by a fairly strict rate limiter, five requests per hour per API token, not because this is the ideal user experience, but because we want to observe how the system behaves under real usage, understand where the actual bottlenecks are and avoid discovering those limits the hard way.&lt;/p&gt;
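&lt;p&gt;For the curious, Telegram’s documented check boils down to an HMAC chain: build a sorted data-check string from the initData fields, derive a secret key from the bot token using the literal key "WebAppData" and compare digests. A simplified sketch, not our production code:&lt;/p&gt;

```typescript
import { createHmac } from "node:crypto";

// Sketch of Telegram's documented Mini App initData check: sort the fields,
// derive a secret key from the bot token keyed with "WebAppData", then
// compare HMAC digests. Simplified; production code should also check
// auth_date freshness and use a timing-safe comparison.
function validateInitData(initData: string, botToken: string): boolean {
  const params = new URLSearchParams(initData);
  const receivedHash = params.get("hash");
  if (receivedHash === null) return false;
  params.delete("hash");

  // key=value lines in alphabetical order, per the docs.
  const dataCheckString = Array.from(params.entries())
    .map((pair) => `${pair[0]}=${pair[1]}`)
    .sort()
    .join("\n");

  const secretKey = createHmac("sha256", "WebAppData").update(botToken).digest();
  const computed = createHmac("sha256", secretKey).update(dataCheckString).digest("hex");
  return computed === receivedHash;
}
```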

&lt;p&gt;There is also a known limitation on search at the moment, because of course there is. Full creator discovery is still being stabilized, so the UI currently exposes only profile analysis. After logging in you’ll land in a chat interface where you can ask for an analysis of a specific profile instead of running an open search. This keeps the surface area small while we validate the core analysis flow. This is an example of a profile analysis generated for a randomly chosen public account:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7hqhhwlrery2ksfdexw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7hqhhwlrery2ksfdexw.png" width="800" height="790"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point you’re probably wondering how we actually deployed all of this, so let’s talk about that.&lt;/p&gt;

&lt;h3&gt;
  
  
  How this started (spoiler: with a domain)
&lt;/h3&gt;

&lt;p&gt;The deployment push started in a very unglamorous way. After publishing the original challenge post&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/olgabraginskaya/wykra-web-you-know-real-time-analysis-20i3?ref=datobra.com"&gt;https://dev.to/olgabraginskaya/wykra-web-you-know-real-time-analysis-20i3&lt;/a&gt;&lt;br&gt;&lt;br&gt;
I bought a domain for the project &lt;a href="https://app.wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;wykra.io&lt;/a&gt;, for an almost embarrassing amount of money.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Railway and not AWS
&lt;/h3&gt;

&lt;p&gt;At first we obviously wanted to do this like adults - AWS, Hetzner, a serious setup and a lot of infrastructure feelings. Then we remembered what stage this project is actually at and that we mostly want it to be live now, not perfectly architected sometime later.&lt;/p&gt;

&lt;p&gt;So we went with something simpler and faster for the moment and chose a managed service like &lt;a href="https://railway.com/?ref=datobra.com" rel="noopener noreferrer"&gt;Railway&lt;/a&gt;. It let us deploy multiple services quickly and keep the focus on the product instead of turning infrastructure into another side project. I also genuinely enjoy how it automatically picks up repository changes and how clean the UI feels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment setup (the non-romantic version)
&lt;/h3&gt;

&lt;p&gt;After logging into Railway via GitHub and approving the Railway app for the repository, we added the following services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz10juk3z1vd8airr63k8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz10juk3z1vd8airr63k8.png" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, a Postgres volume. We use it as the main system of record: it stores users, chat history, profile analyses, search results and task state. Railway gives you both private networking for internal service-to-service communication and public networking for external TCP access if needed. We use private networking for everything inside the system.&lt;/p&gt;

&lt;p&gt;We also added a Redis instance, mainly for caching and a few short-lived things that shouldn’t live in Postgres.&lt;/p&gt;

&lt;p&gt;Then the core services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;wykra-api&lt;/strong&gt;
Built from the Dockerfile in the project root. All environment variables are configured directly in the Railway service, except database credentials, which are taken from the Postgres private networking configuration. This service is exposed via api.wykra.io.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wykra-web&lt;/strong&gt;
The React frontend used both for the web UI and the Telegram mini app. Built from the Dockerfile in /apps/web and exposed via app.wykra.io.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt;
Built from /apps/grafana and exposed via grafana.wykra.io.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;
Built from /monitoring/prometheus and used internally for metrics collection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grafana and Prometheus handle observability, we already wrote about why this matters and how we set it up in Week 5: &lt;a href="https://dev.to/olgabraginskaya/build-in-public-week-5-the-week-we-finally-measured-things-instead-of-just-hoping-for-the-best-2kok?ref=datobra.com"&gt;https://dev.to/olgabraginskaya/build-in-public-week-5-the-week-we-finally-measured-things-instead-of-just-hoping-for-the-best-2kok&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Domains, DNS and Things We Broke
&lt;/h3&gt;

&lt;p&gt;Railway supports custom domains per service and gives you a neat DNS setup with a CNAME pointing to a *.up.railway.app address. This works perfectly fine unless your domain registrar is GoDaddy, which, as you’ve probably guessed, is exactly our case.&lt;/p&gt;

&lt;p&gt;GoDaddy doesn’t support CNAME flattening or dynamic ALIAS records, so adding the record fails with a familiar “Record data is invalid” error. The recommended workaround (and the one we followed) is moving DNS management to Cloudflare. We switched the nameservers, added the domain there and configured the Railway CNAME records in Cloudflare instead. After that, everything became reachable.&lt;/p&gt;

&lt;p&gt;Except for email.&lt;/p&gt;

&lt;p&gt;We forgot to re-add Google Mail DNS records, which broke email for roughly three days, and the internal postmortem title involved DNS and poor life choices. I laughed for at least two hours, mostly at how we eventually figured out what the actual issue was, even if my brother didn’t find it nearly as funny.&lt;/p&gt;

&lt;p&gt;No further comments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rate limiting (on purpose)
&lt;/h3&gt;

&lt;p&gt;When you deploy something publicly that actively uses two paid APIs - &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; and &lt;a href="https://openrouter.ai/?ref=datobra.com" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; - and you do it for free, there is a very natural moment where you stop and think about how not to accidentally burn all your money in a weekend.&lt;/p&gt;

&lt;p&gt;That’s where rate limiting comes in.&lt;/p&gt;

&lt;p&gt;The API uses a token-based rate limiting system implemented with the NestJS Throttler module, where each incoming request is tracked per API token. The token provided in the Authorization header is hashed using SHA-256 and then used as the rate-limiting key, so all requests made with the same token are counted together. In its current configuration, the system allows up to five requests per token within a sixty-minute window, with counters stored in memory.&lt;/p&gt;

&lt;p&gt;Rate limiting is applied globally through a custom guard registered as an APP_GUARD, which means it affects all routes by default. Once the limit is exceeded, the API responds with a 429 error and a clear message explaining why the request was rejected. Public routes are excluded from rate limiting and authenticated routes can explicitly opt out when needed.&lt;/p&gt;
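&lt;p&gt;Stripped of the NestJS wiring, the core idea fits in a few lines: hash the bearer token so raw tokens never sit around as keys, then count requests per hash in a rolling window. A simplified sketch of the behaviour described above, not the actual guard:&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Simplified sketch of the rate limiting described above: SHA-256 of the
// bearer token as the key, five requests per rolling sixty-minute window,
// counters in memory. Not Wykra's actual Throttler guard.
const WINDOW_MS = 60 * 60 * 1000;
const LIMIT = 5;
const hits: { [key: string]: number[] } = {};

function allowRequest(token: string, now: number): boolean {
  const key = createHash("sha256").update(token).digest("hex");
  // Keep only timestamps still inside the window.
  const recent = (hits[key] ?? []).filter((t) => t > now - WINDOW_MS);
  if (recent.length >= LIMIT) {
    hits[key] = recent;
    return false; // caller responds with 429
  }
  recent.push(now);
  hits[key] = recent;
  return true;
}
```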

&lt;h3&gt;
  
  
  Trying the API via Postman
&lt;/h3&gt;

&lt;p&gt;For anyone who wants to poke the API directly, there is a Postman collection available here:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/wykra-io/wykra-api/tree/main/postman-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api/tree/main/postman-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import the postman-api folder into Postman.&lt;/li&gt;
&lt;li&gt;Create an environment variable apiUrl with the value &lt;a href="https://wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;https://wykra.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Generate a GitHub Personal Access Token from your GitHub settings.&lt;/li&gt;
&lt;li&gt;Call the /api/v1/auth/githubAuth endpoint with that token as a Bearer token.&lt;/li&gt;
&lt;li&gt;The response will contain a Wykra API token, which is subject to the five-requests-per-hour limit.&lt;/li&gt;
&lt;li&gt;Use that token as the Authorization Bearer token for all other API endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where this leaves us
&lt;/h3&gt;

&lt;p&gt;Wykra is now deployed, with a domain, a UI, an API, a Telegram mini app and basic metrics and monitoring wired in. None of it is perfect, and yes, the current frontend still makes me wince a little in a very “students built this as a lab assignment” kind of way, but well, what can you do - that’s real developer life and there will always be another week.&lt;/p&gt;

&lt;p&gt;If you want to support the project, starring the repo and following along helps more than you’d think:&lt;br&gt;&lt;br&gt;
Repo: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Website: &lt;a href="https://app.wykra.io/?ref=datobra.com" rel="noopener noreferrer"&gt;https://app.wykra.io/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/ohthatdatagirl?ref=datobra.com" rel="noopener noreferrer"&gt;https://x.com/ohthatdatagirl&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Blog: &lt;a href="https://www.datobra.com/" rel="noopener noreferrer"&gt;https://www.datobra.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 7. Shipping While Tired (or: Still Alive, Surprisingly)</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Thu, 25 Dec 2025 17:38:40 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-7-shipping-while-tired-or-still-alive-surprisingly-2k80</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-7-shipping-while-tired-or-still-alive-surprisingly-2k80</guid>
      <description>&lt;p&gt;This week was quiet.&lt;/p&gt;

&lt;p&gt;Not because nothing is happening, but because after six weeks of pushing, adding platforms, adding metrics, debugging queues and arguing with reality, we’re tired. The kind of tired where you stop dreaming about new features and start dreaming about “everything still works tomorrow”.&lt;/p&gt;

&lt;p&gt;And honestly that’s fine.&lt;/p&gt;

&lt;p&gt;Week 7 ended up being a regrouping week. Less invention, more wiring things into something that feels like a system. The main theme was: if Wykra is about finding creators and understanding what’s real vs fake, then we need to look beyond follower counts and start inspecting the messier layer: comments. So we added suspicious comment analysis for both Instagram and TikTok.&lt;/p&gt;

&lt;p&gt;The flow is the usual Wykra pattern: you submit a request, get a task id and the worker goes off to analyze a handful of recent posts/videos and comes back with structured results.&lt;/p&gt;
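&lt;p&gt;From the client side, that submit-then-poll pattern might look something like this. The endpoint path matches the Instagram example, but the &lt;code&gt;taskId&lt;/code&gt; field name, the &lt;code&gt;/tasks/:id&lt;/code&gt; response shape and the polling interval are all assumptions on my part:&lt;/p&gt;

```python
import json
import time
import urllib.request

BASE = "http://localhost:3011/api/v1"


def post_json(path, payload):
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_terminal(status):
    # A task is done once the worker reports either outcome.
    return status in ("completed", "failed")


def suspicious_comments(profile):
    # Submit the request; the API answers immediately with a task id.
    task = post_json("/instagram/profile/comments/suspicious", {"profile": profile})
    task_id = task["taskId"]  # field name is an assumption
    # Poll until the worker has analyzed recent posts and stored results.
    while True:
        with urllib.request.urlopen(f"{BASE}/tasks/{task_id}") as resp:
            state = json.load(resp)
        if is_terminal(state["status"]):
            return state
        time.sleep(5)
```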

&lt;h3&gt;
  
  
  Instagram: suspicious comments
&lt;/h3&gt;

&lt;p&gt;Request example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location 'http://localhost:3011/api/v1/instagram/profile/comments/suspicious' \
  --data '{ "profile": "annascooking_" }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case the analysis found a very “classic internet” situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a single user dumping a burst of random Cyrillic characters in under a minute (high confidence bot/spam)&lt;/li&gt;
&lt;li&gt;a smaller amount of generic emoji-only noise&lt;/li&gt;
&lt;li&gt;overall: mostly organic discussion, with one very obvious spam cluster
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"analysis": {
    "summary": "Analysis reveals several concerning patterns in the comment section. The most notable is a series of suspicious comments from user 'klepaskolia' who posted 14 consecutive comments containing seemingly random Cyrillic characters within a one-minute timespan, strongly indicating bot or automated spam activity. The majority of authentic comments appear to be in Polish and are food-related, making these outlier comments particularly conspicuous. Beyond the obvious spam cluster, the engagement patterns appear largely organic with normal food-related discussions and reactions. The comments show natural language variations and authentic interactions between users, with genuine questions about recipes and cooking techniques. While the spam incident is significant, it appears to be isolated to a single user and timeframe.",
    "suspiciousCount": 15,
    "suspiciousPercentage": 21.4,
    "riskLevel": "medium",
    "patterns": [
      {
        "type": "bot",
        "description": "Rapid-fire comments with random Cyrillic characters from single user",
        "examples": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        "severity": "high"
      },
      {
        "type": "spam",
        "description": "Generic emoji-only comments with no context",
        "examples": [20, 28, 37, 42],
        "severity": "low"
      }
    ],
    "suspiciousComments": [
      {
        "commentIndex": 2,
        "commentId": "17981987252931776",
        "reason": "Part of automated spam sequence with random characters",
        "riskScore": 9
      },
      {
        "commentIndex": 13,
        "commentId": "18114796114575395",
        "reason": "Longest spam comment in sequence, random character string",
        "riskScore": 9
      },
      {
        "commentIndex": 37,
        "commentId": "17886914691398195",
        "reason": "Generic emoji-only comment with no context",
        "riskScore": 3
      }
    ],
    "recommendations": "Implement rate limiting to prevent rapid-fire commenting from single users. Add automated detection for comments containing random character strings, particularly in non-native alphabets. Consider requiring minimum character counts for comments to reduce low-effort emoji-only spam. Monitor accounts that post multiple times within very short time windows."
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood this starts with scraping real Instagram comments using Bright Data’s Instagram &lt;a href="https://brightdata.com/products/web-scraper/instagram/comments?ref=datobra.com" rel="noopener noreferrer"&gt;comments scraper&lt;/a&gt;. We first fetch the profile, extract a few recent post URLs and then collect comments for those posts.&lt;/p&gt;

&lt;p&gt;Once we have the raw comments, we pass them to an LLM (Claude 3.5 Sonnet) with a fairly opinionated prompt. The goal isn’t to magically label comments as “fake”, but to ask very specific questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this look like spam?&lt;/li&gt;
&lt;li&gt;does this look automated?&lt;/li&gt;
&lt;li&gt;are there weird engagement patterns?&lt;/li&gt;
&lt;li&gt;are the same accounts behaving suspiciously across multiple comments?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is that this works the same way across platforms. We scrape real data, normalize it, then ask the model to reason about patterns rather than individual comments in isolation.&lt;/p&gt;
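&lt;p&gt;As a stripped-down sketch of that normalize-then-ask step (the real prompts in the repo are far more detailed, and the field names here are illustrative, not our actual schema):&lt;/p&gt;

```python
def normalize(raw_comments):
    """Reduce scraped comments to the fields the model needs, keeping an
    index so findings can point back at specific comments."""
    return [
        {
            "index": i,
            "user": c.get("user", ""),
            "text": c.get("text", ""),
            "likes": c.get("likes", 0),
        }
        for i, c in enumerate(raw_comments)
    ]


def build_prompt(comments):
    """A condensed stand-in for the real analysis prompt."""
    header = (
        "Analyze the following comments as a set, not one by one. "
        "Look for spam, automation, odd engagement patterns, and accounts "
        "behaving suspiciously across multiple comments. Return JSON with "
        "summary, riskLevel, patterns, and suspiciousComments."
    )
    body = "\n".join(f'{c["index"]}: {c["user"]}: {c["text"]}' for c in comments)
    return header + "\n\n" + body
```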

&lt;p&gt;If you’re curious about the actual prompts we’re using, they’re all in the &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  TikTok: suspicious comments
&lt;/h3&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location 'http://localhost:3011/api/v1/tiktok/profile/comments/suspicious' \
  --data '{ "profile": "ddlovato" }'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one was interesting in a different way. The comments were mostly real fan engagement, but the suspicious layer looked more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“free tickets / VIP access” patterns (scam-ish or at least engagement bait)&lt;/li&gt;
&lt;li&gt;generic low-effort bot-shaped comments&lt;/li&gt;
&lt;li&gt;some weird engagement distribution (generic comments getting unusually high likes)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"analysis": {
    "summary": "Analysis of the 150 comments reveals some concerning patterns of potential engagement manipulation and suspicious activity, though most comments appear authentic. The most notable suspicious pattern involves comments asking for free tickets/VIP access, which could be attempts to scam or manipulate engagement. There are also several generic, low-effort comments that show patterns consistent with bot activity. However, the majority of engagement appears organic, with fans expressing genuine enthusiasm about concerts, music, and personal connections to the artist. The high proportion of emoji usage and personalized responses suggests a largely authentic fan community. The presence of comments in multiple languages (English, Portuguese, Spanish) with consistent engagement patterns further supports genuine international fan interaction.",
    "suspiciousCount": 12,
    "suspiciousPercentage": 8,
    "riskLevel": "low",
    "patterns": [
      {
        "type": "ticket_scam",
        "description": "Multiple comments asking for free tickets/VIP access in similar patterns",
        "examples": [52, 77, 80],
        "severity": "medium"
      },
      {
        "type": "bot_activity",
        "description": "Very short, generic comments with minimal engagement",
        "examples": [22, 38, 18],
        "severity": "low"
      },
      {
        "type": "engagement_manipulation",
        "description": "Unusually high likes on generic comments compared to more substantive ones",
        "examples": [51, 52, 53],
        "severity": "medium"
      }
    ],
    "suspiciousComments": [
      {
        "commentIndex": 52,
        "commentId": "7584231130246038286",
        "reason": "Suspicious request for free VIP ticket with unusually high engagement",
        "riskScore": 7
      },
      {
        "commentIndex": 77,
        "commentId": "7584268771855434514",
        "reason": "Similar pattern of requesting free VIP ticket",
        "riskScore": 6
      },
      {
        "commentIndex": 22,
        "commentId": "7584891712965100296",
        "reason": "Extremely generic single-word comment with no context",
        "riskScore": 4
      }
    ],
    "recommendations": "Implement automated filtering for ticket request scams and monitor unusual engagement spikes on generic comments. Consider adding verification requirements for high-engagement comments and maintaining current emoji-friendly environment as it encourages authentic interaction. Monitor but don't restrict international language comments as they appear to represent genuine fan engagement."
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood this follows the same pattern as Instagram.&lt;/p&gt;

&lt;p&gt;We scrape real TikTok data using Bright Data’s &lt;a href="https://brightdata.com/products/web-scraper/tiktok?ref=datobra.com" rel="noopener noreferrer"&gt;TikTok datasets&lt;/a&gt;: first the profile, then a handful of recent videos and then comments for each video. Everything runs asynchronously through the same task + worker flow as the rest of Wykra.&lt;/p&gt;

&lt;p&gt;Once the comments are collected, we pass them to an LLM (Claude 3.5 Sonnet) with one important constraint: comments containing emojis are treated as normal, authentic engagement and explicitly excluded from suspicious analysis. That small rule turns out to matter a lot on TikTok.&lt;/p&gt;
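&lt;p&gt;In our pipeline that rule lives inside the prompt, but expressed as a pre-filter the same constraint would look something like this - the Unicode-category heuristic is a rough stand-in, not the exact rule we give the model:&lt;/p&gt;

```python
import unicodedata


def has_emoji(text):
    """Heuristic: any character in Unicode category 'So' (Symbol, other),
    which covers emoji, counts as an emoji."""
    return any(unicodedata.category(ch) == "So" for ch in text)


def split_for_analysis(comments):
    """Emoji comments are treated as normal, authentic engagement and
    skipped; only the rest go to the suspicious-comment analysis."""
    to_analyze = [c for c in comments if not has_emoji(c["text"])]
    presumed_ok = [c for c in comments if has_emoji(c["text"])]
    return to_analyze, presumed_ok
```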

&lt;p&gt;Instead of flagging half the comment section as “low quality”, the analysis focuses on patterns that actually look off: repeated scams, automation-shaped behavior or engagement that doesn’t match the content.&lt;/p&gt;

&lt;p&gt;Just like on Instagram the output is structured: a short summary, a risk level, detected patterns, concrete examples and recommendations. The goal isn’t to judge creators, but to separate real engagement from noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  TikTok: profile analysis
&lt;/h3&gt;

&lt;p&gt;We also now have TikTok profile analysis as a separate endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location 'http://localhost:3011/api/v1/tiktok/profile' \
--data '{ "profile": "ddlovato" }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the boring but necessary foundation: followers, likes, engagement rates, verification, top videos, posting patterns, plus a sanity-check style analysis (does this look authentic, what niche/topic, visible brands, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"analysis": {
    "summary": "This is clearly Demi Lovato's official TikTok profile, showing strong performance metrics with 8.2M followers and 146.7M total likes. The bio indicates active music promotion ('IT'S NOT THAT DEEP') and tour marketing, aligning with their status as a major recording artist.",
    "qualityScore": 5,
    "topic": "Music &amp;amp; Entertainment",
    "niche": "Pop Music Artist",
    "sponsoredFrequency": "low",
    "contentAuthenticity": "authentic",
    "followerAuthenticity": "likely real",
    "visibleBrands": [
      "Self-branded music content",
      "Tour promotions"
    ],
    "engagementStrength": "strong",
    "postsAnalysis": "Content focuses on music promotion, behind-the-scenes content, and personal moments. High likes relative to follower count indicate strong engagement.",
    "hashtagsStatistics": "Limited hashtag usage, typical of verified celebrity accounts relying on organic reach."
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;At this point we have enough building blocks: profile analysis, comment analysis, suspicious pattern detection, the same async task flow across platforms. Individually, they work. But taken one by one, they also make it easy to lose sight of &lt;em&gt;why&lt;/em&gt; we’re building this in the first place.&lt;/p&gt;

&lt;p&gt;That’s probably why this week felt so exhausting.&lt;/p&gt;

&lt;p&gt;When everything lives behind &lt;code&gt;localhost&lt;/code&gt;, the project starts to feel abstract. You’re improving parts, but you’re no longer seeing the whole. So we’re taking this as a signal that it’s time to deploy.&lt;/p&gt;

&lt;p&gt;Not because Wykra is “done”, but because we need to see it running as an actual system, in front of real users, and get feedback that isn’t coming from our own assumptions or test profiles.&lt;/p&gt;

&lt;p&gt;Still tired. Still going.&lt;/p&gt;

&lt;p&gt;If you want to support the project, ⭐️ the repo and follow me on X - it really helps.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/ohthatdatagirl?ref=datobra.com" rel="noopener noreferrer"&gt;https://x.com/ohthatdatagirl&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Blog: &lt;a href="https://www.datobra.com/" rel="noopener noreferrer"&gt;https://www.datobra.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 6. Trying to Add More Social Platforms</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Thu, 18 Dec 2025 15:05:20 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-6-trying-to-add-more-social-platforms-2ga0</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-6-trying-to-add-more-social-platforms-2ga0</guid>
      <description>&lt;p&gt;Last &lt;a href="https://dev.to/olgabraginskaya/build-in-public-week-5-the-week-we-finally-measured-things-instead-of-just-hoping-for-the-best-2kok?ref=datobra.com"&gt;week&lt;/a&gt; was about observability. We added metrics and dashboards so we could see what the system was actually doing instead of relying on intuition.&lt;/p&gt;

&lt;p&gt;So this week wasn’t about inventing something new from scratch. It was about answering a very practical question: can we extend the same idea to other social platforms without the whole system falling apart?&lt;/p&gt;

&lt;p&gt;Short answer: partially.&lt;/p&gt;

&lt;h2&gt;
  
  
  TikTok: Similar Problem, Different Shape
&lt;/h2&gt;

&lt;p&gt;Quick reminder: Wykra is built to answer a very human question: “can you find me creators like this?” You describe the influencer you need and we go look for them. We already do this for Instagram. This week we tried to reuse the same pattern for other platforms starting with TikTok.&lt;/p&gt;

&lt;p&gt;At a high level TikTok search follows the same story arc as Instagram: you send a free-text request like “Find up to 15 public TikTok creators from Portugal who post about baking or sourdough bread” and the API immediately hands you back a task id while the real work happens in the background. A worker picks up that task, turns your sentence into structured search parameters, runs a &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; dataset scraper, scores the discovered profiles with an LLM, filters out the useless ones, and finally stores everything so you can fetch the results from &lt;code&gt;/tasks/:id&lt;/code&gt; later.&lt;/p&gt;

&lt;p&gt;The first step is still “vibe in, JSON out”. We send the original query to an LLM and ask it to extract a small context object: what niche this is about, which location it mentions, a normalized country code, an optional target number of creators and a few short phrases that could go straight into the TikTok search box. If the model cannot even agree on a category, we stop there instead of pretending we know what to search for. Once the context is ready, we build up to three search terms, pick a country (either from the context or defaulting to US) and move on.&lt;/p&gt;

&lt;p&gt;This is where TikTok diverges from Instagram. For Instagram we have to use Perplexity to discover profiles first and only then enrich them. TikTok, thanks to having a proper keyword search in the dataset, lets us skip that extra step. For each search term we generate a TikTok search URL, trigger the &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; TikTok dataset with that URL and country, poll until the snapshot is ready, download the JSON and then merge and deduplicate all profiles by their profile URL. The whole thing can take a while, so it lives as a long-running async job inside the same generic Task system we already use elsewhere.&lt;/p&gt;
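&lt;p&gt;The merge-and-deduplicate step at the end is simple bookkeeping; a sketch, assuming a &lt;code&gt;url&lt;/code&gt; field on each profile (the real field name may differ):&lt;/p&gt;

```python
def merge_profiles(snapshots):
    """Merge profiles from several per-search-term snapshots,
    deduplicating by profile URL (first occurrence wins)."""
    seen = {}
    for snapshot in snapshots:
        for profile in snapshot:
            url = profile.get("url")
            if url and url not in seen:
                seen[url] = profile
    return list(seen.values())
```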

&lt;p&gt;Once we have the raw profiles the LLM comes back in. For each profile we extract the basics (handle, profile URL, follower count, privacy, bio), send it together with the original query to the model and ask for a short summary, a quality score from 1 to 5 and a relevance percentage. Anything below 70 percent relevance is dropped; everything above is saved with its summary and score and linked to the task. The platform is different, but the pattern stays the same: structured context in the front, Bright Data in the middle and LLM scoring on the way out.&lt;/p&gt;
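&lt;p&gt;The relevance cutoff itself is one line; here it is as a sketch, with an illustrative field name rather than our real schema:&lt;/p&gt;

```python
MIN_RELEVANCE = 70  # percent; anything below this is dropped


def keep_relevant(scored_profiles):
    """Keep only profiles the LLM rated at or above the relevance cutoff."""
    return [
        p for p in scored_profiles
        if p.get("relevancePercentage", 0) >= MIN_RELEVANCE
    ]
```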

&lt;p&gt;An example of a real request and its response is embedded as a gist in the original post.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tasks, metrics and a bit of discipline
&lt;/h2&gt;

&lt;p&gt;All of this runs as a long-running background job attached to a single task id.&lt;/p&gt;

&lt;p&gt;The task goes through a simple lifecycle:&lt;/p&gt;

&lt;p&gt;pending → running → completed or failed&lt;/p&gt;

&lt;p&gt;We store the task record and all TikTok profiles linked to it. When you fetch &lt;code&gt;/tasks/:id&lt;/code&gt;, you see both the raw task status and the list of analyzed profiles. This turned out to be surprisingly helpful for debugging: if TikTok is empty but the task is completed, the problem is probably on the crawling or analysis side, not the queue.&lt;/p&gt;
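&lt;p&gt;The lifecycle is simple enough to fit in a few lines; this is a sketch of the guard, not our actual implementation:&lt;/p&gt;

```python
# Allowed transitions in the pending -> running -> completed/failed lifecycle.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def advance(task, new_status):
    """Move a task to a new status, rejecting illegal jumps such as
    reopening a task that has already finished."""
    if new_status not in TRANSITIONS[task["status"]]:
        raise ValueError(f'cannot go from {task["status"]} to {new_status}')
    task["status"] = new_status
    return task
```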

&lt;p&gt;Because we added observability last week, almost every step is also wrapped in metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how many TikTok search tasks are created, completed, or failed,&lt;/li&gt;
&lt;li&gt;how long they sit in the queue,&lt;/li&gt;
&lt;li&gt;how long &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; calls take and how often they error out,&lt;/li&gt;
&lt;li&gt;how many LLM calls we make and how expensive they are.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  YouTube: the half-hour spinner of doom
&lt;/h2&gt;

&lt;p&gt;TikTok was the success story this week. YouTube was the reminder that not everything is ready to be wired into Wykra, no matter how clean the architecture looks on paper.&lt;/p&gt;

&lt;p&gt;We tried plugging in the YouTube dataset with a very gentle test:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{ "url": "https://www.youtube.com/results?search_query=sourdough+bread+new+york+", "country": "US", "transcription_language": ""}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In theory, this should behave a lot like TikTok: trigger crawl, wait, download JSON, move on with life. In practice, after ~30 minutes of spinning, the only thing we got back was:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{ "error": "Crawler error: Unexpected token '(', \"(function \"... is not valid JSON", "error_code": "crawl_error"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So for now YouTube isn’t really plugged into Wykra at all: the dataset just spins, throws a crawler JSON error, and gives us nothing useful to store or analyze. We’ve opened a ticket with Bright Data and postponed YouTube until that’s sorted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Threads: parameter present, logic absent
&lt;/h2&gt;

&lt;p&gt;Threads got its own attempt too. The plan was simple: run a basic keyword-based discovery, something like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{ "keyword": "technology"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Instead of profiles, we got back:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{ "error": "Parse error: Cannot read properties of null (reading 'require')", "error_code": "parse_error"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So the keyword parameter exists, the dataset exists, but the bit in the middle that’s supposed to connect them clearly doesn’t. For now we’re treating Threads the same way as YouTube: noted the issue and moved “proper Threads support” into the later bucket.&lt;/p&gt;

&lt;h2&gt;
  
  
  LinkedIn: same old story
&lt;/h2&gt;

&lt;p&gt;LinkedIn has a similar limitation to Instagram: there is no nice keyword search for “find me people who talk about X from country Y”. You can load profiles and pages, but not in the way Wykra needs.&lt;/p&gt;

&lt;p&gt;The conclusion for now is the same as with Instagram: if we want proper keyword-driven discovery, we’ll probably have to plug in a Perplexity/LLM-style search layer on top of LinkedIn as well, not just rely on the dataset.&lt;/p&gt;

&lt;p&gt;That’s a problem for another week, but at least now it’s a clearly defined problem, not a vague feeling that “LinkedIn is weird”.&lt;/p&gt;




&lt;p&gt;Week 6 was mostly about testing how far the existing pattern stretches across new platforms and where it breaks. TikTok more or less behaves, YouTube and Threads don’t, and LinkedIn clearly needs its own search layer on top of the dataset. For now that’s enough — better a couple of flows that work than five half-broken ones.&lt;/p&gt;

&lt;p&gt;If you want to support the project, ⭐️ the repo and follow me on X - it really helps.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Twitter/X: &lt;a href="https://x.com/ohthatdatagirl?ref=datobra.com" rel="noopener noreferrer"&gt;https://x.com/ohthatdatagirl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 5. The Week We Finally Measured Things Instead of Just Hoping for the Best</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Mon, 08 Dec 2025 16:41:03 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-5-the-week-we-finally-measured-things-instead-of-just-hoping-for-the-best-2kok</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-5-the-week-we-finally-measured-things-instead-of-just-hoping-for-the-best-2kok</guid>
      <description>&lt;p&gt;Last week I ended with a dramatic cliffhanger: &lt;em&gt;“We need metrics!”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And then immediately regretted it, because “metrics” is one of those words that sounds simple until you realize you now have to measure things consistently and over time. And then look at the numbers even when you don’t like them.&lt;/p&gt;

&lt;p&gt;But here we are, and Wykra officially has observability. We went with the classic open-source trio - Prometheus, Alertmanager, Grafana - the monitoring starter pack everyone eventually ends up with when the fun part is over and your project starts behaving like something people might actually rely on.&lt;/p&gt;

&lt;p&gt;Before we get into that a quick reminder for anyone who already lost the plot: Wykra is our AI agent that discovers and analyses influencers using &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; scraping and a couple of LLMs stitched together into one workflow. That’s the thing we’ve been building week by week, sometimes actually making progress, sometimes just banging our heads against the wall, but still moving forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Actually Added This Week
&lt;/h2&gt;

&lt;p&gt;If you want the full setup: how to run everything, where the dashboards live, how to scrape the metrics, that’s all in the &lt;a href="https://github.com/wykra-io/wykra-api/tree/dev?tab=readme-ov-file#monitoring" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Here I just want to show the main things we can finally measure and explain what’s actually doing the measuring. We use three tools, each with a very clear job:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/?ref=datobra.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/a&gt; - our metrics store and query engine.&lt;br&gt;&lt;br&gt;
It fetches data from our API every few seconds and keeps track of all our counters and timings, so we can see how things change over time. This is essentially where all our HTTP, task and system metrics end up, and where we read them from when we want to understand what’s going on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/docs/alerting/latest/alertmanager/?ref=datobra.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Alertmanager&lt;/strong&gt;&lt;/a&gt; - the routing and notification layer.&lt;br&gt;&lt;br&gt;
Prometheus checks the alert rules, and when something crosses a threshold, Alertmanager sends the notification - Slack, email, webhooks, whatever we set up. It also groups and filters alerts so we don’t get spammed every time the system twitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://grafana.com/?ref=datobra.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/a&gt; - the visualization layer.&lt;br&gt;&lt;br&gt;
It sits on top of Prometheus and turns raw time-series data into dashboards we can monitor in real time. It’s where we track request rates, latency, task behaviour and system load without reading query output directly.&lt;/p&gt;

&lt;p&gt;Together they cover everything we need for basic observability: Prometheus collects the data, Alertmanager sends the alerts and Grafana shows what’s happening.&lt;/p&gt;
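&lt;p&gt;For a feel of the Prometheus side, the scraping boils down to a small scrape job like this - the job name, target and interval here are illustrative; the real config lives in the repo:&lt;/p&gt;

```yaml
# Illustrative Prometheus scrape job: pull /metrics from the API container
# every few seconds. Target and interval are guesses, not the repo's values.
scrape_configs:
  - job_name: wykra-api
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ["api:3011"]
```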

&lt;h2&gt;
  
  
  The Core Metrics We Focus on
&lt;/h2&gt;

&lt;p&gt;Even though Wykra isn’t handling real traffic yet and everything still runs inside Docker on our machines, having metrics already makes a huge difference. It lets us see how the system behaves under our own tests, load simulations and all the strange edge cases we manage to generate while building this thing.&lt;/p&gt;

&lt;p&gt;There are plenty of metrics in Prometheus (the README has the full list), but the ones that actually help us understand what’s going on right now fall into 4 groups.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. HTTP Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These show how the API responds under our local runs: request rates, error rates and response times across all routes.&lt;br&gt;&lt;br&gt;
It’s an easy way to catch regressions, for example, when one change suddenly turns a fast endpoint into something that looks like it’s running through a VPN in Antarctica.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. System Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The basics: CPU, memory, process usage.&lt;br&gt;&lt;br&gt;
Even in Docker these tell useful stories such as sudden memory spikes, noisy CPU neighbours, inefficient code paths. When latency jumps, this is often where the explanation starts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Task Pipeline Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the part of Wykra that actually moves work through the system.&lt;br&gt;&lt;br&gt;
We track how many tasks we create during testing, how many complete or fail, how long they take and how the queue grows or drains over time. These metrics show whether the pipeline is behaving normally or slowly drifting into a backlog spiral.&lt;/p&gt;

&lt;p&gt;We also collect latency distributions for specific task types (like Instagram search) to catch tail slowdowns that averages tend to hide.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. External Service Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since the system relies heavily on external APIs we monitor them separately. They degrade differently from our own code and cause issues that look similar on the surface but require a different fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bright Data metrics&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Success rates, response times and error spikes for every Bright Data call.&lt;br&gt;&lt;br&gt;
This helps us see whether an issue comes from our code or from a day when the scraper ecosystem simply isn’t cooperating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM call and token metrics&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We also track how the LLMs behave under our test runs. The metrics cover call frequency, latency, token usage and error patterns, basically everything that tends to drift over time.&lt;/p&gt;

&lt;p&gt;We record how many LLM calls we make, how long each one takes, how many prompt and completion tokens the model consumes and how that translates into total token usage per request. Errors are tracked separately so we can see when the model slows down, times out or starts returning bad responses.&lt;/p&gt;
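&lt;p&gt;Conceptually, the token accounting is just a set of per-model counters; a minimal sketch of what we accumulate (the counter names are illustrative, and the &lt;code&gt;usage&lt;/code&gt; dict mirrors the usage object most LLM APIs return alongside a response):&lt;/p&gt;

```python
from collections import defaultdict


class TokenLedger:
    """Accumulate per-model call counts and token usage, mirroring the
    counters exported to Prometheus."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.prompt_tokens = defaultdict(int)
        self.completion_tokens = defaultdict(int)

    def record(self, model, usage):
        # 'usage' is the usage block returned with each LLM response.
        self.calls[model] += 1
        self.prompt_tokens[model] += usage["prompt_tokens"]
        self.completion_tokens[model] += usage["completion_tokens"]

    def total_tokens(self, model):
        return self.prompt_tokens[model] + self.completion_tokens[model]
```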

&lt;h2&gt;
  
  
  Dashboards
&lt;/h2&gt;

&lt;p&gt;I’m not going to paste the full Grafana board here (nobody needs 19 screenshots in a blog post) but here are a few core panels that demonstrate how the system behaves during our test runs.&lt;/p&gt;

&lt;p&gt;The following panel shows the call rate for the two LLMs during our local test runs. Claude (green) peaks higher because it handles the heavier analysis steps, while Perplexity (yellow) stays lower and more steady. The small drops simply reflect pauses between test batches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5fs1on2xip1k8znhn2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5fs1on2xip1k8znhn2h.png" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart below shows how token usage changes during our test runs. Claude’s prompt and completion tokens (green and blue) spike during the heavier analysis steps, which is why the total line (red) climbs sharply. Perplexity stays much lower: its queries are simpler and produce shorter responses. When the test batch ends, all token rates drop back to near zero until the next run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e5lb0c5n274lavrkst5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e5lb0c5n274lavrkst5.png" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also look at the rate of &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; calls during our test runs. The spikes correspond to batches where we’re pulling Instagram profile data, and the flat sections reflect pauses between those batches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv62yl2hzbolgsdqlvwd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv62yl2hzbolgsdqlvwd6.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The panel below lists all alert rules we’ve configured - errors, slow responses, resource spikes, LLM issues, database problems, and queue backlogs. Everything is green here because we’re only running test loads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c75kzrosal7h6dgqel5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c75kzrosal7h6dgqel5.png" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then we have a dashboard that shows the basics: CPU usage rising during a test run, Instagram search tasks being created and completed at a steady rate and no failures during this window. This simple view is enough to confirm that the pipeline behaves as expected under local load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3fkgj13jwk802ljzo20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3fkgj13jwk802ljzo20.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The most useful thing we built this week is the ability to see what the system is actually doing. Instead of assuming everything works because a test passed once on my laptop, we now have real visibility: metrics, alerts, dashboards.&lt;/p&gt;

&lt;p&gt;And now we can start expanding again: adding more social platforms, trying different search strategies, breaking things on purpose, because at least we’ll know how the system behaved before the change and whether the new idea made anything better or worse.&lt;/p&gt;

&lt;p&gt;If you want to support the project, ⭐️ the repo and follow me on X; it really helps.&lt;br&gt;&lt;br&gt;
Repo: &lt;a href="https://github.com/wykra-io/wykra-api" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/ohthatdatagirl" rel="noopener noreferrer"&gt;https://x.com/ohthatdatagirl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 4. The Messy Middle Of Building An AI Agent</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Mon, 01 Dec 2025 18:15:03 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-4-the-messy-middle-of-building-an-ai-agent-2p55</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-4-the-messy-middle-of-building-an-ai-agent-2p55</guid>
      <description>&lt;p&gt;This was supposed to be the week of a polished demo video and a clean “here’s how you use our API” walkthrough. Instead, it turned into the week of staring at half-finished pieces, poking at logs and wondering why we voluntarily chose Instagram as our first supported social network.&lt;/p&gt;

&lt;p&gt;If you remember last week’s &lt;a href="https://dev.to/olgabraginskaya/build-in-public-week-3-first-survive-discovery-then-enjoy-analysis-29kb?ref=datobra.com"&gt;post&lt;/a&gt;, I was feeling pretty optimistic about discovery methods and how you can mix different approaches. In theory it &lt;em&gt;does&lt;/em&gt; work. In practice it works just enough to keep you going, but not nearly as well as you hoped when you first mapped it out and convinced yourself you’d cracked influencer search forever.&lt;/p&gt;

&lt;p&gt;And this is exactly the part where motivation gets weird. Weekly posts sound great until you realize each week expects something polished, while the actual project is still a pile of experiments, half-successes and “why did the model hallucinate a bakery that literally does not exist?” moments. The LLMs get confused, APIs throw attitude and life is life. It’s surprisingly hard to keep shipping when the thing you’re building is technically working but also kind of fighting you at every step.&lt;/p&gt;

&lt;p&gt;Another factor that complicates this stage is the dynamic of working on a side project as a partnership (even if it's your own brother). There isn’t a built-in structure around you, so the pace and direction depend entirely on the two of you. One week you’re perfectly aligned and the next you suddenly realize you’ve been solving slightly different problems or moving at different speeds. It’s a very different rhythm from a regular job, where roles and expectations are already defined. Here you have the freedom to shape everything yourselves, which also means you have to constantly realign even when both of you are tired or distracted.&lt;/p&gt;

&lt;p&gt;But even in a week like this, things did move forward. Before getting into this week’s progress, it’s worth revisiting the idea that Wykra needs two distinct ways of handling creators. One mode is all about speed: give people a quick shortlist that matches their brief well enough to start browsing. The other focuses on depth: when someone finds a creator they care about, the system should be able to switch gears and produce a much richer, slower, more detailed analysis based on the full dataset.&lt;/p&gt;

&lt;p&gt;With that in mind most of this week went into shaping the “quick” part - the end-to-end search flow. The agent now has a clear path from a natural-language request to a structured result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A user sends a brief.&lt;/strong&gt;
Something human, messy and vague: a location, a niche, a follower range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system turns it into a task.&lt;/strong&gt;
The request gets dropped into a background queue so it can run independently of the interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A worker picks it up and interprets the brief.&lt;/strong&gt;
It sends the raw text to a context-extraction model using &lt;code&gt;anthropic/claude-3.5-sonnet&lt;/code&gt;, which pulls out the useful bits - niche, geography, audience size - and turns them into structured signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The worker runs a first discovery pass.&lt;/strong&gt;
A strict prompt goes to &lt;code&gt;perplexity/sonar-pro-search&lt;/code&gt;, asking only for real, verifiable Instagram profiles found through trustworthy external sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system fetches profile data for the first pass.&lt;/strong&gt;
Instagram URLs from this strict pass get sent to &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; to pull actual profile snapshots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system checks the results.&lt;/strong&gt;
If there are too few valid accounts after this first pass, the worker switches to a broader fallback search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The worker runs a second discovery pass.&lt;/strong&gt;
A more flexible prompt scans a wider part of the open web - websites, Linktree/Beacons, cross-linked socials, press mentions - still keeping only URLs tied to real profiles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system fetches profile data for any new profiles.&lt;/strong&gt;
Only Instagram URLs that didn’t appear in the first pass are sent to &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; to collect additional profile snapshots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It merges and cleans the results.&lt;/strong&gt;
Duplicates are removed, broken links disappear and only valid accounts make it through.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A short analysis is generated.&lt;/strong&gt;
&lt;code&gt;anthropic/claude-3.5-sonnet&lt;/code&gt; produces compact summaries and basic engagement signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The completed task is returned.&lt;/strong&gt;
The end result: a processed, ranked set of creators.&lt;/li&gt;
&lt;/ol&gt;
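&lt;p&gt;The strict-pass / fallback / merge logic in steps 4-9 can be sketched roughly like this - a hypothetical Python sketch, not the actual TypeScript processor; &lt;code&gt;strict_search&lt;/code&gt; and &lt;code&gt;broad_search&lt;/code&gt; stand in for the two Perplexity prompts:&lt;/p&gt;

```python
# Illustrative sketch of the two-pass discovery flow; the real
# implementation lives in instagram.processor.ts.

def merge_profiles(first_pass, second_pass):
    """Merge the two discovery passes, dropping duplicates while keeping order."""
    seen, merged = set(), []
    for url in first_pass + second_pass:
        url = url.rstrip("/")  # normalize trailing slashes before dedup
        if url and url not in seen:
            seen.add(url)
            merged.append(url)
    return merged

def discover(brief, strict_search, broad_search, min_results=5):
    """Run the strict pass first; fall back to the broader pass if too few hits."""
    first = strict_search(brief)
    if len(first) >= min_results:
        return first
    # The second pass only fetches URLs that were not already found.
    second = [u for u in broad_search(brief) if u not in set(first)]
    return merge_profiles(first, second)
```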

&lt;p&gt;If I had to explain this quickly, I’d probably just draw it like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyjjjvapsa71580596ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyjjjvapsa71580596ow.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can take a look at the actual processor code here: &lt;a href="https://github.com/wykra-io/wykra-api/blob/main/src/instagram/instagram.processor.ts?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api/blob/main/src/instagram/instagram.processor.ts&lt;/a&gt; and if you follow the README (&lt;a href="https://github.com/wykra-io/wykra-api/blob/main/README.md?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api/blob/main/README.md&lt;/a&gt;) you can even run the whole thing yourself. But if that sounds like too much effort, don’t worry, I’ll just show you the videos.&lt;/p&gt;

&lt;p&gt;First, here’s a quick walkthrough of how to spin up the project and get everything running locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F539576e5-e9bb-4430-9ea3-66851cefdf00" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F539576e5-e9bb-4430-9ea3-66851cefdf00" alt="Screen Recording 2025-12-01 at 16 02 01 (online-video-cutter com)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we trigger a search task by sending a &lt;code&gt;curl&lt;/code&gt; request to &lt;code&gt;/api/v1/instagram/search&lt;/code&gt;, grab the task ID from the response, and then check its status with another &lt;code&gt;curl&lt;/code&gt; to &lt;code&gt;/api/v1/tasks/{your_id}&lt;/code&gt;. The video below shows exactly what that looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location 'http://localhost:3011/api/v1/instagram/search' \
--data '{"query": "Find up to 15 public Instagram accounts from Portugal who post about cooking and have not more than 50000 followers."}'


curl --location 'http://localhost:3011/api/v1/tasks/{your_id}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F5cb06b09-b9c5-45e4-be09-d0df9b278ec8" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F5cb06b09-b9c5-45e4-be09-d0df9b278ec8" alt="Screen Recording 2025-12-01 at 16 24 13(1) (online-video-cutter com)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The video shows exactly what this flow returns and I’ve copied the response below as well just so you can see that it does, in fact, return something reasonably shaped.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;





&lt;p&gt;So that’s our first flow. Now let’s take a look at how the analysis flow works. Here’s the &lt;code&gt;curl&lt;/code&gt; I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location 'http://localhost:3011/api/v1/instagram/analysis?profile=baker_miss_by_carol'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The video below shows the call in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F310c34fb-b58e-4b57-9f20-ed4d170632f2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F310c34fb-b58e-4b57-9f20-ed4d170632f2" alt="Screen Recording 2025-12-01 at 17 15 36 (1)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here’s the response we got back.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;





&lt;p&gt;As you can see, this isn’t exactly brain surgery, but it’s also far from perfect. There’s a lot left to improve: adding Google SERP is high on the list and I’ve been reading about exa.ai as well. I’m also considering a fallback where, if the prompt language doesn’t return anything useful, we automatically switch to the local language. Not every Lisbon pizza blogger writes in English, so it makes sense to ask in Portuguese when English comes up empty.&lt;/p&gt;

&lt;p&gt;Overall, I can see us drifting into the testing phase (or the panic phase), which means it’s time to think about observability. We need proper logging for what we send and what we get back, and we definitely need some regression tests. Otherwise it’s impossible to tell whether adding Google SERP will actually help or quietly make everything worse. In short: the moment has arrived.&lt;/p&gt;

&lt;p&gt;We need metrics!&lt;/p&gt;

&lt;p&gt;If you want to support the project, feel free to ⭐️ the repo and follow me on X — it genuinely helps and keeps me motivated to keep building.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;br&gt;&lt;br&gt;
X: &lt;a href="https://x.com/ohthatdatagirl?ref=datobra.com" rel="noopener noreferrer"&gt;https://x.com/ohthatdatagirl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 3. First Survive Discovery, Then Enjoy Analysis</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Mon, 24 Nov 2025 06:03:12 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-3-first-survive-discovery-then-enjoy-analysis-29kb</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-3-first-survive-discovery-then-enjoy-analysis-29kb</guid>
      <description>&lt;p&gt;Last week I noticed something annoying: the engagement on my Week 1 and Week 2 posts dropped, even though the content was objectively good. So I asked Perplexity when developers actually read dev.to and the answer was basically: please stop posting on Saturdays. No one is there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pqyzz3bmnjul0eh9jja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pqyzz3bmnjul0eh9jja.png" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there, Wykra updates move to Monday morning. Let's see if the stats agree.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, I Need Actual People
&lt;/h2&gt;

&lt;p&gt;This week is about taking Wykra from we can find influencers to we can filter them and analyze them in depth. In the previous &lt;a href="https://dev.to/olgabraginskaya/build-in-public-week-2-how-do-people-even-find-influencers-40dn?ref=datobra.com"&gt;post&lt;/a&gt; I explored several ways of discovering influencers and for this week I want to combine a couple of those methods rather than rely on just one. The plan is to mix a targeted Google query through the &lt;a href="https://get.brightdata.com/kgkd75c54gl7?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; SERP dataset with a Perplexity prompt through OpenRouter (or Bright Data) and see whether using them together leads to a more consistent shortlist. Google will be my starting point, but I already noticed that the SERP dataset often responds with &lt;code&gt;"error": "Recaptcha appears", "error_code": "blocked"&lt;/code&gt; which makes it clear that having more than one discovery path isn’t just a nice-to-have, it’s self-defense. Google AI Mode also didn’t behave much better: the crawler kept returning &lt;code&gt;"error": "Crawler error: waiting for selector \"#aim-chrome-initial-inline-async-container\" failed: timeout 30000ms exceeded"&lt;/code&gt;, &lt;code&gt;"error_code": "wait_element_timeout"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I spent a while thinking about who I should search for as an example this week and since I’m currently deep in a sourdough phase, it felt natural to look for people who actually bake sourdough themselves. I wanted actual home bakers, people posting their starter progress, fermentation attempts and sometimes failed loaves. New York seemed like the perfect testing ground, so that became the theme for this round of discovery. The Google query I used:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;site:instagram.com ("sourdough" OR "sourdough bread" OR "starter") ("NYC" OR "New York" OR "Brooklyn" OR "Manhattan" OR "Queens" OR "Bronx") ("bio" OR "profile" OR "baker") -restaurant -shop -bakery -menu -delivery&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I also set &lt;code&gt;"language": "en", "country": "PT", "start_page": 1, "end_page": 2&lt;/code&gt; to limit the results, but Google still returned a huge JSON. So I only took the first ten Instagram links it surfaced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DO6K4Pwjf4H/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DO6K4Pwjf4H/&lt;/a&gt;
Sourdough starter success video — making sourdough bread from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DRHqgN6Daec/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DRHqgN6Daec/&lt;/a&gt;
Day-12 sourdough starter update; NYC baker documenting the feeding process.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DRAmjx3kbR-/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DRAmjx3kbR-/&lt;/a&gt;
Starting a new sourdough chapter in NYC — another early-stage starter reel.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DQKmxHyCY55/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DQKmxHyCY55/&lt;/a&gt;
Day-9 sourdough starter update; fermentation, early growth and “Novi” progress.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/emscakesntreats/reel/DRNDnk0jbrx/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/emscakesntreats/reel/DRNDnk0jbrx/&lt;/a&gt;
Growing a sourdough starter and feeding “Doby” on day 14; home baker content.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/bigdoughenergy/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/bigdoughenergy/&lt;/a&gt;
Profile of an NYC home baker and bread artist sharing sourdough loaves and recipes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DQVAa6pjQ64/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DQVAa6pjQ64/&lt;/a&gt;
Another sourdough “Novi” update — starter progress over days.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DQXfFbYCcka/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DQXfFbYCcka/&lt;/a&gt;
Day-14 sourdough starter update; patience and fermentation notes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/reel/DQuKbYuE0YY/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/reel/DQuKbYuE0YY/&lt;/a&gt;
New York–style multigrain sourdough bagel being boiled then baked.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/p/DQzLEQQDhBL/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/p/DQzLEQQDhBL/&lt;/a&gt;
Olive sourdough inclusion loaf; standard sourdough-baking reel with a finished bread photo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see Google mostly returned individual posts and reels rather than profile pages. I think that’s normal for Instagram SERP results, since Google indexes post URLs much more consistently than profiles. I extracted the profile handles from those post URLs. Google’s results shift every time, so whether you get anything useful is basically luck.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/biancafrombrooklyn?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/biancafrombrooklyn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/emscakesntreats?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/emscakesntreats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/aya_eats?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/aya_eats&lt;/a&gt;_&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/emscakesntreats?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/emscakesntreats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/emscakesntreats?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/emscakesntreats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/bigdoughenergy/?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/bigdoughenergy/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/emscakesntreats?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/emscakesntreats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/sorteddelightsby_lini?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/sorteddelightsby_lini&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/breadology101?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/breadology101&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is fine, but definitely not great. Yes, there’s some actual baking in there, but the list is full of repeats (though I took only the first 10 results); the same creator keeps resurfacing again and again. And this is still a relatively forgiving query: when I tried the same workflow for pizza bakers in Lisbon, Google basically returned nothing at all. Technically there was one result, but it turned out to be a pizza equipment shop, not a creator.&lt;/p&gt;
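&lt;p&gt;The handle-extraction step can be sketched like this - note that bare &lt;code&gt;/reel/&lt;/code&gt; and &lt;code&gt;/p/&lt;/code&gt; URLs carry no username in the path, so those need an extra fetch before the handle can be recovered (a hypothetical sketch, not the exact code I ran):&lt;/p&gt;

```python
from urllib.parse import urlparse

# Path segments that are content types rather than usernames.
NON_PROFILE_SEGMENTS = {"reel", "reels", "p", "tv", "stories", "explore"}

def handle_from_url(url):
    """Return the profile handle from an Instagram URL, or None when the URL
    only contains a shortcode (e.g. /reel/<code>/ with no username)."""
    parts = [s for s in urlparse(url).path.split("/") if s]
    if parts and parts[0].lower() not in NON_PROFILE_SEGMENTS:
        return parts[0]
    return None
```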

&lt;p&gt;The Perplexity prompt follows the same idea:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This is what I got:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/theclevercarrot?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/theclevercarrot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/BrooklynSourdough?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/BrooklynSourdough&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/riseandloaf_sourdoughco?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/riseandloaf_sourdoughco&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/BlondieandRye?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/BlondieandRye&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/Maurizio?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/Maurizio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/october_farms?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/october_farms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/the.sourdough.baker?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/the.sourdough.baker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/bookroad.sourdough.co?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/bookroad.sourdough.co&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/giasbatch?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/giasbatch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/amybakesbread?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/amybakesbread&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran the code several times and the model returned a different set of accounts each time, so there’s no stable or repeatable result here either, but at least it consistently returns profile URLs right away, which already puts it ahead of Google.&lt;/p&gt;

&lt;p&gt;Then I decided to try a different approach, the one that occurred to me earlier but I only now got around to testing. I started by identifying hashtags and only then moved on to the posts.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The first call returned a set of NYC-specific sourdough hashtags: &lt;strong&gt;#nycsourdough, #sourdoughnyc, #artisanbreadnyc, #nycbakers, #brooklynsourdough, #manhattansourdough, #nycbread, #sourdoughcommunitynyc, #breadstagramnyc, #sourdoughnewyork&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Then I passed these into the second prompt, keeping strict rules: only real Instagram profiles, no brands, no bakeries, no invented handles.&lt;/p&gt;
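&lt;p&gt;The two-step chain is easy to sketch - here &lt;code&gt;llm&lt;/code&gt; is any callable wrapping the OpenRouter request, and the prompts are simplified stand-ins for the real ones:&lt;/p&gt;

```python
def two_step_discovery(llm, niche, city):
    """First ask the model for niche+city hashtags, then feed those hashtags
    into a second, stricter prompt that returns profile URLs only.
    `llm` is any callable that takes a prompt string and returns text."""
    hashtags = llm(
        f"List the 10 most used Instagram hashtags for {niche} creators in {city}. "
        "Return hashtags only, one per line."
    ).split()
    return llm(
        "Find real, public Instagram profiles that actively post under these hashtags: "
        + ", ".join(hashtags)
        + ". Return only profile URLs that verifiably exist - no brands, "
        "no bakeries, no invented handles."
    )
```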

&lt;p&gt;The final list I got was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/brooklynsourdough?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/brooklynsourdough&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/artisanbryan?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/artisanbryan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/thebreadahead?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/thebreadahead&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/nyc.breadgirl?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/nyc.breadgirl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/oliver_the_baker?ref=datobra.com" rel="noopener noreferrer"&gt;https://www.instagram.com/oliver_the_baker&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only one profile, &lt;em&gt;brooklynsourdough&lt;/em&gt;, overlapped with the previous list, which shows that this method surfaces a completely different slice of the NYC sourdough community rather than reinforcing earlier results.&lt;/p&gt;

&lt;p&gt;In this case, though, I’m searching a huge city with a very broad brief - not restricting creator size, niche depth or even which part of New York they’re in. The experience was again very different when I tried the same workflow for pizza bakers in Lisbon. Google returned exactly one result that was even remotely relevant (and that turned out to be a pizza equipment store), while Perplexity, across three runs, confidently produced several profiles that simply do not exist. I tightened the system prompt to explicitly forbid inventing handles, but occasional hallucinations still sneak through. Honestly, Instagram is not an easy platform to automate against and both methods struggle in places you wouldn’t expect.&lt;/p&gt;

&lt;p&gt;If you want to try the same searches yourself, here’s the Jupyter notebook I used - you can open it and play with the prompts: &lt;a href="https://github.com/wykra-io/wykra-api-python/blob/main/research/search.ipynb?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api-python/blob/main/research/search.ipynb&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Looking Inside the Profiles
&lt;/h2&gt;

&lt;p&gt;After the discovery step I had around twenty Instagram handles, but I still did not know who was actually relevant. Some looked like real NYC sourdough people, some were just general baking and some might not have been relevant at all. Before going deeper I wanted a quick sanity check that an LLM could at least separate “probably relevant” from “why is this here”.&lt;/p&gt;

&lt;p&gt;I pulled the full profile JSONs from Bright Data’s Instagram dataset. Each snapshot included account-level metadata plus a slice of recent posts, which is great for analysis and terrible if you try to send it to a model as-is. So I wrote a small minimizer in Python. It flattens the raw profiles, skips private accounts, filters out profiles with fewer than 1000 followers, removes any profiles that haven’t posted in the last six months, and then keeps only a short summary:&lt;/p&gt;

&lt;p&gt;– basic profile info such as handle, profile name, followers, posts count, bio, category&lt;br&gt;&lt;br&gt;
– a few engagement and account type signals (business, professional, average engagement)&lt;br&gt;&lt;br&gt;
– a sample of recent posts, sorted by datetime, with caption, datetime, likes and comments&lt;/p&gt;
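&lt;p&gt;A minimal sketch of that minimizer looks something like this - the field names are illustrative, not Bright Data’s exact schema (the real code is in the notebook linked below):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

SIX_MONTHS = timedelta(days=182)

def minimize_profile(raw, now=None, min_followers=1000, sample_posts=5):
    """Reduce a full profile snapshot to the fields the ranking prompt needs.
    Returns None for private, tiny, or inactive accounts.
    Field names here are illustrative, not Bright Data's exact schema."""
    now = now or datetime.now(timezone.utc)
    if raw.get("is_private") or raw.get("followers", 0) < min_followers:
        return None
    # Sort recent posts newest-first (ISO-8601 strings sort chronologically).
    posts = sorted(raw.get("posts", []), key=lambda p: p["datetime"], reverse=True)
    if not posts or now - datetime.fromisoformat(posts[0]["datetime"]) > SIX_MONTHS:
        return None  # nothing published in the last six months
    return {
        "handle": raw.get("account"),
        "profile_name": raw.get("profile_name"),
        "followers": raw.get("followers"),
        "posts_count": raw.get("posts_count"),
        "bio": raw.get("biography"),
        "category": raw.get("category_name"),
        "is_business": raw.get("is_business_account"),
        "avg_engagement": raw.get("avg_engagement"),
        "recent_posts": [
            {"caption": p.get("caption"), "datetime": p["datetime"],
             "likes": p.get("likes"), "comments": p.get("comments")}
            for p in posts[:sample_posts]
        ],
    }
```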

&lt;p&gt;If you want to see the actual data rather than the description:&lt;br&gt;&lt;br&gt;
The full profiles JSON is here: &lt;a href="https://github.com/wykra-io/wykra-api-python/blob/main/research/profiles.json?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api-python/blob/main/research/profiles.json&lt;/a&gt;&lt;br&gt;&lt;br&gt;
The full notebook with the data-collection code is here: &lt;a href="https://github.com/wykra-io/wykra-api-python/blob/main/research/analysis.ipynb?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api-python/blob/main/research/analysis.ipynb&lt;/a&gt;&lt;br&gt;&lt;br&gt;
The reduced version of the profiles is here: &lt;a href="https://github.com/wykra-io/wykra-api-python/blob/main/research/short_profiles.json?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api-python/blob/main/research/short_profiles.json&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the sanity check I used Claude 3.5 Sonnet through pydantic-ai and OpenRouter. The system prompt tells the model what it is looking at and what to do with it, the user prompt is just the minimized profiles plus that query.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;After the profiles are reduced to the fields that actually matter, the model has no trouble ranking them. It reads the bios, looks at the recent posts and places the bakers in a reasonable order, finally something in this pipeline that didn’t fight back:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;An interesting detail: when I compared Claude’s ranking with what Google SERP and Perplexity returned, the final shortlist contained accounts surfaced by all three methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Layer Discovery: Exploring Related Accounts
&lt;/h2&gt;

&lt;p&gt;Next I noticed that each profile snapshot comes with a related_accounts list – basically Instagram’s suggestion graph around that creator. So I took the profiles that Claude ranked the highest in the first pass, grabbed all their non-private related accounts, turned them into profile URLs and ran the same pipeline again: fetch snapshots with Bright Data, minimize them and send the compact JSON into Claude with the same ranking prompt.&lt;/p&gt;
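&lt;p&gt;The hop itself is simple enough to sketch; the &lt;code&gt;related_accounts&lt;/code&gt; entry shape (dicts with &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;is_private&lt;/code&gt;) is an assumption about the snapshot format:&lt;/p&gt;

```python
def related_profile_urls(ranked_profiles, top_n=5):
    """Collect profile URLs for non-private related accounts of the top-ranked profiles."""
    urls, seen = [], set()
    for profile in ranked_profiles[:top_n]:
        for acc in profile.get("related_accounts", []):
            handle = acc.get("username")
            # skip private accounts and deduplicate across profiles
            if handle and not acc.get("is_private") and handle not in seen:
                seen.add(handle)
                urls.append(f"https://www.instagram.com/{handle}/")
    return urls
```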


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;On this second hop the model mostly surfaced established NYC bakeries and cafés rather than home bakers. The top result was lanicosia_bakery (a 100-year-old Bronx bakery) with a relevance score of 4, then zeppieribakery and a couple of NYC-based dessert accounts like atoricafe and bitesbybianca with low scores. Most of the remaining related accounts either weren’t in NYC, weren’t about baking at all or had nothing to do with bread or sourdough, so they didn’t make it into the ranked list.&lt;/p&gt;

&lt;p&gt;Even though the graph hop felt “smart” on paper (“follow who the good bakers are connected to”), in practice it quickly drifted from “NYC home sourdough bakers” to “general NYC food and bakery accounts”, with only a few partially relevant hits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Speeds, Two Jobs - Fast Discovery vs Deep Analysis
&lt;/h2&gt;

&lt;p&gt;Before this point everything I’ve built assumes a pretty simple goal:&lt;br&gt;&lt;br&gt;
"Give me a shortlist of creators who match my prompt."&lt;/p&gt;

&lt;p&gt;For that the flow is fast and relatively efficient: Google + Perplexity → profile snapshots → lightweight relevance scoring → done. It’s the right tool when a user needs quick inspiration, a direction to explore or a starting point for outreach. But that flow collapses the moment the question changes from "Who should I look at?" to "Is this creator actually good?".&lt;/p&gt;

&lt;p&gt;A real evaluation (pulling all posts, reels, captions, timestamps and comments for the past 3–6 months, checking formats, identifying sponsored content, measuring post-level engagement, analyzing content topics) is a completely different workload. Running it for ten creators at once would be both slow and unnecessarily expensive. And honestly, most users don’t need that for a discovery task.&lt;/p&gt;

&lt;p&gt;Which is why Wykra needs two separate modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Fast Discovery (the default)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You get a shortlist of accounts ranked by relevance. Enough to browse, compare and filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Deep Dive on Demand&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When a user says: “This creator looks promising, analyze them properly.”&lt;br&gt;&lt;br&gt;
That’s when we pull the full dataset. It’s slower, more resource-heavy and it should be opt-in. But it gives an actual, trustworthy picture of a single influencer.&lt;/p&gt;

&lt;p&gt;Most importantly, this matches real workflows: sometimes you want a list; sometimes you want the truth.&lt;/p&gt;

&lt;p&gt;I took one of the creators Claude ranked highly, &lt;strong&gt;aya_eats_&lt;/strong&gt; (11218 followers, avg engagement 0.7181), and pulled their recent posts and reels for the past six months. Instagram essentially has three content types: posts, reels and stories. Reels dominate attention these days, posts still matter for evergreen content, and stories would be valuable for analysis except they can’t be scraped, which is unfortunate because they’re the only thing I personally ever watch. So I just threw all posts and reels into one DataFrame and sent the JSON to Claude to see what kind of basic analysis it would come up with.&lt;/p&gt;
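&lt;p&gt;The merge itself is a small pandas step; column names like &lt;code&gt;datetime&lt;/code&gt; and &lt;code&gt;likes&lt;/code&gt; are assumptions about the scraped fields:&lt;/p&gt;

```python
import pandas as pd

def combine_content(posts, reels):
    """Stack posts and reels into one frame, tagged by type, newest first."""
    df = pd.concat(
        [
            pd.DataFrame(posts).assign(content_type="post"),
            pd.DataFrame(reels).assign(content_type="reel"),
        ],
        ignore_index=True,
    )
    df["datetime"] = pd.to_datetime(df["datetime"])
    return df.sort_values("datetime", ascending=False).reset_index(drop=True)
```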

&lt;p&gt;To see what this looks like in practice, here’s the raw output it produced:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;I checked the same data with a few simple Pandas summaries and the results were almost identical. Reels absolutely dominate: around 3,300 likes and 36K views on average, compared to posts that barely hit 28 likes. The posting rhythm is steady: roughly 1.5–2 posts per week, with activity increasing from September to November and everything lands in the evening hours (20:00–23:00). The hashtag usage matches the themes Claude picked up: baking, Asian recipes and seasonal content. Engagement by theme also tells the same story: Asian-food reels perform an order of magnitude better than anything else. Brand presence shows up through light, organic mentions (@bobsredmill, @vitalfarms, @staub_usa) and there are zero paid partnerships, which supports the “authentic home cooking” impression.&lt;/p&gt;
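&lt;p&gt;For reference, the pandas side of that sanity check looks roughly like this, assuming a frame with &lt;code&gt;datetime&lt;/code&gt;, &lt;code&gt;likes&lt;/code&gt; and &lt;code&gt;content_type&lt;/code&gt; columns (the column names are assumptions):&lt;/p&gt;

```python
import pandas as pd

def quick_summaries(df):
    """Average likes per content type, posting cadence and the peak posting hour."""
    avg_likes = df.groupby("content_type")["likes"].mean().to_dict()
    posts_per_week = df.set_index("datetime").resample("W").size().mean()
    peak_hour = df["datetime"].dt.hour.value_counts().idxmax()
    return {"avg_likes": avg_likes, "posts_per_week": posts_per_week, "peak_hour": peak_hour}
```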

&lt;p&gt;Testing with a small creator is cute, but the patterns barely move. For something closer to reality (spikes, trends, actual performance curves), I ran the same workflow on a bigger creator, my favorite fashion blogger _liullland (~187k). I pulled 186 posts and reels over the last six months, capped it at 100 items so Claude wouldn’t choke and asked it to summarise what’s going on. The full JSON + analysis is in this gist:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Then I dropped the same data into pandas and ran a few simple charts: likes over time for posts vs reels, top hashtags and a scatter of views vs likes for reels.&lt;/p&gt;
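&lt;p&gt;The “top hashtags” part of those charts boils down to a regex and a counter; assuming captions are just a list of strings:&lt;/p&gt;

```python
import re
from collections import Counter

def top_hashtags(captions, n=10):
    """Count hashtags across captions, case-insensitively."""
    tags = Counter()
    for caption in captions:
        tags.update(t.lower() for t in re.findall(r"#\w+", caption or ""))
    return tags.most_common(n)
```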

&lt;p&gt;Reels dominate the account - every major spike comes from video, while static posts stay almost flat. There’s also a noticeable spike in the last month: engagement jumps sharply and I have no idea what caused it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnncnyu50fdmehv542zgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnncnyu50fdmehv542zgu.png" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m not a fashion blogger, but even I can see the hashtags repeat a lot, mostly fashion/GRWM variations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovltxika0kos2xmsy2dn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovltxika0kos2xmsy2dn.png" width="790" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the views-vs-likes scatterplot shows a strong correlation: no weird dead-view content, plus the occasional viral reel that pushes the whole account upward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi6suofqy2m6w7hi26qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi6suofqy2m6w7hi26qz.png" width="590" height="489"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;By this point I’m pretty sure nobody is still reading, so it’s a perfect moment to stop here and continue the deeper analysis next week.&lt;/p&gt;

&lt;p&gt;All this scraping, ranking, filtering and re-checking also made something obvious: we shouldn’t throw away the results we already spent time and money collecting. If Wykra finds a solid creator, that data should be stored and reused instead of fetched again from scratch. And we should explicitly ask the user whether a suggestion was useful or not - that feedback needs to be saved too.&lt;/p&gt;

&lt;p&gt;The data will still have to be refreshed periodically (otherwise we’d just turn into an outdated Instagram directory), but at least future lookups won’t require rebuilding everything from zero.&lt;/p&gt;

&lt;p&gt;Next week I’ll continue the analysis and dig a bit deeper into the data we can reliably scrape and interpret.&lt;br&gt;&lt;br&gt;
If you want to support the project, ⭐️ the repo and follow me on Twitter/X - it really helps.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Subscribe on &lt;a href="https://datobra.com" rel="noopener noreferrer"&gt;datobra.com&lt;/a&gt; to not miss new posts. Updates: &lt;a href="https://x.com/ohthatdatagirl" rel="noopener noreferrer"&gt;@ohthatdatagirl&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
    <item>
      <title>Build in Public: Week 2. How Do People Even Find Influencers?</title>
      <dc:creator>Olga Braginskaya</dc:creator>
      <pubDate>Sat, 15 Nov 2025 15:52:00 +0000</pubDate>
      <link>https://forem.com/olgabraginskaya/build-in-public-week-2-how-do-people-even-find-influencers-40dn</link>
      <guid>https://forem.com/olgabraginskaya/build-in-public-week-2-how-do-people-even-find-influencers-40dn</guid>
      <description>&lt;p&gt;If you remember from the last &lt;a href="https://dev.to/olgabraginskaya/build-in-public-week-1-o8a"&gt;update&lt;/a&gt;, we had our first real conflict: Node.js vs Python. Well, democracy has spoken. My "many" LinkedIn followers voted and the winner is Node.js. So this week we’re continuing with one backend, one direction and slightly fewer arguments. But I’m still going to run experiments and do the analysis in Python, sorry not sorry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ps02ed0vumpj75nzni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ps02ed0vumpj75nzni.png" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also wanted to show the activity from the Build in Public posts so far: Day Zero passed 1.5k views, Week 1 is close to 600, and together they brought a nice mix of comments and reactions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz5ppjxokbkxzf05a8xj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz5ppjxokbkxzf05a8xj.png" width="800" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the same week my personal blog on &lt;a href="https://www.datobra.com/" rel="noopener noreferrer"&gt;datobra.com&lt;/a&gt; gained a shocking total of two new followers, which I honestly count as an achievement. In a world where 90% of the internet text is written by models, getting real humans to read anything feels harder and harder. These days it sometimes feels like blogs are mostly read by AI agents rather than actual humans, which I guess means that if AI ever replaces all of us, I’ll at least have a very loyal audience.&lt;/p&gt;

&lt;p&gt;I also have Google Analytics on the blog, but so far it either updates painfully slowly or simply ignores traffic from Dev.to reposts. And Dev.to’s built-in stats don’t show much detail about where readers come from or what countries actually click. At some point I’ll need a proper analytics setup, but for now I’m mostly guessing. Maybe the answer is to pay for yet another tool - why stop now, right?&lt;/p&gt;




&lt;h2&gt;
  
  
  The part no one warns you about
&lt;/h2&gt;

&lt;p&gt;When you start building something of your own, an open-source tool, a startup or even a simple blog, no one tells you that actual coding or writing will take maybe half of your time on a good week. The rest quietly turns into research, planning, positioning, marketing, brand decisions and the never-ending "wait, who is this actually for?" loop.&lt;/p&gt;

&lt;p&gt;So this week we did the obvious-but-avoided thing: we found a real marketer who actually runs influencer campaigns, sat them down and asked the boring-but-critical questions - how do you search, what tools do you trust, what looks good, what looks suspicious and what’s a total time sink? Since we’re building an agent to help with this, we wanted the manual reality, not assumptions.&lt;/p&gt;

&lt;p&gt;Spoiler: it’s way less automated than we expected and way more like detective work. Here’s what we learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works better: paid ads or influencer marketing?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Both work, just for different things. Influencer posts perform better when the product is clear and straightforward. Paid ads feel more like throwing darts in the dark and hoping the algorithm shows mercy that day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do marketers classify influencers?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Mostly by follower count. The rough tiers are:&lt;br&gt;&lt;br&gt;
• Micro: &amp;lt;10K&lt;br&gt;&lt;br&gt;
• Mid: 10–50K&lt;br&gt;&lt;br&gt;
• Macro: 50–100K+&lt;br&gt;&lt;br&gt;
• Million-plus creators&lt;br&gt;&lt;br&gt;
Interestingly, collaborations below 100K often convert best because they are big enough to matter and still human enough to trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do they actually find influencers?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
• Instagram hashtag searches&lt;br&gt;&lt;br&gt;
• Aggregators like LiveDune or PopularMetrics&lt;br&gt;&lt;br&gt;
• Instagram Explore page&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do they check if an influencer is "real" or inflated?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
• they look at how much post reach varies: identical numbers across posts are usually a sign something’s off&lt;br&gt;&lt;br&gt;
• they read the comments: they should make sense for the post, not be random emojis or sudden floods on a quiet post; and yes, "activity chats" where creators buy or trade comments are very much a thing  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a difference in how social platforms are used across countries?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It turns out platform popularity depends completely on where you are. In some countries Facebook feels like a museum piece, while in others it’s still the first place people go to look for creators. TikTok delivers plenty of views almost everywhere and is cheaper, but those views don’t always turn into purchases: lots of attention, not a lot of buying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do marketers discover new tools?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Often through trend-setters inside the marketing world, the people whose posts, reviews or casual mentions other marketers pay attention to. A tool shows up in a TikTok, a LinkedIn post, a newsletter or even a Google ad, gains a bit of momentum and suddenly everyone is checking it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What did we take from all this?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
After talking through the whole process, two things became very clear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Influencer discovery is manual labor.&lt;/li&gt;
&lt;li&gt;Evaluating quality is even more manual.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our working assumption now is that roughly 90% of this workflow can be automated or at least semi-automated without losing the human judgment that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And about pricing…&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you’re curious about pricing for existing tools, it’s ambitious.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;basic tools start around &lt;strong&gt;$30–50/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;mid-tier platforms for small/medium brands: &lt;strong&gt;$200–500/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;enterprise influencer-marketing suites: &lt;strong&gt;$1,000+ per month&lt;/strong&gt;, often custom-priced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So there’s definitely room for something more flexible and founder-friendly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking ahead
&lt;/h2&gt;

&lt;p&gt;By now it’s pretty clear that this project has two completely different brains living inside it. One is about finding influencers automatically. The other is about figuring out whether those influencers are actually any good.&lt;/p&gt;

&lt;p&gt;This week we stayed on the discovery side: we tried a few automatic approaches for finding influencers, and that will be the focus of the next chapter.&lt;/p&gt;

&lt;p&gt;Weeks 3–4 will shift to the "are they real or are they just very committed to pretending?" part. I’ll put together a Jupyter notebook, pick one obviously fake influencer and one normal one and compare their stats side by side. With any luck, we’ll be able to see a clear, measurable difference, something that actually shows up in the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do We Actually Find Influencers Automatically?
&lt;/h2&gt;

&lt;p&gt;During the challenge we already tried ChatGPT, Claude and Google Search through &lt;a href="https://docs.brightdata.com/mcp-server/overview?ref=datobra.com" rel="noopener noreferrer"&gt;Bright Data’s MCP&lt;/a&gt; server. The results were inconsistent: outdated data, noisy links, and occasional hallucinated accounts.&lt;/p&gt;

&lt;p&gt;Strictly speaking we need something that uses fresher data and can actually retrieve it. In practice this means either a tightly constrained Google search or a model that can browse the web, which is &lt;a href="https://www.perplexity.ai/help-center/en/articles/10352895-how-does-perplexity-work?ref=datobra.com" rel="noopener noreferrer"&gt;Perplexity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This week we tried all of these options and added a &lt;a href="https://github.com/wykra-io/wykra-api-python/blob/main/research/search.ipynb?ref=datobra.com" rel="noopener noreferrer"&gt;Jupyter notebook&lt;/a&gt; where you can run them directly. Let’s look at the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Google → Instagram search (Bright Data Google SERP dataset)
&lt;/h3&gt;

&lt;p&gt;This is the most literal approach: treat Google as a restricted Instagram search engine using queries like:&lt;br&gt;&lt;br&gt;
&lt;code&gt;site:instagram.com "sourdough" "NYC baker"&lt;br&gt;
site:instagram.com "AI tools" OR "data engineer" OR "#buildinpublic"&lt;br&gt;
site:instagram.com "indie maker" OR "solopreneur" "reels"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We did run into one issue here: some of the older Bright Data documentation is outdated. The Google SERP dataset has now fully moved under the Web Scraper API, so you no longer need to create a separate SERP API zone. You can call the dataset directly, pass the query and get the results in a single step.&lt;/p&gt;

&lt;p&gt;The method is predictable if the query is specific enough, but still prone to SEO noise.&lt;/p&gt;
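&lt;p&gt;A minimal sketch of this approach: build the restricted query, then keep only links that look like top-level profiles. The SERP result shape (dicts with a &lt;code&gt;link&lt;/code&gt; key) is an assumption, not taken from the Bright Data docs:&lt;/p&gt;

```python
import re

# Top-level profile pages only: one path segment after instagram.com/
PROFILE_RE = re.compile(r"^https?://(www\.)?instagram\.com/([A-Za-z0-9._]+)/?$")

def build_query(phrases):
    """Turn a list of phrases into a restricted Google query."""
    quoted = " ".join(f'"{p}"' for p in phrases)
    return f"site:instagram.com {quoted}"

def profile_urls(serp_results):
    """Keep only links that look like profile pages, not posts or reels."""
    urls = []
    for r in serp_results:
        m = PROFILE_RE.match(r.get("link", ""))
        if m and m.group(2) not in ("p", "reel", "explore"):
            urls.append(r["link"])
    return urls
```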

&lt;h3&gt;
  
  
  2. Google → Instagram search (Bright Data AI Mode Google dataset)
&lt;/h3&gt;

&lt;p&gt;I also found that the Web Scraper Library now includes an &lt;a href="https://docs.brightdata.com/datasets/scrapers/scrapers-library/ai-scrapers?ref=datobra.com" rel="noopener noreferrer"&gt;AI-Google mode&lt;/a&gt; under the same interface, so it doesn’t require any separate configuration. You just pass a natural-language prompt and it generates the Google query for you.&lt;/p&gt;

&lt;p&gt;"Find Instagram profiles of NYC sourdough bakers. Use site:instagram.com and return profile URLs only."&lt;/p&gt;

&lt;p&gt;The output comes back as structured JSON. It works, but the results are noticeably softer than strict keyword queries, sometimes correct, sometimes a bit too creative.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Perplexity → Instagram search (Bright Data Web Scrapers Library)
&lt;/h3&gt;

&lt;p&gt;Next we tried letting Perplexity handle the discovery directly inside the &lt;a href="https://docs.brightdata.com/datasets/scrapers/scrapers-library/ai-scrapers?ref=datobra.com" rel="noopener noreferrer"&gt;Web Scrapers Library&lt;/a&gt;. You provide a prompt and the scraper runs Perplexity’s browsing on top of it.&lt;/p&gt;

&lt;p&gt;Example prompt: "Find Instagram profiles of NYC sourdough bakers.&lt;br&gt;&lt;br&gt;
Return 15 profile URLs only. Prefer individuals, not brands."&lt;/p&gt;

&lt;p&gt;The output is cleaner and more focused than AI-Google Mode, but there’s no control over which Perplexity model is used or how it behaves.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Direct Perplexity call (OpenRouter)
&lt;/h3&gt;

&lt;p&gt;This gives the same type of results as the Web Scrapers version but with full control over the model. You can choose the exact &lt;a href="https://openrouter.ai/perplexity?ref=datobra.com" rel="noopener noreferrer"&gt;Perplexity model&lt;/a&gt;, adjust its parameters and force the output format you want.&lt;/p&gt;

&lt;p&gt;While working on this method it occurred to me that you don’t have to ask for creators directly. You can also do it in two steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step A: ask Perplexity for relevant hashtags&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
"Give me 10 Instagram hashtags used by indie makers and AI builders in 2024–2025."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step B: ask for creators under those hashtags&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
"For each hashtag, list up to 10 active creators (&amp;lt;100K followers) with real engagement."&lt;/p&gt;
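&lt;p&gt;The two steps are easy to express as plain templates. The wording mirrors the examples above; this is a sketch, not the exact prompts from the notebook:&lt;/p&gt;

```python
# Illustrative two-step prompt templates for the hashtag-first flow.
HASHTAG_PROMPT = (
    "Give me {n} Instagram hashtags used by {niche} in 2024-2025. "
    "Return one hashtag per line."
)
CREATOR_PROMPT = (
    "For the hashtag {tag}, list up to {k} active creators "
    "(under 100K followers) with real engagement. Return handles only."
)

def two_step_prompts(niche, hashtags, n=10, k=10):
    """Step A prompt for the niche, then one Step B prompt per hashtag."""
    step_a = HASHTAG_PROMPT.format(n=n, niche=niche)
    step_b = [CREATOR_PROMPT.format(tag=t, k=k) for t in hashtags]
    return step_a, step_b
```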

&lt;p&gt;We ended up turning the strongest variants into API endpoints inside Wykra, so you can try them directly via:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/wykra-io/wykra-api?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full research notebook with output JSONs is here: &lt;a href="https://github.com/wykra-io/wykra-api-python/blob/main/research/search.ipynb?ref=datobra.com" rel="noopener noreferrer"&gt;https://github.com/wykra-io/wykra-api-python/blob/main/research/search.ipynb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You’re welcome to follow, leave a like or drop a star on the repo and if you have ideas or feedback, we’re always happy to hear them.&lt;/p&gt;

&lt;p&gt;Subscribe on &lt;a href="https://datobra.com" rel="noopener noreferrer"&gt;datobra.com&lt;/a&gt; to not miss new posts. Updates: &lt;a href="https://x.com/ohthatdatagirl" rel="noopener noreferrer"&gt;@ohthatdatagirl&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>development</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
