<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ivan Zakutnii</title>
    <description>The latest articles on Forem by Ivan Zakutnii (@m0n0x41d).</description>
    <link>https://forem.com/m0n0x41d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1189699%2Fbbe54712-d7b9-4b0f-9474-b7602ebcd6d3.jpg</url>
      <title>Forem: Ivan Zakutnii</title>
      <link>https://forem.com/m0n0x41d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/m0n0x41d"/>
    <language>en</language>
    <item>
      <title>🎰 Stop Gambling with Vibe Coding: Meet Quint</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Wed, 17 Dec 2025 21:20:49 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/stop-gambling-with-vibe-coding-meet-quint-1f3j</link>
      <guid>https://forem.com/m0n0x41d/stop-gambling-with-vibe-coding-meet-quint-1f3j</guid>
      <description>&lt;p&gt;Let's be real for a second. Prompting Claude, Cursor, or ChatGPT feels amazing... until it doesn't.&lt;/p&gt;

&lt;p&gt;You know the cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You type a vague prompt like "Make me an auth system."&lt;/li&gt;
&lt;li&gt;The AI spits out 200 lines of beautiful-looking TypeScript.&lt;/li&gt;
&lt;li&gt;You get that dopamine hit. "I am a 10x engineer!"&lt;/li&gt;
&lt;li&gt;You run it.&lt;/li&gt;
&lt;li&gt;ERROR: undefined is not a function.&lt;/li&gt;
&lt;li&gt;You spend the next 4 hours debugging code you didn't write and barely understand.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s not engineering. That’s a slot machine with syntax highlighting.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem: AI (and, Frequently, Most of Us) Has No "Chill"
&lt;/h4&gt;

&lt;p&gt;Vibe coding is chaotic because LLMs are people-pleasers. They want to give you code now, regardless of whether it actually makes sense architecturally. They lack a Thinking Framework. They don't check invariants.&lt;/p&gt;

&lt;p&gt;They just... vibe.&lt;/p&gt;

&lt;p&gt;What if you could force the AI to actually think before it types?&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Quint 🛠️
&lt;/h3&gt;

&lt;p&gt;Quint is a tiny, no-nonsense toolkit for AI-assisted engineering.&lt;/p&gt;

&lt;p&gt;It’s not a new IDE. It’s not a bloated SaaS wrapper. It’s (currently) a set of CLI commands that act as a "Thinking OS" for your collaboration with AI tools – making both of you more rigorous.&lt;/p&gt;

&lt;p&gt;Whether you use Claude Code, Cursor, Gemini CLI, or Codex, Quint sits in the middle and says: "Hey AI, before you write that function, prove to me it won't break the build."&lt;/p&gt;

&lt;h3&gt;
  
  
  Why You Should Care (Right Now)
&lt;/h3&gt;

&lt;p&gt;The current version is a command set only. No UI. No dependencies to break anything. Tiny, with almost zero context overhead for your Claude Code or Cursor.&lt;/p&gt;

&lt;p&gt;But we’ve been using it in real scenarios, tackling highly complex engineering and even marketing tasks, and you know what... the difference in result quality is wild.&lt;/p&gt;

&lt;p&gt;Instead of getting "plausible spaghetti," we get decision records that actually respect the goals and are linked with evidence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;It just makes sense.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What's Under The Hood? 🧠
&lt;/h4&gt;

&lt;p&gt;The current version of Quint Code implements about 10% of the &lt;a href="https://github.com/ailev/FPF" rel="noopener noreferrer"&gt;First Principles Framework&lt;/a&gt; (FPF) – an original and brilliant, but formal and very complex, specification of a "thinking OS" created by Anatoly Levenchuk.&lt;/p&gt;

&lt;p&gt;Now, I know what you're thinking: "Only 10%? Why release it?"&lt;/p&gt;

&lt;p&gt;Because that 10% is the Pareto Principle in action. It turns out you don't need a PhD in Formal Logic to improve AI and AI + Human collaborative reasoning. &lt;/p&gt;

&lt;p&gt;You just need to force the AI to acknowledge a few Invariants and Reasoning Chains, and then act as an External Transformer, like an Oracle or... an Overseer.&lt;/p&gt;

&lt;p&gt;Even this minimal implementation forces AI agents to plan decisions, and thus the follow-up work, much better than any heuristic planner or to-do list.&lt;/p&gt;

&lt;h4&gt;
  
  
  🔮 The Near Future: v4.0.0 &amp;amp; The MCP Hype Train
&lt;/h4&gt;

&lt;p&gt;We are close to dropping v4.0.0, and it’s going to be a banger.&lt;/p&gt;

&lt;p&gt;We're introducing a tiny MCP (Model Context Protocol) server that will handle the FPF kernel and Invariants better, backed by a local SQLite database plus the same Markdown files.&lt;/p&gt;

&lt;p&gt;It will allow Quint to feed the AI persistent context about your project's "Laws of Physics," as well as your rules and past decisions mostly automatically.&lt;/p&gt;

&lt;p&gt;We're aiming for ~75% FPF Invariants support with this MCP. Still small. Still focused. But dramatically smarter.&lt;/p&gt;

&lt;h4&gt;
  
  
  🧪 Try It. Break It. Roast It.
&lt;/h4&gt;

&lt;p&gt;I don't want "polite" feedback. I want feedback from devs who are actually in the trenches using Cursor/Claude Code daily. Who have zero ideas about this scientific stuff and all the systems engineering formalities and methods that I do love, but the devs, as I said - in the trenches. And I need the feedback from the trenches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it fit your flow?&lt;/li&gt;
&lt;li&gt;Is the README too confusing?&lt;/li&gt;
&lt;li&gt;Did it save you from a hallucination?&lt;/li&gt;
&lt;li&gt;Did it help you to plan complex tasks better?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔗 Link to Quint Code Repo: &lt;a href="https://github.com/m0n0x41d/quint-code" rel="noopener noreferrer"&gt;https://github.com/m0n0x41d/quint-code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go ahead. Try it out. If it sucks, tell me why. If it fixes your vibe coding hangover, tell your friends.&lt;/p&gt;

&lt;p&gt;Quint is a small tool, but it has damn big brain energy.&lt;/p&gt;




&lt;p&gt;Thanks for reading,&lt;br&gt;
Ivan Zakutnii&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>ai</category>
      <category>systemsthinking</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Exception Generation is Pure Evil</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Tue, 05 Aug 2025 18:47:02 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/exception-generation-is-pure-evil-5bda</link>
      <guid>https://forem.com/m0n0x41d/exception-generation-is-pure-evil-5bda</guid>
      <description>&lt;p&gt;Exception generation. You certainly know what this is and have definitely used it, if you've programmed even a little.&lt;/p&gt;

&lt;p&gt;Here's news for you: exception generation is very often bad.&lt;/p&gt;

&lt;p&gt;In any situation where a &lt;code&gt;try/except&lt;/code&gt; (or try/catch) block is used for something other than "wrapping" side effects (reading files, going over the network) or isolating a single call (for targeted debugging, or simply because there's just no other way) — it's better to get rid of it.&lt;/p&gt;
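&lt;p&gt;A minimal Python sketch of that distinction (the config-loading helpers here are illustrative, not from any particular codebase):&lt;/p&gt;

```python
import json

# Legitimate: wrapping a side effect (file I/O) that can fail for
# reasons outside the program's control.
def load_config(path):
    try:
        with open(path) as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError):
        return {}  # fall back to defaults; the failure is external

# Questionable: an exception used for ordinary control flow...
def get_port(config):
    try:
        return int(config["port"])
    except KeyError:
        return 8080

# ...when a plain check expresses the same logic directly.
def get_port_plain(config):
    if "port" in config:
        return int(config["port"])
    return 8080
```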

&lt;p&gt;But why? Because such &lt;code&gt;try/except&lt;/code&gt; blocks seriously confuse two "processes":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The program's execution flow&lt;/li&gt;
&lt;li&gt;The thought flow of the developer who will read the code&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Think about it yourself — &lt;em&gt;any&lt;/em&gt; instruction in a try/except block can throw an exception.&lt;/p&gt;

&lt;p&gt;Does it seem like each &lt;em&gt;such&lt;/em&gt; block increases the complexity of our program's execution flow graph &lt;em&gt;exponentially&lt;/em&gt;? It doesn't just seem so.&lt;/p&gt;

&lt;p&gt;Modern static analyzers experience... &lt;a href="https://www.sciencedirect.com/topics/computer-science/control-flow-analysis" rel="noopener noreferrer"&gt;some difficulties&lt;/a&gt; with "invisible execution paths".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't miss anything from me: &lt;a href="https://linktr.ee/m0n0x41d" rel="noopener noreferrer"&gt;https://linktr.ee/m0n0x41d&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  
  
  Brief historical background
&lt;/h4&gt;

&lt;p&gt;People started talking about exceptions a very long time ago — back in the 40s and 50s. The first software implementation appeared in LISP, which had two functions: &lt;code&gt;ERROR&lt;/code&gt;, called in case of a program failure, and &lt;code&gt;ERRORSET&lt;/code&gt;, which returned — attention — either an &lt;em&gt;error code&lt;/em&gt; or &lt;code&gt;NIL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In other words, this LISP story is not at all about the exceptions we have now... The concept evolved and developed to the point where, instead of error codes, we started &lt;em&gt;generating&lt;/em&gt; exceptions — namely, &lt;em&gt;creating exception conditions&lt;/em&gt; and handling them &lt;strong&gt;separately&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Exception generation reached mass adoption in the 80s and 90s.&lt;/p&gt;

&lt;p&gt;And here's the result. From a &lt;a href="https://users.ece.cmu.edu/~koopman/des_s99/exceptions/" rel="noopener noreferrer"&gt;study&lt;/a&gt; by Carnegie Mellon from 1999:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Exception failures are estimated to cause two thirds of system crashes and fifty percent of computer system security vulnerabilities. Exception handling is especially important in embedded and real-time computer systems because software in these systems cannot easily be fixed or replaced, and they must deal with the unpredictability of the real world. Robust exception handling in software can improve software fault tolerance and fault avoidance, &lt;strong&gt;but no structured techniques exist for implementing dependable exception handling.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I'm writing this article in 2025, and there's still no structured, reliable, implemented method for exception handling.&lt;/p&gt;

&lt;p&gt;Mainstream OOP languages provide the same old try/catch, and frameworks spread several layers of abstraction evenly on top, further blurring error boundaries and often leaving developers no way to handle errors except... wrapping it all in yet another try/catch.&lt;/p&gt;

&lt;p&gt;On top of that, exception generation can seriously hurt performance. Here's some fascinating &lt;a href="https://www.baeldung.com/java-exceptions-performance" rel="noopener noreferrer"&gt;reading&lt;/a&gt; from the Java world. And it's not limited to Java — &lt;a href="https://johnfarrier.com/c-error-handling-strategies-benchmarks-and-performance/" rel="noopener noreferrer"&gt;here's&lt;/a&gt; an article about experiments in C++. In Python, especially since 3.11, the situation seems less dramatic: exceptions that actually &lt;em&gt;occur&lt;/em&gt; slow code down by &lt;em&gt;just 3-5x&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I'll just leave &lt;a href="https://docs.python.org/2/faq/design.html#how-fast-are-exceptions" rel="noopener noreferrer"&gt;this&lt;/a&gt; here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;code&gt;try/except&lt;/code&gt; block is extremely efficient &lt;strong&gt;if no exceptions are raised.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;They say that once upon a time, long, long ago, the grass was greener and programs were easier to verify. The times before exception generation came to dominate were the times of Structured Programming — engineers wrote code using conditionals, loops, and subroutines (functions with one entry and one exit).&lt;/p&gt;

&lt;p&gt;Programs flowed naturally, "top-down". Thanks to such determinism, they were easier to read for both people and machines — the first static code analyzer, &lt;code&gt;lint&lt;/code&gt;, was publicly released in 1979. Well, you already know what happened next.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do we do with all this?
&lt;/h3&gt;

&lt;p&gt;You can act like &lt;a href="https://google.github.io/styleguide/cppguide.html#Exceptions" rel="noopener noreferrer"&gt;Google&lt;/a&gt; (especially if you write in C++):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We do not use C++ exceptions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In any other popular object-oriented language you need to... stop using exceptions :)&lt;/p&gt;

&lt;p&gt;Or &lt;em&gt;at least&lt;/em&gt; don't bury them deep in the code — let the try/catch block wrap a function right where it's called (outside or inside). Yes, this might add some boilerplate, but then, when exceptions occur, the execution flow returns quickly to the deterministic, well-trodden "path".&lt;/p&gt;
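&lt;p&gt;A sketch of what that looks like in Python: the try/except sits at one shallow call boundary instead of being buried inside the helpers (the record format here is made up for illustration):&lt;/p&gt;

```python
# No try/except inside the helper: let failures propagate to the caller.
def parse_record(line):
    name, value = line.split(":")
    return name.strip(), int(value)

def process(lines):
    results, errors = [], []
    for line in lines:
        try:  # one shallow boundary, right where the function is called
            results.append(parse_record(line))
        except ValueError as exc:
            errors.append((line, exc))  # flow returns to the main path fast
    return results, errors
```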

&lt;p&gt;Either way, we &lt;em&gt;can&lt;/em&gt; find a way to program without using exceptions for flow control.&lt;/p&gt;

&lt;p&gt;And you can also remember Golang. No matter how angular and dumbed down it might be, its biggest design decision aimed at &lt;em&gt;strengthening&lt;/em&gt; the Go code developers produce is forcing them to handle every error explicitly. Yes, Go has defer/panic/recover, but you'd have to work hard to severely twist and mangle the execution flow with them.&lt;/p&gt;

&lt;p&gt;And finally — functional programming. Result monad, for example. Curtain.&lt;br&gt;
                                                ¯\_(ツ)_/¯&lt;/p&gt;

&lt;p&gt;However, I think most developers will find it much easier to simply abandon leaking raise/throw — look at your try/except carefully — maybe you can use &lt;code&gt;if&lt;/code&gt; here? Look at &lt;code&gt;raise&lt;/code&gt; — how about &lt;code&gt;return&lt;/code&gt;?&lt;/p&gt;
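&lt;p&gt;A before/after sketch of that last suggestion, with illustrative names; here &lt;code&gt;None&lt;/code&gt; serves as the explicit "not found" value:&lt;/p&gt;

```python
# Before: the "leaking raise" version.
def find_user_raising(users, user_id):
    for user in users:
        if user["id"] == user_id:
            return user
    raise LookupError("user not found")

# After: raise becomes return, and the caller's try/except becomes an if.
def find_user(users, user_id):
    for user in users:
        if user["id"] == user_id:
            return user
    return None

users = [{"id": 1, "name": "Ada"}]
found = find_user(users, 2)
if found is None:
    greeting = "who are you?"
else:
    greeting = "hello, " + found["name"]
```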

</description>
      <category>softwareengineering</category>
      <category>errors</category>
    </item>
    <item>
      <title>Why Augmented Intelligence after all?</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Wed, 28 May 2025 19:01:12 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/why-augmented-intelligence-after-all-5118</link>
      <guid>https://forem.com/m0n0x41d/why-augmented-intelligence-after-all-5118</guid>
      <description>&lt;h2&gt;
  
  
  Why Augmented after all?
&lt;/h2&gt;

&lt;p&gt;Why, when talking about AI, do I almost always primarily mean Augmented, not Artificial Intelligence? &lt;/p&gt;

&lt;p&gt;What is Augmented Intelligence anyway? &lt;/p&gt;

&lt;p&gt;In plain terms: intelligence that augments the human, rather than an artificial replacement for them.&lt;/p&gt;

&lt;p&gt;It's quite difficult to pin down who coined the term, because the idea began developing long ago, well before the appearance of transformers, roughly around the same time AI was first talked about as Artificial (the late 60s and early 70s).&lt;/p&gt;

&lt;p&gt;Augmented Intelligence is talked about in many ways – as a system design pattern, and as an approach to "human enhancement" and productivity improvement. The word Augmented appears in the context of theoretical LLM training methodologies and much more. Some, for example, would sooner think of AR, and some might even think of Musk's Neuralink.&lt;/p&gt;

&lt;p&gt;Therefore, in this post I want to talk about the interpretation from &lt;em&gt;my&lt;/em&gt; point of view, why I use this term, and how seriously I take it.&lt;/p&gt;

&lt;p&gt;My interpretation is grounded in my own experience and still gravitates toward a "design pattern"; we just need to figure out the &lt;em&gt;design&lt;/em&gt; and the pattern of &lt;em&gt;what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;TL;DR:&lt;/p&gt;

&lt;p&gt;Artificial Intelligence is about autonomous, independent agents that work completely without humans 🤘🏻&lt;br&gt;
Augmented Intelligence is about something like a symbiosis of human and AI, where each strengthens the other 👤🤝🏻🤖&lt;/p&gt;

&lt;p&gt;In reality, the second (augmented) already works much more often and much better than the first; it is already far more present than completely independent artificial intelligence. And as the technology develops, the symbiosis mindset still looks more appealing than the "exploitation" mindset, even when (not if) artificial neural networks become even more cognitively powerful!&lt;/p&gt;

&lt;p&gt;Why exploitation? Because in the human "collective unconscious," our species' relationship to Artificial intelligence is already fully forming as a proprietary one. Have you seen all those memes about bullying and threatening poor ChatGPT?&lt;/p&gt;

&lt;p&gt;Being the boss might not be bad, but cultivating a boss mindset is, in my opinion, counterproductive.&lt;/p&gt;

&lt;p&gt;Let's return to my experience and move away from science fiction and anthropological speculation. &lt;/p&gt;

&lt;h2&gt;
  
  
  Mindset matters
&lt;/h2&gt;

&lt;p&gt;I've already started talking about mindset, and this is very important. Unfortunately or fortunately, AI is quite limited, especially as soon as we try to apply it in a narrow, vertical niche. &lt;/p&gt;

&lt;p&gt;When communicating with your typical ChatGPT or Claude, many people get the impression that "it's very smart" and "it can do almost anything." This impression forms not only in individuals, but also in companies and entrepreneurs who attempt to build a new business around LLMs, or to start one from scratch.&lt;/p&gt;

&lt;p&gt;Even quite experienced engineers and managers can catch themselves believing that AI systems are something completely magical yet quite simple to implement – "well, it's already smart, right? It'll figure out our mess somehow, and we'll live happily ever after!"&lt;/p&gt;

&lt;p&gt;From this, another misconception follows – that AI-related projects don't need to be thought through or modeled at all.&lt;/p&gt;

&lt;p&gt;In reality, this turns out to be a path to nowhere. Teams hit difficulties lightning-fast and spend a long time learning the hard way what an LLM can and cannot do, ultimately arriving at the fact that people have &lt;em&gt;one language&lt;/em&gt; in their heads while the model has a completely &lt;em&gt;different language&lt;/em&gt;, so we still have to engage in ontological modeling of both the system itself and the language in which we communicate with each other and with the AI agent.&lt;/p&gt;

&lt;p&gt;It turns out that humans are needed in AI systems much more than expected... and here we smoothly approach Augmented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human as part of the system
&lt;/h2&gt;

&lt;p&gt;As long as AI hasn't become a recognized form of life (or any form of life at all), we continue to exist collectively in human society. Knowing, or at least suspecting, AI's limitations, and &lt;em&gt;especially knowing about the exceptional potential&lt;/em&gt; of this technology, I believe it's correct to talk specifically about Augmented Intelligence as a way to effectively change our reality and ourselves for the better.&lt;/p&gt;

&lt;p&gt;More and more quality AI systems are appearing already, but all the surviving ones are either extremely specific and aimed at performing concrete tasks, or complex systems and workflow pipelines that necessarily include humans who, at minimum, service the system, and often also play the role of observer and quality controller.&lt;/p&gt;

&lt;p&gt;Moreover, these systems themselves are obviously meant to be used by end users – people, whether outside the organization as clients or inside it, if we're talking about an AI platform.&lt;/p&gt;

&lt;p&gt;I'm getting at the fact that even if we consider a scenario of a "very vertical" agent/AI system – it was still modeled, created by people, and serves people. Whatever AI is, however we consider it – it's &lt;em&gt;already&lt;/em&gt; closely linked with our large and small systems. &lt;/p&gt;

&lt;p&gt;As for systems that still manage to do "without LLMs," and might even remain so for a long time due to business specifics – here it still makes sense to talk about Augmented, because the efficiency of the people working in such organizations can (and often already does) accelerate thanks to the rational use of neural networks in their work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two scenarios of AI usage
&lt;/h2&gt;

&lt;p&gt;From the end user's perspective, AI has become that very magic box with which the user can either grow very stupid (the Artificial Intelligence mindset: the user delegates almost all slow thinking to LLMs, and their cognitive abilities degrade), or develop, learn, get smarter, and accelerate routine work with the help of LLMs: quickly getting feedback and evaluating it strictly, delegating a lot of fast thinking to the LLM so it happens &lt;em&gt;even faster&lt;/em&gt;, along with some selected part of slow thinking, while continuing to &lt;em&gt;make decisions&lt;/em&gt; (intermediate and final) and to &lt;em&gt;apply mental effort&lt;/em&gt; to the task &lt;strong&gt;independently&lt;/strong&gt; (the Augmented Intelligence mindset).&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;I think there's nothing wrong with creating more and more AI products, agents to help human agents or other AI agents.&lt;/p&gt;

&lt;p&gt;Wherever the evolution of artificial neural networks goes, I'm completely uninterested in speculating about doomsday and other "analytical" nonsense – obviously, we won't stop now, and the benefits are still outweighing the harm.&lt;/p&gt;

&lt;p&gt;Yes, there seems to be a threat of depriving many people of physical labor; well, maybe it's time to work more with our brains and create? Or maybe, on the contrary, more jobs will appear, of a completely new quality, which we will still have to adapt to in an Augmented manner?&lt;br&gt;
I don't know. In this blog I'm not going to waste my attention or yours on these reflections; there's plenty of such material, and if you're interested, you'll easily find it.&lt;/p&gt;

&lt;p&gt;As I already said, Augmented Intelligence is about the cooperation of different agents, each strengthening the other. I want to believe that within this "design pattern" we'll start building systems in which we cooperate better not only with LLMs, but also human with human.&lt;/p&gt;

&lt;p&gt;That's why in my blog I write specifically about designing such "symbiotic" systems - systems where AI strengthens humans, starting at least with the results of their work on assigned tasks.&lt;/p&gt;

&lt;p&gt;I believe the conversation about Augmented is grounded in reality, in an understanding of systems and their interconnections. It's a path of thinking toward &lt;em&gt;the conversation itself&lt;/em&gt; and toward efficiency overall. The conversation about Artificial is too ephemeral, often speculative and entertaining simply because &lt;em&gt;it happened that way&lt;/em&gt;. Focusing on artificiality means directing attention only toward the LLM, whereas in the conversation about Augmented we don't forget about ourselves, or about the other agents that will work in our system.&lt;/p&gt;

&lt;p&gt;If you're building AI systems and want them to actually work in reality and in production - &lt;a href="//links.ivanzakutnii.com"&gt;welcome&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>systemsengineering</category>
      <category>agents</category>
    </item>
    <item>
      <title>Productivity through primary metrics: how to finally understand where your time goes</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Tue, 15 Apr 2025 17:10:35 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/productivity-through-primary-metrics-how-to-finally-understand-where-your-time-goes-20lj</link>
      <guid>https://forem.com/m0n0x41d/productivity-through-primary-metrics-how-to-finally-understand-where-your-time-goes-20lj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft2iynlgvqldq21gensz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft2iynlgvqldq21gensz.jpeg" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In a world where we're constantly offered new task management apps, time management techniques, and approaches to increase productivity, many of us find that despite all these tools, our real effectiveness barely changes.&lt;/p&gt;

&lt;p&gt;We implement GTD, practice Pomodoro, use Kanban boards, read about Atomic Habits and Japanese Kaizen — but somehow we continue to feel overwhelmed and not productive enough, and often abandon these ideas altogether.&lt;/p&gt;

&lt;p&gt;The main problem is that most productivity methods operate in a vacuum — they offer systems for organizing tasks but don't address the fundamental question: &lt;strong&gt;what exactly are we spending our time on in reality?&lt;/strong&gt; Without this basic understanding, even the most advanced methods remain just theoretical constructs that can't significantly change our lives.&lt;/p&gt;

&lt;p&gt;In this article, I propose a radically different approach to productivity, based on collecting and analyzing &lt;strong&gt;primary metrics&lt;/strong&gt; — objective data about how your time is actually distributed.&lt;/p&gt;

&lt;p&gt;We'll look at the two most important resources – time and attention.&lt;/p&gt;

&lt;p&gt;By honestly applying the practice I'm suggesting, you'll be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See the real picture of your time expenditure without illusions or self-deception&lt;/li&gt;
&lt;li&gt;Identify hidden sources of time loss that you might not be aware of&lt;/li&gt;
&lt;li&gt;Create a personalized, convenient productivity system based on your preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also from this article, you'll learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why without primary metrics, any productivity system is almost certainly doomed to fail&lt;/li&gt;
&lt;li&gt;How I came to realize the importance of measuring time through personal experience&lt;/li&gt;
&lt;li&gt;A simple and flexible system for categorizing time that you can start using&lt;/li&gt;
&lt;li&gt;How to implement time tracking without turning it into an additional source of stress&lt;/li&gt;
&lt;li&gt;How to analyze the collected data and use it for real productivity improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't worry — I'm not suggesting you become obsessed with every minute of your life. Instead, I'll share a balanced and fundamental approach.&lt;/p&gt;

&lt;p&gt;But all this will only work if you're truly ready to finally see where your &lt;em&gt;precious&lt;/em&gt; time is actually leaking away.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we need primary metrics
&lt;/h2&gt;

&lt;p&gt;Primary metrics are extremely important whenever we truly strive to change a process for the better. Today, the process in question is our own life and our work on projects.&lt;/p&gt;

&lt;p&gt;But first, I'll tell you my path to the proposed &lt;em&gt;system&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For quite a long time, I used todo lists and something resembling GTD. The everyday life of a family with a small child in rented housing in a foreign country can be quite unpredictable, but todo lists have increased my productivity over the past year and a half.&lt;/p&gt;

&lt;p&gt;The more tasks I delegated to my todo list, the more computing power remained in my brain to perform work tasks more &lt;em&gt;focused&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And I started completing far more of them, so much so that I began overworking in unhealthy proportions.&lt;/p&gt;

&lt;p&gt;When I realized that overworking was a big problem, I started using time tracking to account for the time I spend &lt;em&gt;on work&lt;/em&gt;. This was a purely intuitive decision; I was tired of overworking.&lt;/p&gt;

&lt;p&gt;Although such tracking didn't become a habit right away, over the course of several months I started working a reasonable amount of time.&lt;/p&gt;

&lt;p&gt;But I didn't feel like I had more resources for my own projects. I was still responding to any new ideas like this - "when would I ever have time for that?"&lt;/p&gt;

&lt;p&gt;I spent most of my working time in focus periods, measuring them using the pomodoro method.&lt;/p&gt;

&lt;p&gt;I was doing a lot, but I wasn't doing it productively. Although all work tasks were moving forward, almost all of my projects looked as if they were standing still. I didn't have time to regularly write a blog, think through and develop my side projects. I only had time to work, create new tasks for myself, study software engineering, and do something else &lt;em&gt;in bits and pieces&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The problem turned out to be that I wasn't deliberately asking myself the question - "where am I spending my time?" - and honestly answering it.&lt;/p&gt;

&lt;p&gt;The answer came through &lt;em&gt;primary metrics&lt;/em&gt;. How do you collect these metrics?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Track time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wait. But wasn't I measuring the time I spent on work? Yes, I was, but the goal wasn't to collect metrics and understand &lt;em&gt;where, what I'm actually spending time on&lt;/em&gt;, but simply to stop overworking.&lt;/p&gt;

&lt;p&gt;And I achieved this goal, but didn't achieve an overall improvement in productivity and quality of work across all my areas.&lt;/p&gt;

&lt;p&gt;I filled all the other time (outside of work) with tasks that I managed to grab from the first or second priorities of my todo list.&lt;/p&gt;

&lt;p&gt;Everything changed when I became familiar with the idea about the importance of primary metrics, the importance of grounding any ideas and theories &lt;em&gt;in reality&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In fact, the applied methodology of productivity and work organization is not so important. It's not important in the sense that there are many such methodologies, and different methodologies are well suited to different people.&lt;/p&gt;

&lt;p&gt;But no methodology will work properly if you don't realize &lt;em&gt;what you're actually working with, what's really happening? How much work and resources are being spent? How productive is it in terms of progress toward the result?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our most valuable resource is time, and in order to take a step into a more productive future, you need to understand where you're spending this resource. Then you can choose whatever methods you like (or even develop your own) and use them effectively.&lt;/p&gt;




&lt;p&gt;Perhaps you're already scared now, and you have thoughts like this - &lt;em&gt;"it's absolute madness to track every activity by the hour! I'll go crazy, I'll just waste more time constantly recording these time intervals!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You're absolutely right – just trying to track where you spend your time without a &lt;em&gt;system&lt;/em&gt; is difficult, will cause a lot of stress, and might be the opposite of the idea of efficiency.&lt;/p&gt;

&lt;p&gt;That's why I'm writing this guide article for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time categorization system
&lt;/h2&gt;

&lt;p&gt;Today I'm offering you a categorization system that will help you collect metrics without much effort.&lt;/p&gt;

&lt;p&gt;Good news:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's not difficult; you already have everything you need to get started&lt;/li&gt;
&lt;li&gt;You don't need to track time to the minute, and you probably won't be able to. A drift of 15-30 minutes is more than acceptable to begin with&lt;/li&gt;
&lt;li&gt;I'm not suggesting you become a productivity maniac and track time for the rest of your life (this is an optional habit, and you're free to decide in the future whether you need it or not)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Bad news:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You'll have to work independently with what you discover. I'm not giving you a fish, but a fishing rod.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see - there seems to be more good news, right? :)&lt;/p&gt;

&lt;h3&gt;
  
  
  First-class abstraction
&lt;/h3&gt;

&lt;p&gt;A good abstraction in software engineering is one that &lt;em&gt;generalizes&lt;/em&gt; rather than specifies.&lt;/p&gt;

&lt;p&gt;So, to get closer to a breakthrough in productivity, I suggest you try tracking how much of your time falls into the following categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Productive time&lt;/li&gt;
&lt;li&gt;Investment time&lt;/li&gt;
&lt;li&gt;Body maintenance&lt;/li&gt;
&lt;li&gt;Everyday life and rest&lt;/li&gt;
&lt;li&gt;Time wasted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's understand what these categories are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Productive time
&lt;/h3&gt;

&lt;p&gt;Productive time is time spent on &lt;em&gt;focused&lt;/em&gt; work on specific tasks (like work tickets from a Kanban board) - tasks that bring you closer to some already outlined, planned, and specific goal. Strategic planning can also be included here, because it's aimed at moving somewhere definite: it's clear which direction to "dig," and there's an actual task like "Research X to understand how to do Z for project Y."&lt;/p&gt;

&lt;h3&gt;
  
  
  Investment time
&lt;/h3&gt;

&lt;p&gt;This is time you spend on self-education, general research work (searching for new ideas and opportunities), or on your side projects.&lt;/p&gt;

&lt;p&gt;Investment time is called that not only because learning and research are truly investments in your future, but also because in case of urgent need this time can be "sacrificed" in favor of &lt;em&gt;productive time&lt;/em&gt; to finish a project or deal with a work emergency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Body maintenance
&lt;/h3&gt;

&lt;p&gt;Sleep and food: normal meals that can be prepared and eaten in a reasonable time (quick preparation, but not fast food), not 2-3 hours. If you've decided to prepare a lavish dinner for a celebration in the middle of the week, that time belongs rather to "Everyday life and rest."&lt;/p&gt;

&lt;p&gt;In general, the body maintenance category can be omitted from time tracking altogether, especially at the beginning, or if you have other monitoring sources (smartwatch/bracelet).&lt;/p&gt;

&lt;p&gt;But we'll come back to this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Everyday life and rest
&lt;/h3&gt;

&lt;p&gt;Everyday life and rest are all activities related to maintaining the home, paperwork, taking/picking up a child from kindergarten, and other routines.&lt;/p&gt;

&lt;p&gt;Perhaps this category will be the first candidate for finding "time wasted." Well, or at least you'll have something to grab onto and start asking yourself questions like "what was I doing here for SIX hours? What kind of everyday life is that?"&lt;/p&gt;

&lt;p&gt;We should strive for a system where the everyday life and rest category is not used to plug holes in your schedule, except for emergencies like hospital visits or household chores that cannot be moved out of the middle of the workday.&lt;/p&gt;

&lt;p&gt;It's better not to mix productive and investment time with other activities because, according to Kahneman, the brain takes quite a long time to get into slow thinking - about 20 minutes! This means that if you work on a difficult intellectual task using the pomodoro method, you'll just be &lt;em&gt;warming up&lt;/em&gt; during the first pomodoro! Jumping up to run errands right after that is not very productive.&lt;/p&gt;

&lt;p&gt;You can start by tracking in this category all the time that doesn't fall under &lt;strong&gt;all other&lt;/strong&gt; categories.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Be vigilant: "all other" includes the category below!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Time wasted
&lt;/h3&gt;

&lt;p&gt;This category is for time that you "invested" in nonsense that brings you nothing (or almost nothing) useful in the long run.&lt;/p&gt;

&lt;p&gt;The assumption is not that you'll deliberately spend hours on nonsense. Rather, this category is for reassigning blocks of time (in whole or in part) from productive or investment time during which you worked inefficiently or got distracted.&lt;/p&gt;

&lt;p&gt;Of course, it's also worth considering the scenario where you're already aware that you're wasting time on nonsense, but this nonsense is so pleasant and lets you &lt;em&gt;seemingly&lt;/em&gt; rest well that you can't get rid of it yet. This is normal! No judgment!&lt;/p&gt;

&lt;p&gt;In this case, track the time you deliberately spend on nonsense, so that later, in the context of a week, you can see how much goes there. Perhaps that metric will help you finally stop.&lt;/p&gt;

&lt;p&gt;In general, it's entirely up to you to determine what counts as nonsense and what counts as everyday life/routine/rest. The topic of quality rest is a candidate for a separate article - &lt;a href="https://links.ivanzakutnii.com" rel="noopener noreferrer"&gt;subscribe&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical implementation of the system
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How to use this and what to do with it?
&lt;/h3&gt;

&lt;p&gt;The categories presented above are a starting point, a framework on which I suggest you build your own approach to measuring your real time expenditures.&lt;/p&gt;

&lt;p&gt;Here is a key point about composure and mindfulness.&lt;/p&gt;

&lt;p&gt;Measuring by itself gives almost nothing; you need a goal, and after collecting the metrics, you need to analyze them.&lt;/p&gt;

&lt;p&gt;The goal can be formulated very simply, for example - &lt;em&gt;"I want to manage to do more!"&lt;/em&gt;, or more specifically &lt;em&gt;"I want to stop spending so much time on nonsense"&lt;/em&gt;, or even &lt;em&gt;"I want to find time to start doing X"&lt;/em&gt;, where X can be anything - learning a new profession, or pursuing a hobby.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools for collecting time metrics
&lt;/h3&gt;

&lt;p&gt;Here, for example, is what a report looks like in my chosen time tracker for the week:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vfa6abna5a8ylutmi8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vfa6abna5a8ylutmi8n.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;32 hours on Saturday is certainly a lot!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let me remind you again - you don't have to be a time tracking maniac if you don't like it / don't need it / find it difficult. Look, I even have some errors here; that's normal!&lt;/p&gt;

&lt;p&gt;Maniacal tracking with overly detailed categories exhausted me in just a couple of days; eventually, I switched to general categories, even for rest.&lt;/p&gt;

&lt;p&gt;And again about composure and mindfulness: be honest with yourself when tracking Rest / Everyday life. You must ruthlessly catch yourself trying to smuggle obvious nonsense into any category other than the "nonsense category"!&lt;/p&gt;

&lt;p&gt;If that's difficult for you too, fall back on metrics that are already collected for you - for example, screen time, which, as far as I know, all modern smartphones report in a convenient format.&lt;/p&gt;

&lt;p&gt;If you see in the metrics that you spend a lot of time on nonsense - this is very important, that's exactly what we're looking for here! We're interested in how things are in reality, remember? So record everything honestly!&lt;/p&gt;

&lt;p&gt;On my graph, it's clear that Investment Time exceeds Productive time. Does this mean I only worked on tasks for 7 hours that week? No, investment time here consists almost entirely of research tasks.&lt;/p&gt;

&lt;p&gt;And of course, I couldn't have tracked all this if the process wasn't convenient.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Not an advertisement at all - I use the free &lt;code&gt;toggl&lt;/code&gt; plan. There are web, desktop, and mobile clients that sync reliably, and as you can see in the screenshot above, you can build a pretty convenient categorization by "projects".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But why so little Wasted Time? Why would I need to track time if I'm already a productivity cyborg? No! Not a cyborg!&lt;/p&gt;

&lt;p&gt;As I've already said, before I started collecting these metrics, I had no time for any projects... Time tracking didn't help me spend less time on nonsense; it helped me understand that I have problems with focus! Not focus in the sense of getting distracted by social media, but in the sense of overloading myself with too many parallel tasks! We'll talk about this further on.&lt;/p&gt;

&lt;p&gt;But for now, let's get back to tracking.&lt;/p&gt;

&lt;p&gt;The good news is that you already have everything to start tracking time. You definitely have a clock somewhere, even if you have a prehistoric button phone in your pocket.&lt;/p&gt;

&lt;p&gt;If you're an active gadget user, even better! You can choose from dozens, if not hundreds (if not thousands!) of applications for time tracking, with pomodoro, todo lists, and the like.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But don't rush! Maybe you already have too many such applications; more on this later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just choose and start! Follow only the basic outline of the framework and adapt it to yourself.&lt;/p&gt;

&lt;p&gt;Let's outline an action plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set a goal for yourself; give yourself a clear "direction" - why you're doing this at all.&lt;/li&gt;
&lt;li&gt;Figure out category names convenient for you, as long as they follow the general idea of division described above.&lt;/li&gt;
&lt;li&gt;Choose a way to start collecting metrics - an application on your phone, a paper notebook, anything!&lt;/li&gt;
&lt;li&gt;Collect metrics for at least 7 days in a row.&lt;/li&gt;
&lt;li&gt;Analyze the result.&lt;/li&gt;
&lt;li&gt;Find oddities and inconveniences, improve categories if necessary, and repeat the collection.&lt;/li&gt;
&lt;/ul&gt;
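&lt;p&gt;If you prefer plain text over tracker apps, the analysis step of this plan can be sketched in a few lines of Python. The entry format here - (category, minutes) pairs - is purely hypothetical; any export from a notebook or a tracker will do:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical week of entries as (category, minutes) pairs -
# the kind of data you could copy out of a paper notebook
# or export from a tracker.
entries = [
    ("productive", 320), ("investment", 450),
    ("everyday", 600), ("wasted", 90), ("productive", 100),
]

def weekly_report(entries):
    """Sum minutes per category and compute each category's share of the week."""
    totals = defaultdict(int)
    for category, minutes in entries:
        totals[category] += minutes
    grand_total = sum(totals.values())
    return {
        category: (minutes, round(100 * minutes / grand_total, 1))
        for category, minutes in sorted(totals.items())
    }

for category, (minutes, percent) in weekly_report(entries).items():
    print(f"{category:>12}: {minutes / 60:.1f} h ({percent}%)")
```

&lt;p&gt;Even this much is enough to see the proportions that matter: how much you &lt;em&gt;actually work&lt;/em&gt; versus how much goes to everything else.&lt;/p&gt;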

&lt;p&gt;Instead of one week, you can collect metrics for two, or a whole month. If you suddenly find it simple and natural to start tracking time constantly in one format or another - don't be scared, you're not a maniac! Go ahead!&lt;/p&gt;

&lt;p&gt;In any case, once you find a system that's comfortable for you, it will be very useful to conduct such a weekly audit of your productivity at least once a quarter, or once every six months.&lt;/p&gt;

&lt;p&gt;It's also possible that after two such measurements, you'll feel that you never need them again, you've understood everything, and you've "ramped up" into effective and mindful work throughout life. Great, if you don't want to measure anymore - don't measure!&lt;/p&gt;

&lt;p&gt;Just don't forget about this wonderful mechanism; it may come in handy when you start a new project or a new job and suddenly find yourself overly tired and buried under a mountain of endless tasks, all of which are urgent and needed "yesterday."&lt;/p&gt;

&lt;h3&gt;
  
  
  The temptation to fall into specifications
&lt;/h3&gt;

&lt;p&gt;At some point, an excellent thought may visit you - why don't I start monitoring my time in more detail?&lt;/p&gt;

&lt;p&gt;For example: I seem to be staying within the recommended categories, but I'll record how much time I spent working on project A, how much on calls for project B, how much time I ran on the treadmill, how much I worked with weights, how much time I spent preparing a salad, and how much cooking borsch.&lt;/p&gt;

&lt;p&gt;Although such detail can sometimes help - for example, I discovered that I sometimes got stuck in the shower for 20 minutes - there's still a strong risk of getting confused, drowning in excessive detail, and overcomplicating the process.&lt;/p&gt;

&lt;p&gt;I suggest sticking to the most generalized categories.&lt;/p&gt;

&lt;p&gt;Understand, you don't need to know how much total time per week you spend preparing salads. Your main task is to find out as accurately as possible how much time you &lt;em&gt;actually work&lt;/em&gt; and how much time you &lt;em&gt;actually spend on nonsense&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;From these metrics, you can delve into questions like "why so many tasks? Are they all really urgent and relevant?", or "why so many calls? Is it true that they're all productive...?", and so on.&lt;/p&gt;

&lt;p&gt;You can always drill into any level of detail later, as long as the metrics are collected at all. And it's by no means certain (rather the opposite) that tracking time separately for each work task will actually benefit you. That's madness.&lt;/p&gt;




&lt;p&gt;Depending on your main activity and lifestyle, the proposed categories may seem too general or not quite correct.&lt;/p&gt;

&lt;p&gt;It's perfectly normal to &lt;em&gt;slightly&lt;/em&gt; specify or rename them. For example, your categories might look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing blog articles (investment time)&lt;/li&gt;
&lt;li&gt;Writing code / solving work routine (productive time)&lt;/li&gt;
&lt;li&gt;Learning (investment time)&lt;/li&gt;
&lt;li&gt;Exercising (everyday life and rest)&lt;/li&gt;
&lt;li&gt;Reading books / articles / blogs (everyday life and rest)&lt;/li&gt;
&lt;li&gt;Doing nonsense (Nonsense!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Working on client projects (productive time)&lt;/li&gt;
&lt;li&gt;Studying new trends and design tools (investment time)&lt;/li&gt;
&lt;li&gt;Creating works for portfolio (investment time)&lt;/li&gt;
&lt;li&gt;Household chores and personal life (everyday life and rest)&lt;/li&gt;
&lt;li&gt;Aimless scrolling of social networks (time wasted)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that in both of these examples, I completely omitted tracking body maintenance, and that's perfectly normal.&lt;/p&gt;




&lt;p&gt;If you have a fitness bracelet or some kind of smartwatch that tracks how much you sleep, then consider that you already have a metric about sleep.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pay attention to the quality of your sleep and rest! If you don't sleep, then no tracking will save you!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Meals and their preparation will often be visible naturally in the calendar or between records of time spent, so they can also be omitted (just make sure that lunch doesn't turn into scrolling news/TikTok for another 40 minutes on top).&lt;/p&gt;

&lt;p&gt;As you can see, there's plenty of room for adaptation!&lt;/p&gt;

&lt;p&gt;The main thing is that your categories follow the division into productive and investment time, healthy routine and everyday life, and finally a category for nonsense.&lt;/p&gt;

&lt;p&gt;The Nonsense category is needed even if you're super successful and mindful and don't spend more than 30 minutes a week on nonsense. The moment you stop even considering that such a category exists, you give nonsense more chances to seep into your life.&lt;/p&gt;

&lt;h2&gt;
  
  
  From collecting metrics to action
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Analysis of the data obtained
&lt;/h3&gt;

&lt;p&gt;Analyzing the collected metrics is a slightly more complex process.&lt;/p&gt;

&lt;p&gt;When you start measuring time, you'll have the opportunity to study the proportions of your time investments across days or, even more effectively, an entire week.&lt;/p&gt;

&lt;p&gt;However, the numbers themselves won't give much — in parallel, it's necessary to develop the ability to work qualitatively and rationally &lt;em&gt;directly at the moment of performing tasks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This means increasing the level of composure and efficiency in each work session, which, combined with the analysis of time metrics, will give results.&lt;/p&gt;

&lt;p&gt;This is exactly the reason why I included mentions of TODO lists and pomodoro practice in this article in the Primary Metrics section (and other places).&lt;/p&gt;

&lt;p&gt;If you don't start developing your composure in the moment, you'll have nothing to record in wasted time.&lt;/p&gt;

&lt;p&gt;I'd like to tell you that you can neglect composure, but unfortunately, that's not the case. You need some additional mechanism that will help you determine what time was wasted and what wasn't. And the main mechanism here is your responsible attentiveness to your own life activity.&lt;/p&gt;

&lt;p&gt;Let's look at a couple of examples of how to identify time wasted in vain. It's simpler than it might sound.&lt;/p&gt;

&lt;p&gt;For example, you have a scheduled Zoom call with colleagues. The call was productive if at least 2/3 of it wasn't small talk and, by the end, it became clear what to do next. That understanding shows up as the purely subjective feeling of &lt;em&gt;"we agreed about X, now I need to do A and B"&lt;/em&gt;. The alternative is a rather vivid yet terribly vague feeling of &lt;em&gt;"What were we talking about? Why did we gather? What should I do next, huh?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you catch yourself in the second state, boldly record the time of such calls as wasted.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Right now, you probably have almost no mechanisms to influence the regularity and quality of such calls, especially if you're an individual contributor rather than an operational manager. A large number of such "vague" calls is a serious indicator of problems with processes, and probably with overall productivity in the organization. So I can only suggest that you focus on the areas where you can manage your time, and subscribe to me, as I will definitely keep writing about productivity and systems thinking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another example. Let's say you've decided that, like me, you're comfortable working on tasks using the pomodoro method to fight distractions - we turn on the timer and for 25 minutes, no TikTok!&lt;/p&gt;

&lt;p&gt;If you really didn't get distracted - excellent! The pomodoro goes into productive or investment time depending on the work you were doing. Now you can space out for five minutes, or &lt;em&gt;rest well&lt;/em&gt; :)&lt;/p&gt;

&lt;p&gt;If you did get distracted, and significantly so - say, you read emails for five minutes or more - this is definitely time wasted. I prefer to send all 25 minutes to the "trash," which is both cruel and simpler than cutting 5-8 minutes out of a pomodoro. Besides, during analysis these blocks will attract your attention and you'll be able to reflect on them better.&lt;/p&gt;
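&lt;p&gt;This whole-block rule is simple enough to state as a tiny function. The 25-minute block and the 5-minute distraction threshold are just my own conventions, so adjust them to taste:&lt;/p&gt;

```python
POMODORO_MINUTES = 25
DISTRACTION_LIMIT_MINUTES = 5  # my personal threshold, not a law

def classify_pomodoro(work_category, distracted_minutes):
    """Assign a whole pomodoro to one time category.

    If distraction reached the limit, the entire block is written
    off as wasted instead of being split up: crueler, but simpler,
    and the wasted blocks stand out later during analysis.
    """
    if distracted_minutes >= DISTRACTION_LIMIT_MINUTES:
        return "wasted"
    return work_category  # "productive" or "investment"

print(classify_pomodoro("investment", 2))  # stayed focused: counts as investment
print(classify_pomodoro("productive", 6))  # read emails mid-block: wasted
```

&lt;p&gt;The point is not the code, of course, but the fact that the rule is binary and leaves no room for bargaining with yourself.&lt;/p&gt;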

&lt;h3&gt;
  
  
  From metrics to action: task management
&lt;/h3&gt;

&lt;p&gt;Now that we've figured out how to collect time metrics, the logical question is what to do with the freed-up time and how to organize your tasks.&lt;/p&gt;

&lt;p&gt;After all, the ultimate goal of collecting metrics is not just to know where time goes, but to learn how to use it more effectively.&lt;/p&gt;

&lt;p&gt;And here lurks a terrible devil.&lt;/p&gt;

&lt;p&gt;Have you ever found yourself with 500 interesting articles you want to read and 48 projects you really want to start? While 10 of them are &lt;em&gt;already in progress&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Please remember the story I told at the beginning of the article: in the previous year, I wasted almost no time on nonsense and worked a lot. But I didn't become more productive. So, in addition to collecting metrics, you also need to learn &lt;em&gt;focus&lt;/em&gt; - concentration on the important, concrete matter at hand.&lt;/p&gt;

&lt;p&gt;There are also plenty of methods for this, but I'll simply suggest you think and be rational. Once you start collecting metrics and figure out where you're spending time, this will be especially useful if you find yourself in the same situation I was in - buried under a pile of tasks, multiple projects, and so on.&lt;/p&gt;

&lt;p&gt;First, throw away all the planning tools you're already using and keep only one. A paper planner, a todo application, Obsidian with some calendar-and-daily-tasks plugin - choose something that is convenient for you or at least seems convenient (you can always change it).&lt;/p&gt;

&lt;p&gt;But don't go crazy; for tasks, you need one tool.&lt;br&gt;
Or at worst - one for work, another for all personal projects for purely logical separation.&lt;/p&gt;

&lt;p&gt;If you've been using multiple applications, then looking a little closer, it will become clear what you &lt;em&gt;really used&lt;/em&gt; - in one of them there will be the most tasks, notes, visible activity. Great - that's it. It's &lt;em&gt;already convenient&lt;/em&gt; for you.&lt;/p&gt;

&lt;p&gt;Now clean your planner of garbage that steals your attention and time.&lt;/p&gt;

&lt;p&gt;Perhaps, like me, you started at some point to use the todo list as a repository for notes, reading lists, and so on.&lt;/p&gt;

&lt;p&gt;REMOVE EVERYTHING from your TODO list that doesn't relate to tasks you're currently working on - work and your projects, it doesn't matter. Leave only what you can't remove, something that's already somehow in progress, or is about to be taken into work and can't be canceled anymore.&lt;/p&gt;

&lt;p&gt;There are two possible scenarios here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You threw out all the garbage and everything became crystal clear; it turns out everything isn't so bad, and you only have 1-2 projects in progress, and you were only wasting time on nonsense&lt;/li&gt;
&lt;li&gt;Or, more likely: everything is terrible, you have 4-5 or more projects going simultaneously, plus a bunch of small tasks that logically don't relate to any project but eat up a lot of your resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, for those who found themselves in the second scenario - consider each of your tasks, and ask yourself a specific question:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;What is this task, and to which &lt;em&gt;project does it belong?&lt;/em&gt; Is it clear at all what it is and when it needs to be done? If not, then you forgot to throw away some reminder or note that has no place in a TODO list. Do it!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If it's still a clear task, then ask yourself what its essence is, what benefit and to whom will it bring - what will it affect in the real world? How well is it worked out? Does it really need to be dealt with &lt;em&gt;right now&lt;/em&gt;?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Congratulations - you've either identified a valuable task or found garbage again. There's still a chance you're looking at &lt;em&gt;something&lt;/em&gt; potentially useful; send it to an archive or a "watch later" list on YouTube. The main thing is to clear up the planner.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Focus – that very composure
&lt;/h3&gt;

&lt;p&gt;Okay, we've got a clearer understanding of where our time goes and how to organize tasks. But there's one more critically important element, a universal tool of productivity in any business - the ability to focus.&lt;/p&gt;

&lt;p&gt;Without this skill, even the most accurate metrics and perfectly organized task lists won't bring the desired result.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have good news for you! If you've read to this point without scrolling and in one sitting - you probably have good focus! The main thing is to clear out the garbage and collect metrics :)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Multitasking is a myth. It's not the path to success; it's the path to nowhere.&lt;/p&gt;

&lt;p&gt;You need to learn to concentrate ON ONE thing, at every moment of time and on any scale.&lt;/p&gt;

&lt;p&gt;Note that throughout this article, I've repeatedly mentioned the word composure. Your quality and composed attention is insanely important always, at any stage when we start to live purposefully, not just somehow.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Without composure, everything will always be "just somehow"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This may sound too obvious, but the longer you concentrate on one thing, the more results you'll achieve.&lt;/p&gt;

&lt;p&gt;I'm not suggesting you necessarily concentrate on one thing for your entire life, though that's also an option. Imagine you choose knitting scarves and hats as your life's work. Every day you learn to knit, becoming better and better.&lt;/p&gt;

&lt;p&gt;But that doesn't mean you'll only knit.&lt;/p&gt;

&lt;p&gt;For example, at some point you might knit for family and friends, and then it comes to mind to start earning from this - putting your work up for sale.&lt;/p&gt;

&lt;p&gt;And here you seem to have one goal and one business, but now "side quests" appear like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Learn marketing&lt;/li&gt;
&lt;li&gt;Develop a personal brand&lt;/li&gt;
&lt;li&gt;Set up social media&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And so on, but globally - the goal is one.&lt;/p&gt;

&lt;p&gt;For this reason, many people jump from one thing to another and can't find themselves: they never set any of their undertakings as a goal, never planned to thoroughly try it and understand whether they need it or not, but started doing something on an emotional impulse inspired by others' opinions.&lt;/p&gt;

&lt;p&gt;The point is that concrete goal setting is also a skill from the category of composure.&lt;/p&gt;

&lt;p&gt;And still, if you've chosen something more complex than knitting - say, an engineering career - there can be many sub-goals, each of which you could spend your whole life on.&lt;/p&gt;

&lt;p&gt;That's why it's especially important to focus on one thing.&lt;/p&gt;

&lt;p&gt;Your brain alone definitely won't be enough for this. That's why I'm telling this whole story about metrics: we need to be constantly reinforced by breadcrumbs &lt;em&gt;from reality&lt;/em&gt;, so we don't forget what we're doing and why, where we're going and why, and who we are in general.&lt;/p&gt;

&lt;p&gt;If you don't know at all what you need, regardless of how old you are, just start consciously choosing things and trying them, until you get bored, until you realize you don't get satisfaction from what you're doing.&lt;/p&gt;

&lt;p&gt;Just try, the main thing is to start doing at least something!&lt;/p&gt;

&lt;p&gt;The secret is that by focusing on something for a week, two, or a month, you'll still learn something useful - it won't be wasted. At the very least, you'll practice composure! :)&lt;/p&gt;

&lt;p&gt;The main thing: stop wasting your life on complete nonsense and scattering yourself thin!&lt;/p&gt;

&lt;p&gt;Okay, it's time to finish this lyrical digression.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it's so hard for us to concentrate
&lt;/h3&gt;

&lt;p&gt;The modern world creates unprecedented conditions in which it's increasingly difficult for our brain to maintain concentration.&lt;/p&gt;

&lt;p&gt;First, we've evolved so that we can effectively process a limited amount of information. But modern technologies create a data flow that causes terrible cognitive overload, making it difficult to process and remember important information.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;That's why your attention is a truly invaluable resource, second in value only to time. And it should be treated precisely as a resource, for a very simple reason: attention and time are very closely linked!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Second, multitasking is a disgusting myth. The brain cannot process multiple tasks simultaneously; instead, it rapidly switches between them, which requires significant energy and reduces efficiency. With each such switch, we not only lose productivity but also increase the probability of errors.&lt;/p&gt;

&lt;p&gt;Third, the problem with dopamine. Notifications and social networks exploit the brain's dopamine reward system; this is also no secret. These things are &lt;em&gt;literally&lt;/em&gt; created in such a way as to rob you - &lt;em&gt;steal your attention!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each new notification causes a small surge of dopamine, which creates an addiction to constant small "rewards" and weans us off the long concentration necessary for deep work. The endless feed is such a complete mess that I don't even want to talk about it.&lt;/p&gt;

&lt;p&gt;The constant flow of information leads to brain burnout. Symptoms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deficit of composure/attention - it's very difficult to concentrate on one task&lt;/li&gt;
&lt;li&gt;Decision-making paralysis (even for simple decisions)&lt;/li&gt;
&lt;li&gt;Dependence on technology for &lt;em&gt;basic&lt;/em&gt; cognitive tasks - for example, not even trying to add 17+8 mentally and immediately reaching for the phone calculator&lt;/li&gt;
&lt;li&gt;Chronic stress and anxiety&lt;/li&gt;
&lt;li&gt;Sleep disturbance - you didn't drink coffee and you're physically dog-tired, yet it's very hard to fall asleep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recognizing all these signals and problems is the first step to restoring control over your attention (and life!)&lt;/p&gt;

&lt;p&gt;The good news is that time tracking and developing awareness about where your concentration is really going can help with this trouble.&lt;/p&gt;

&lt;p&gt;There are other useful practices for attention training, but this article is already too long, so I'll cover them another time.&lt;/p&gt;

&lt;p&gt;And besides, I sincerely believe that the grounding in metrics we're considering today can give you enough impetus to start solving this problem yourself (if it exists). Awareness, knowing about the catastrophe, won't let you keep your eyes closed any longer.&lt;/p&gt;

&lt;p&gt;I hope you haven't chickened out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've examined three keys to real productivity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collecting primary time metrics through a simple categorization system;&lt;/li&gt;
&lt;li&gt;Effectively organizing tasks in a single space;&lt;/li&gt;
&lt;li&gt;Developing the ability to focus on one thing until it's completed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Collect metrics! I believe you can do it!&lt;/p&gt;

&lt;p&gt;This article turned out very long, and perhaps a bit chaotic in places, because the topic is simultaneously very important and very extensive. And too... personal! I experienced a qualitative leap in productivity only when I discovered my specific problem: task overload.&lt;/p&gt;

&lt;p&gt;Without time tracking, I wouldn't have seen what was sitting in the most visible place. So if I could make you take away just one thought from this article, it would be not the time categorization, but the grounding in reality.&lt;/p&gt;

&lt;p&gt;Please, don't live exclusively in your head!&lt;/p&gt;

&lt;p&gt;Subscribe to my social networks; I write about systems engineering, neural networks (human and non-living), software engineering: &lt;a href="https://links.ivanzakutnii.com" rel="noopener noreferrer"&gt;https://links.ivanzakutnii.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Embedding Design Into Code</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Wed, 16 Oct 2024 07:45:41 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/embedding-design-into-code-n4e</link>
      <guid>https://forem.com/m0n0x41d/embedding-design-into-code-n4e</guid>
      <description>&lt;p&gt;Over-engineering can probably happen at almost any stage of design (or lack of design) and actual software development.&lt;/p&gt;

&lt;p&gt;For example, when we’re implementing an entity and its methods, it’s far from guaranteed that all these methods will be small and elegant. And what’s more — they don’t always have to be.&lt;/p&gt;

&lt;p&gt;Yet, when I see a function or method of 30-50 lines on my editor screen, I automatically ask myself: what can I factor out here?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Problems with Blind Faith in SRP&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In a method of this size, there are always smaller logical pieces (SRP), and they are always easy to spot, especially if the code was just written and we’re still in the context.&lt;/p&gt;

&lt;p&gt;But should we always start breaking the method right away? Will we always benefit from this?&lt;/p&gt;

&lt;p&gt;Well, if you manage to do such factoring quickly and elegantly, then yes, you’re definitely drawing logical boundaries in your code, but you’re also adding a certain level of indirection.&lt;/p&gt;

&lt;p&gt;Blindly following SRP everywhere can lead to a lot of methods.&lt;/p&gt;

&lt;p&gt;Even if we try to arrange them in some &lt;em&gt;“logical”&lt;/em&gt; and readable order, in the future, we (or other developers — pretty much the same thing) will likely have to scroll the code editor back and forth to figure out and remember what our methods do and when, following all the added indirection.&lt;/p&gt;

&lt;p&gt;Of course, if we don’t do this factoring, we naturally end up in the opposite situation — keeping all our code in one method, we don’t get clearly defined logical boundaries but avoid an explosion of methods and nested calls.&lt;/p&gt;

&lt;p&gt;To be more precise — we don’t get &lt;em&gt;as clearly defined&lt;/em&gt; boundaries that SRP creates when logic is localized. This does not mean that we can’t express them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Embedded Design&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I once wrote that note called &lt;a href="https://ivanzakutnii.com/blog/Bullshit-of-self-documenting-code" rel="noopener noreferrer"&gt;"Bullshit Of Self-Documenting Code"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Sure, when all our methods and arguments are properly named, and we even follow popular style guides, the code becomes much more readable and better than some randomly obfuscated code full of single-letter identifiers and one-liners.&lt;/p&gt;

&lt;p&gt;And yet, by using comments or language syntax features, we can mark logical boundaries in a method without splitting it according to SRP.&lt;/p&gt;

&lt;p&gt;It’s simple, and it’s pretty cool.&lt;br&gt;
Especially in C-like languages with curly braces, where braces help group code within a function.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, it’s often much clearer than any “self-documenting” tricks, and it allows you to express more formality.&lt;/li&gt;
&lt;li&gt;Second, almost all IDEs allow you to collapse such blocks of code marked by braces.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On top of that, these blocks can (and likely should) be marked with informative and expressive comments — awesome!&lt;/p&gt;

&lt;p&gt;Using such an &lt;em&gt;embedded design&lt;/em&gt;, we not only avoid blurring the logic across dozens of methods, but we also maintain a higher level of abstraction in our code, making it &lt;em&gt;observable&lt;/em&gt;, as pieces of code alone do not reflect the design well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We are literally embedding concepts from the “Programming in the Large” level into “Programming in the Small” world. These are cognitive shortcuts you’ll thank yourself for later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Moreover, by doing this work in advance, we prepare everything for &lt;em&gt;“physical”&lt;/em&gt; factoring. If we ever need it, doing so will be super easy because everything is already done, and we don’t need to think about how to do it!&lt;/p&gt;

&lt;p&gt;Of course, embedded design shouldn’t be used as an excuse for boxing everything into one class and its methods. That’s criminal negligence, not embedded design.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Applicability in Non-C-like Languages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What should we do with Python? There are no curly braces here, and if you suggest to your colleagues that the team should start writing in &lt;a href="https://github.com/mathialo/bython" rel="noopener noreferrer"&gt;Bython&lt;/a&gt;, they might call &lt;em&gt;911&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;At first glance, it seems we don’t have many options for applying embedded design in Python projects. We can either write clear comments for code blocks or declare logically related blocks as separate functions within a method and call them in the desired order at the end of the method. Okay.&lt;/p&gt;

&lt;p&gt;I’d like to say the first option is better, but the second isn’t bad either. It’s probably just a matter of style and personal preference.&lt;/p&gt;
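&lt;p&gt;A minimal sketch of both options (the function, its arguments, and the validation rule are made up purely for illustration):&lt;/p&gt;

```python
def register_user(raw_name: str, raw_email: str) -> dict:
    # Option 1: mark the logical blocks with expressive comments.

    # --- Normalize input ---
    name = raw_name.strip().title()
    email = raw_email.strip().lower()

    # --- Validate ---
    if "@" not in email:
        raise ValueError("Invalid email")

    # --- Build the user record ---
    return {"name": name, "email": email}


def register_user_v2(raw_name: str, raw_email: str) -> dict:
    # Option 2: declare the blocks as inner functions
    # and call them in the desired order at the end.
    def normalize() -> tuple:
        return raw_name.strip().title(), raw_email.strip().lower()

    def validate(email: str) -> None:
        if "@" not in email:
            raise ValueError("Invalid email")

    name, email = normalize()
    validate(email)
    return {"name": name, "email": email}
```

&lt;p&gt;Both versions keep the whole flow in one place while still making each logical boundary easy to scan or collapse.&lt;/p&gt;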

&lt;p&gt;Python is still a programming language, and even though it "lacks" curly braces, we’re still writing the program text in &lt;em&gt;files&lt;/em&gt;, and in most cases, we’re writing OOP code. We can fully follow the idea of embedded design.&lt;/p&gt;

&lt;p&gt;It can be applied not only within the scope of a specific method but also at the level of organizing functions in files or methods in a class.&lt;/p&gt;

&lt;p&gt;The essence is the same — we divide functions or methods into logical clusters, and those clusters are separated by expressive but &lt;em&gt;informative&lt;/em&gt; comments.&lt;/p&gt;

&lt;p&gt;However, at this level, it seems that not every such logical cluster can adequately reflect the designed business processes.&lt;/p&gt;

&lt;p&gt;Nevertheless, we still get the same advantages upfront — our classes or files, following the idea of embedded design, are not only cognitively friendly for our future selves or new colleagues, but they are also ready for further factoring, and maybe even refactoring transformations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Obviously, both approaches — SRP-based factoring and embedded design — have their pros and cons. But what’s more important, these approaches are &lt;em&gt;qualitatively different&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Of course, they’re not fully mutually exclusive, and choosing when to use each one depends entirely on the context (both large and small) and your preferences.&lt;/p&gt;

&lt;p&gt;Hopefully, you prefer cognitive efficiency over mental suffering.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>design</category>
    </item>
    <item>
      <title>The Wisdom of Avoiding Conditional Statements</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Tue, 13 Aug 2024 15:22:54 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/the-wisdom-of-avoiding-conditional-statements-22a7</link>
      <guid>https://forem.com/m0n0x41d/the-wisdom-of-avoiding-conditional-statements-22a7</guid>
      <description>&lt;p&gt;Cyclomatic complexity is a metric that measures the complexity and tangledness of code.&lt;/p&gt;

&lt;p&gt;High cyclomatic complexity is not a good thing, quite the opposite.&lt;/p&gt;

&lt;p&gt;Simply put, cyclomatic complexity is directly proportional to the number of possible execution paths in a program. In other words, cyclomatic complexity and the total number of conditional statements (especially their nesting) are closely related.&lt;/p&gt;
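&lt;p&gt;To make the "directly proportional" part concrete, here is a toy sketch (the function and its flags are invented): each independent &lt;code&gt;if&lt;/code&gt; roughly doubles the number of execution paths.&lt;/p&gt;

```python
def ship_order(paid: bool, in_stock: bool, express: bool) -> str:
    """Three independent conditions -> 2**3 = 8 execution paths to test."""
    status = []
    if paid:        # path split #1
        status.append("paid")
    if in_stock:    # path split #2
        status.append("reserved")
    if express:     # path split #3
        status.append("express")
    return ", ".join(status) or "pending"
```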

&lt;p&gt;So today, let’s talk about conditional statements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-if
&lt;/h3&gt;

&lt;p&gt;In 2007, &lt;strong&gt;Francesco Cirillo&lt;/strong&gt; launched a movement called &lt;a href="https://www.antiifprogramming.com/about-the-anti-if.php" rel="noopener noreferrer"&gt;Anti-if&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Francesco Cirillo&lt;/strong&gt; is the guy who came up with the Pomodoro technique. I'm writing this blog post right now “under the Pomodoro.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I guess we all quickly figured out what this campaign is about from its name. Interestingly, the movement has quite a few computer scientists among its followers.&lt;/p&gt;

&lt;p&gt;Their arguments are rock solid — &lt;code&gt;if&lt;/code&gt; statements are evil, leading to exponential growth in program execution paths.&lt;/p&gt;

&lt;p&gt;In short, that’s cyclomatic complexity. The higher it is, the harder it is not only to read and understand the code but also to cover it with tests.&lt;/p&gt;

&lt;p&gt;Sure, we have kind of an "opposite" metric — &lt;code&gt;code coverage&lt;/code&gt;, which shows how much of your code is covered by tests. But does this metric, along with the rich tools in our programming languages for checking coverage, justify ignoring cyclomatic complexity and sprinkling &lt;code&gt;if&lt;/code&gt; statements around just based on "instinct"?&lt;/p&gt;

&lt;p&gt;I think not.&lt;/p&gt;




&lt;p&gt;Almost every time I catch myself about to nest one &lt;code&gt;if&lt;/code&gt; inside another, I realize that I'm doing something really silly that could be rewritten differently — either without nested &lt;code&gt;if&lt;/code&gt;'s or without &lt;code&gt;if&lt;/code&gt;'s at all.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You did notice the word "almost," right?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I didn't start noticing this right away. If you look at my GitHub, you'll find more than one example of old code with not just high cyclomatic complexity but straight-up cyclomatic madness.&lt;/p&gt;

&lt;p&gt;What helped me become more aware of this issue? Probably experience and a few smart things I learned and embraced about a year ago. That's what I want to share with you today.&lt;/p&gt;




&lt;h3&gt;
  
  
  Two Sacred Techniques for the Destruction of &lt;code&gt;if&lt;/code&gt; Statements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Padawan, move each conditional check of an unknown value to a place where that value is already known.&lt;/li&gt;
&lt;li&gt;Padawan, change your mental model of the encoded logic so that it no longer requires conditional checks.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  1. Make Unknown Known
&lt;/h4&gt;

&lt;p&gt;Checking something when we don't "know" it yet is probably the most common source of using conditional statements based on "instinct."&lt;/p&gt;

&lt;p&gt;For example, suppose we need to do something based on a user's age, and we must ensure the age is valid (falls within reasonable ranges). We might end up with code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age cannot be null&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age must be between 0 and 150&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've all seen and probably written similar code hundreds of times.&lt;/p&gt;

&lt;p&gt;How do we eliminate these conditional checks by following the discussed &lt;em&gt;meta-principle&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;In our specific case with age, we can apply my favorite approach — moving away from primitive obsession towards using a custom data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age must be between 0 and 150&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Age is guaranteed to be valid, process it directly
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hooray, one less &lt;code&gt;if&lt;/code&gt;! The validation and verification of the age are now always "where the age is known" — within the responsibility and scope of a separate class.&lt;/p&gt;

&lt;p&gt;We can go further/differently if we want to remove the &lt;code&gt;if&lt;/code&gt; in the &lt;code&gt;Age&lt;/code&gt; class, perhaps by using a Pydantic model with a validator or even replacing &lt;code&gt;if&lt;/code&gt; with &lt;code&gt;assert&lt;/code&gt; — it doesn't matter now.&lt;/p&gt;
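&lt;p&gt;For instance, the &lt;code&gt;assert&lt;/code&gt; variant could be sketched with a standard-library &lt;code&gt;dataclass&lt;/code&gt; instead of Pydantic (just keep in mind that asserts are stripped when Python runs with &lt;code&gt;-O&lt;/code&gt;):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Age:
    value: int

    def __post_init__(self) -> None:
        # The single check still lives where the age becomes "known".
        assert 0 <= self.value <= 150, "Age must be between 0 and 150"
```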




&lt;p&gt;Other techniques or mechanisms that help to get rid of conditional checks within this same meta-idea include approaches like replacing conditions with polymorphism (or anonymous lambda functions) and decomposing functions that have sneaky boolean flags.&lt;/p&gt;

&lt;p&gt;For example, this code (horrible boxing, right?):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credit_card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_credit_card_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paypal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_paypal_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bank_transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_bank_transfer_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown payment type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_credit_card_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed credit card payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_paypal_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed PayPal payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_bank_transfer_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed bank transfer payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;And it doesn't matter if you replace &lt;code&gt;if/elif&lt;/code&gt; with &lt;code&gt;match/case&lt;/code&gt; — it's the same garbage!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's quite easy to rewrite it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreditCardPaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed credit card payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PayPalPaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed PayPal payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BankTransferPaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed bank transfer payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;right?&lt;/p&gt;




&lt;p&gt;The example of decomposing a function with a boolean flag into two separate functions is as old as time, painfully familiar, and incredibly annoying (in my honest opinion).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="n"&gt;is_internal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_internal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Process internal transaction
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Process external transaction
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two functions will be much better in any case, even if 2/3 of the code in them is identical! This is one of those scenarios where a trade-off with DRY is the result of common sense, making the code just better.&lt;/p&gt;
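&lt;p&gt;For completeness, that decomposition could look like this (the bodies are stubs; the names just mirror the flag example above):&lt;/p&gt;

```python
def process_internal_transaction(transaction_id: int, amount: float) -> None:
    # Steps specific to internal transactions go here; some code
    # may be duplicated with the external variant, on purpose.
    pass

def process_external_transaction(transaction_id: int, amount: float) -> None:
    # Steps specific to external transactions go here.
    pass

# Call sites now state the intent directly; no boolean flag to decode.
```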




&lt;p&gt;The big difference here is that mechanically, on autopilot, we are unlikely to use these approaches unless we've internalized and developed &lt;em&gt;the habit&lt;/em&gt; of thinking through the lens of this principle.&lt;/p&gt;

&lt;p&gt;Otherwise, we'll automatically fall into &lt;code&gt;if: if: elif: if...&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Free Your Mind, Neo
&lt;/h4&gt;

&lt;p&gt;In fact, the second technique is the only real one, and the earlier "first" technique is just a preparatory practice, a shortcut for getting &lt;em&gt;in place&lt;/em&gt; :)&lt;/p&gt;

&lt;p&gt;Indeed, the only ultimate way, method — call it what you will — to achieve simpler code, reduce cyclomatic complexity, and cut down on conditional checks is &lt;em&gt;making a shift in the mental models we build in our minds to solve specific problems&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I promise, one last silly example for today.&lt;/p&gt;

&lt;p&gt;Consider that we're urgently writing a backend for some online store where a user can make purchases with or without registration.&lt;/p&gt;

&lt;p&gt;Of course, the system has a &lt;code&gt;User&lt;/code&gt; class/entity, and it's easy to end up with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Process order for a registered user
&lt;/span&gt;       &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Process order for a guest user
&lt;/span&gt;       &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But noticing this nonsense, thanks to the fact that our thinking has already shifted in the right direction (I believe), we'll go back to where the &lt;code&gt;User&lt;/code&gt; class is defined and rewrite that part of the code into something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GuestUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Guest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;So, the essence and beauty of it all is that we don't clutter our minds with various patterns and coding techniques to eliminate conditional statements and so on.&lt;/p&gt;

&lt;p&gt;By shifting our focus to the meta-level, to a higher level of abstraction than just &lt;a href="https://ivanzakutnii.com/blog/Levels-of-reasoning-about-software" rel="noopener noreferrer"&gt;the level of reasoning about lines of code&lt;/a&gt;, and following the idea we've discussed today, the right way to eliminate conditional checks, and more correct code in general, will &lt;em&gt;naturally emerge&lt;/em&gt;.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;A lot of conditional checks in our code arise from the cursed None/Null leaking into our code, so it's worth mentioning the quite popular &lt;a href="https://en.wikipedia.org/wiki/Null_object_pattern" rel="noopener noreferrer"&gt;Null Object pattern&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
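&lt;p&gt;As a minimal sketch of that pattern (a toy example of my own, not taken from any particular codebase): instead of passing &lt;code&gt;None&lt;/code&gt; and checking for it everywhere, we pass an object whose methods deliberately do nothing.&lt;/p&gt;

```python
# Null Object pattern, minimal sketch: NullLogger satisfies the same
# contract as Logger, so callers never need an "is not None" check.

class Logger:
    def log(self, message):
        print(message)

class NullLogger(Logger):
    def log(self, message):
        pass  # deliberately does nothing

def process(data, logger):
    # No "if logger is not None" branch needed here.
    logger.log("processing " + data)
    return data.upper()

print(process("order-42", NullLogger()))  # prints "ORDER-42"
```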

&lt;h3&gt;
  
  
  Clinging to Words, Not Meaning
&lt;/h3&gt;

&lt;p&gt;When following Anti-if, you can go down the wrong path by clinging to words rather than meaning, blindly following the idea that "if is bad, if must be removed."&lt;/p&gt;

&lt;p&gt;Since conditional statements are &lt;em&gt;semantic&lt;/em&gt; rather than &lt;em&gt;syntactic&lt;/em&gt; elements, there are countless ways to remove the &lt;code&gt;if&lt;/code&gt; token from your code &lt;em&gt;without changing the underlying logic&lt;/em&gt; in our beloved programming languages.&lt;/p&gt;

&lt;p&gt;Replacing an &lt;code&gt;elif&lt;/code&gt; chain in Python with a &lt;code&gt;match/case&lt;/code&gt; isn’t what I’m talking about here.&lt;/p&gt;

&lt;p&gt;Logical conditions stem from the mental “model” of the system, and there’s no &lt;em&gt;universal&lt;/em&gt; way to "just remove" conditionals entirely.&lt;/p&gt;

&lt;p&gt;In other words, cyclomatic complexity and overall code complexity aren’t tied to the physical representation of the code — the letters and symbols written in a file.&lt;/p&gt;

&lt;p&gt;The complexity comes from the &lt;em&gt;formal expression&lt;/em&gt;, the &lt;em&gt;verbal or textual explanation&lt;/em&gt; of why and how specific code works.&lt;/p&gt;

&lt;p&gt;So if we change something in the code and there are fewer if statements, or none at all, but the verbal explanation of the same code remains the same, &lt;em&gt;all we’ve done is change the representation of the code&lt;/em&gt;, and the change itself doesn’t really mean anything or make any improvement.&lt;/p&gt;

</description>
      <category>softwaredesign</category>
      <category>softwaredevelopment</category>
      <category>python</category>
    </item>
    <item>
      <title>What the heck is homomorphism?</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Sun, 19 May 2024 19:15:54 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/what-the-heck-is-homomorphism-ilc</link>
      <guid>https://forem.com/m0n0x41d/what-the-heck-is-homomorphism-ilc</guid>
      <description>&lt;p&gt;Hello. Continuing my dive into the nature of abstractions in software engineering, I've reached yet another rabbit hole.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Yes, rabbit holes turn out to be recursive in nature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, &lt;a href="https://ivanzakutnii.com/blog/What-is-Abstraction" rel="noopener noreferrer"&gt;last time&lt;/a&gt; we found out that the most accurate definition of abstraction seems to be the words of Edsger Dijkstra:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Being abstract is something &lt;em&gt;profoundly&lt;/em&gt; different from being vague... The purpose of abstraction is not to be vague, but to create a &lt;em&gt;new semantic level&lt;/em&gt; in which one can be absolutely precise.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thanks to &lt;a href="https://www.jameskoppel.com/" rel="noopener noreferrer"&gt;Jimmy Koppel&lt;/a&gt;, it turns out that Professor &lt;a href="https://en.wikipedia.org/wiki/Gerald_Jay_Sussman" rel="noopener noreferrer"&gt;Gerald Jay Sussman&lt;/a&gt; rightly considers abstraction too broad a "suitcase term" that denotes too many things. However, he sees two related definitions that apply to software engineering.&lt;/p&gt;

&lt;p&gt;Definition 1 - "giving names to things produced by the second definition";&lt;br&gt;
Definition 2 - Well, it is tough stuff, following from the &lt;a href="https://en.wikipedia.org/wiki/Fundamental_theorem_on_homomorphisms" rel="noopener noreferrer"&gt;Fundamental Theorem of Homomorphisms&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As it is tough, and I promised to do my best in writing as simply as possible... I will try, but sorry, I will most likely fail again :)&lt;/p&gt;

&lt;p&gt;But it is &lt;strong&gt;so&lt;/strong&gt; interesting!&lt;/p&gt;

&lt;h2&gt;
  
  
  Fundamental Theorem on Homomorphisms for Dummies
&lt;/h2&gt;

&lt;p&gt;Okay, take a look at this beautifully cryptic diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne0jfudc9dzs5kkzlv39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne0jfudc9dzs5kkzlv39.png" alt="fundamental-homomorphism" width="586" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's break it down.&lt;/p&gt;

&lt;p&gt;Imagine you have a collection of shapes: circles, squares, and triangles.&lt;/p&gt;

&lt;p&gt;Each shape can be either red, blue, or green.&lt;/p&gt;

&lt;p&gt;Let's define an operation called "change color" which, obviously, changes the color of any of your shapes to another one.&lt;/p&gt;

&lt;p&gt;So, &lt;code&gt;G&lt;/code&gt; represents all collections of our shapes of different colors.&lt;br&gt;
And &lt;code&gt;f&lt;/code&gt; is our "change color" operation.&lt;/p&gt;

&lt;p&gt;Potentially, we can have sets of shapes in &lt;code&gt;G&lt;/code&gt;, for which &lt;code&gt;f&lt;/code&gt; will do the &lt;em&gt;same transformation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For example, it can turn out that &lt;code&gt;f&lt;/code&gt; changes the color to a specific one based on the shape's form, so we will have at least three groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Circles -&amp;gt; Change color to green;&lt;/li&gt;
&lt;li&gt;Squares -&amp;gt; Change color to blue;&lt;/li&gt;
&lt;li&gt;Triangles -&amp;gt; Change color to red.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;At least&lt;/em&gt;, because we can already have green circles and so on, in which case &lt;code&gt;f&lt;/code&gt; will not change anything, but let's not take that into account now.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We collect each of our groups into its own list, and we obtain the so-called &lt;em&gt;kernel&lt;/em&gt; &lt;code&gt;ϕ&lt;/code&gt;, which is a list of our "group lists."&lt;/p&gt;

&lt;p&gt;So, each list in our kernel consists of shapes that are "alike" under the &lt;code&gt;f&lt;/code&gt; operation, meaning this operation treats those shapes &lt;em&gt;the same way&lt;/em&gt;.&lt;br&gt;
The thing is that we consider &lt;em&gt;each such nested list&lt;/em&gt; to be a &lt;em&gt;representation&lt;/em&gt; of a notional f-shape.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In some sense, we can call such representation "homotopy type." But, of course, "homotopy type" is not a standard term of "Homotopy Type Theory." We are just making analogies here.&lt;/p&gt;
&lt;/blockquote&gt;
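&lt;p&gt;The grouping can be sketched in a few lines of Python (the shapes and the concrete color rules here are invented purely for illustration):&lt;/p&gt;

```python
# A toy sketch of the grouping described above.
# f maps each shape to a target color based on its form alone.

def f(shape):
    form, _color = shape
    return {"circle": "green", "square": "blue", "triangle": "red"}[form]

G = [("circle", "red"), ("circle", "blue"), ("square", "red"), ("triangle", "green")]

# Shapes that f treats the same way fall into the same "group list";
# the list of these group lists plays the role of the kernel.
kernel = {}
for shape in G:
    kernel.setdefault(f(shape), []).append(shape)

# Each key of `kernel` stands for one element of the quotient set G/K.
print(sorted(kernel.keys()))  # prints ['blue', 'green', 'red']
```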

&lt;p&gt;And here we get another thing from our diagram - the &lt;code&gt;Quotient Set&lt;/code&gt;, which is represented on it as &lt;code&gt;G/K&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;G/K&lt;/code&gt; represents a &lt;em&gt;generalization&lt;/em&gt; of &lt;code&gt;G&lt;/code&gt; by the &lt;code&gt;f&lt;/code&gt; operation.&lt;/p&gt;

&lt;p&gt;And finally, after applying our &lt;code&gt;f&lt;/code&gt; operation, all shapes from our groups changed their colors, so we get a new set.&lt;br&gt;
This new set, denoted as &lt;code&gt;H&lt;/code&gt;, is the set of all shapes turned to another color by the &lt;code&gt;f&lt;/code&gt; operation.&lt;/p&gt;

&lt;p&gt;Now, putting it all together, the fundamental homomorphism theorem states that &lt;code&gt;G/K&lt;/code&gt; behaves the same as &lt;code&gt;H&lt;/code&gt;: a set of different representations of &lt;em&gt;shapes&lt;/em&gt; can be &lt;em&gt;manipulated in the same way&lt;/em&gt; as a single specific shape.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;G/K&lt;/code&gt; is isomorphic to &lt;code&gt;H&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We know this thing as &lt;em&gt;duck typing&lt;/em&gt;: if something looks like a duck, swims like a duck, and quacks like a duck, then we consider it to be a duck, so we can manage it with just one operation: "looks_like," "swims," or "quacks."&lt;/p&gt;
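&lt;p&gt;In Python terms, that looks something like this (a toy example of mine):&lt;/p&gt;

```python
# Duck typing: two unrelated classes are interchangeable
# under the single operation "quacks".

class Duck:
    def quacks(self):
        return "quack"

class Robot:
    def quacks(self):
        return "beep-quack"

def make_it_quack(thing):
    # Every "representation" is treated the same way under one operation.
    return thing.quacks()

print([make_it_quack(x) for x in [Duck(), Robot()]])
```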




&lt;p&gt;The interesting thing is that Professor Sussman emphasizes &lt;code&gt;G/K&lt;/code&gt;, not the whole theorem as such.&lt;br&gt;
From his point of view, we get another beautiful definition of Abstraction, which might be formulated like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Abstraction is the principle or scheme of uniting different &lt;em&gt;representations&lt;/em&gt; that behave the same way under a given operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This definition seems to differ from the dozens of other definitions of Abstraction, yet it is indeed &lt;em&gt;compatible&lt;/em&gt; with almost any of them.&lt;/p&gt;

&lt;p&gt;Because this idea is &lt;strong&gt;really powerful&lt;/strong&gt; - it allows us to consolidate all the complex relationships between various implementations and their common abstract domain into a single &lt;em&gt;"reflection."&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In fact, &lt;code&gt;ϕ&lt;/code&gt; is the "reflection" of abstraction, whereas &lt;code&gt;G&lt;/code&gt; and &lt;code&gt;G/K&lt;/code&gt; are concrete and abstract subject areas, respectively.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to live with it
&lt;/h2&gt;

&lt;p&gt;This potential link between "software-ish" abstractions and Abstract Algebra with Type Theories feels really sweet, but does it give us any evident and applicable mechanism which can help us to improve our mostly boring daily coding?&lt;/p&gt;

&lt;p&gt;Well, &lt;em&gt;"evident"&lt;/em&gt; - no. Applicable - I do believe &lt;em&gt;yes&lt;/em&gt;. Let me explain.&lt;/p&gt;

&lt;p&gt;Abstract interpretation does not give us any sharp tools to fight &lt;em&gt;inaccuracy&lt;/em&gt;.&lt;br&gt;
In the previous post, I found that any correct formal definition of abstraction (in a concrete subject) essentially always leads to the same conclusion: "&lt;em&gt;this&lt;/em&gt; thing could be just &lt;strong&gt;anything&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;Not very precise, and when implementing something we want to be as accurate as possible.&lt;/p&gt;

&lt;p&gt;And still, keeping in mind such sane definitions of Abstraction, one of which we have just discovered, gives us the possibility to look at the subject area from a completely different point of view.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Conversely, not having such &lt;em&gt;knowledge&lt;/em&gt; leaves us more "blind," forcing us to move through the &lt;em&gt;space&lt;/em&gt; of some domain in almost complete &lt;em&gt;darkness&lt;/em&gt;. Yes, I do believe it is a good metaphor for our case.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Just try to imagine how many homomorphisms we might end up with by the end of the day, and how diverse they could be even for the same &lt;code&gt;G&lt;/code&gt; if we design our system and its operations carelessly.&lt;/p&gt;

&lt;p&gt;Do you feel it? This is literally the source of nightmares in poorly designed systems.&lt;/p&gt;

&lt;p&gt;Will thinking only in terms of methods and patterns, such as those we have from the GoF, help with this?&lt;/p&gt;

&lt;p&gt;I am not quite sure, because it feels that while we keep talking about patterns and other mainstream concepts, we are losing something really important.&lt;/p&gt;

&lt;h2&gt;
  
  
  The value
&lt;/h2&gt;

&lt;p&gt;So the main conclusion I can extract from today's topic might be obvious to you already.&lt;/p&gt;

&lt;p&gt;I think the most universal and truly safe way of abstracting a domain ontology from the very first steps of any project is to define the roughest, most generalized, and most obvious entities as Abstract Data Types based on &lt;em&gt;the operations on them, not on their "names".&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, we will most likely proceed to detail the operations on our ADTs in every case, but we &lt;strong&gt;should&lt;/strong&gt; start by &lt;strong&gt;designing the minimum possible number of operations&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;Remember that if &lt;code&gt;f: G -&amp;gt; H&lt;/code&gt; is a homomorphism for a specific operation, then &lt;code&gt;f&lt;/code&gt; preserves that operation.&lt;br&gt;
Different operations &lt;em&gt;might&lt;/em&gt; have different corresponding homomorphisms, though of course not necessarily for &lt;em&gt;every&lt;/em&gt; possible operation on &lt;code&gt;G&lt;/code&gt;.&lt;/p&gt;
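&lt;p&gt;A standard concrete example, not from the diagram above: &lt;code&gt;len&lt;/code&gt; maps strings under concatenation to integers under addition, and it preserves exactly that operation.&lt;/p&gt;

```python
# A map f is a homomorphism for an operation when it preserves that operation.
# len maps strings (with concatenation) to ints (with addition):
a, b = "abra", "cadabra"
assert len(a + b) == len(a) + len(b)

# A different map preserves a different structure: str.lower also preserves
# concatenation, mapping strings to strings.
assert (a + b).lower() == a.lower() + b.lower()
```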

&lt;p&gt;And still, I am not a mathematician, but this "might" feels more like a "likely."&lt;/p&gt;

&lt;p&gt;I think the &lt;em&gt;number&lt;/em&gt; of possible homomorphisms really &lt;em&gt;matters&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Defining, "producing" a type/class will inevitably produce &lt;em&gt;operations&lt;/em&gt; for it, simply because a class should &lt;em&gt;do something&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When we are not thinking about operations in the first place, we might end up with a cumbersome and complicated type-system/class-hierarchy, trying to dance around polymorphism, method overloading and other ugly things.&lt;/p&gt;

&lt;p&gt;So why should we design systems thinking about phenomena/classes/entities if the operation is the &lt;em&gt;essence of homomorphism&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;From this, it directly follows that &lt;em&gt;calculations&lt;/em&gt; are the "backbone" of every software system, so it should be designed based on &lt;strong&gt;the operations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is so stupidly obvious, yet we so often stupidly ignore it, getting lost in vague "things."&lt;/p&gt;




&lt;p&gt;&lt;code&gt;It will not work in designing a complex system for big and complex technical specifications!&lt;/code&gt; - we could fairly argue.&lt;/p&gt;

&lt;p&gt;And it is true. Luckily, we have incredibly useful methods for analyzing and designing Abstract Data Type systems that make this process easier and even kind of straightforward.&lt;/p&gt;

&lt;p&gt;Like the one proposed by Bertrand Meyer.&lt;/p&gt;

&lt;p&gt;But it is a completely different story, which I will cover in another blog post.&lt;/p&gt;

&lt;p&gt;Stay tuned, and remember -&amp;gt; &lt;em&gt;calculations&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>softwaredesign</category>
      <category>abstractalgebra</category>
    </item>
    <item>
      <title>What is Abstraction?</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Tue, 07 May 2024 15:52:23 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/what-is-abstraction-44k1</link>
      <guid>https://forem.com/m0n0x41d/what-is-abstraction-44k1</guid>
      <description>&lt;p&gt;The definition of "Abstraction" in software engineering is expressed differently by various people, and often these views directly contradict each other.&lt;/p&gt;

&lt;p&gt;I asked myself - what really is "Abstraction"? How can we precisely define the meaning behind this word?&lt;/p&gt;

&lt;p&gt;Some say that interfaces are an abstraction; others say that interfaces are not an abstraction.&lt;/p&gt;

&lt;p&gt;Some say abstraction is when we find logically related parts of code and separate them into a distinct component; others say that abstraction in software engineering is a process where we focus on individual parts of the model as abstractions.&lt;/p&gt;

&lt;p&gt;What the hell is going on?&lt;/p&gt;

&lt;p&gt;And the crazy part is that these opinions not only come from random engineering blogs but from quite serious academic papers.&lt;/p&gt;

&lt;p&gt;Reading these materials, I realized that I don't fully understand what Abstraction in software engineering really means.&lt;/p&gt;

&lt;p&gt;I caught myself having a very vague idea - in one context, I considered interfaces as abstractions, in another, I could call a simple function an abstraction.&lt;/p&gt;

&lt;p&gt;I can no longer live in the darkness of ignorance, let's figure it out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limited Forms of Abstraction
&lt;/h2&gt;

&lt;p&gt;First, I would like to start with something quite primitive, which I naively-intuitively took for &lt;em&gt;abstraction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Function.&lt;/p&gt;

&lt;p&gt;But is a function really an abstraction?&lt;/p&gt;

&lt;p&gt;In the usual sense, functions separate one calculation from another (not always, of course, there are nightmares).&lt;/p&gt;

&lt;p&gt;Functions, in the context of lambda calculus, are often called abstractions. What do they &lt;em&gt;abstract&lt;/em&gt;?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lambda calculus is a formal system developed by Alonzo Church to formalize computability. And computability is a concept from the theory of algorithms and computer science, which defines which tasks &lt;em&gt;can&lt;/em&gt; be solved using algorithms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, in lambda calculus, functions abstract parts of an expression into hypothetical variables representing generic input for the function.&lt;/p&gt;

&lt;p&gt;For example, the abstraction of a function that takes one argument and returns it multiplied by three would look something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;λx. x * 3&lt;/p&gt;
&lt;/blockquote&gt;
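&lt;p&gt;In Python, the same abstraction is written almost verbatim:&lt;/p&gt;

```python
# The lambda term above: x is the abstracted, "hypothetical" variable
# standing in for any concrete input.
triple = lambda x: x * 3

print(triple(4))     # prints 12
print(triple("ab"))  # strings work too: "ababab"
```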

&lt;p&gt;This could potentially start a holy war - to consider functions from the perspective of high-minded lambda calculus and from the side of conventional mainstream imperative and OOP programming.&lt;/p&gt;

&lt;p&gt;Nevertheless, in our usual way of programming, by defining a function, we limit one piece of code from another, with brackets or indents, so despite disagreements in views, a function can still be seen as a form of abstraction, but a &lt;em&gt;very limited form&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A form that does not answer the question - what is abstraction.&lt;/p&gt;




&lt;p&gt;So what about unlimited form?&lt;/p&gt;

&lt;p&gt;Maybe factorizing code into neat modules is a process of extracting abstraction?&lt;/p&gt;

&lt;p&gt;Following the DRY (Don't repeat yourself) principle, we essentially perform &lt;em&gt;anti-unification&lt;/em&gt;, decomposing similar parts of code into separate functions and the like.&lt;/p&gt;

&lt;p&gt;But this, in general, leads us to the same example we discussed earlier in the context of a function. If this trick can be called an abstraction, it is still very limited.&lt;/p&gt;

&lt;p&gt;Anti-unification, particularly obsession with it, can lead to one terrible anti-pattern.&lt;/p&gt;

&lt;p&gt;Its name is &lt;em&gt;boxing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Boxing is when we cram a huge amount of code into one big function or method, the set of parameters for which automatically grows.&lt;/p&gt;

&lt;p&gt;Naively following the idea of de-duplicating code, we can easily lose focus on the specification and the logical domain to which these parts of code belong, so we will almost certainly end up with boxing.&lt;/p&gt;

&lt;p&gt;Think about it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Try to find examples of logic whose actual implementation is the same (or almost the same), but whose specification, whose sense, is &lt;em&gt;different&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Is it correct to "abstract" the implementations of such logics in our mixed domain into one function/method, even if hidden behind multiple dispatch and similar?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Not Abstractions at All
&lt;/h2&gt;

&lt;p&gt;Well... Abstraction... Where it is...&lt;/p&gt;

&lt;p&gt;Maybe the process of organizing code into separate modules, or into an OOP class hierarchy, is the process of abstracting things?&lt;/p&gt;

&lt;p&gt;Yes, of course, we can split the code into several modules and hide a lot of logic in class hierarchies, and every time we return to the project to add a new feature, we will switch between four files/modules to remember how and what actually works here.&lt;/p&gt;

&lt;p&gt;Familiar?&lt;/p&gt;

&lt;p&gt;Is this abstraction?&lt;/p&gt;

&lt;p&gt;What's the difference between &lt;code&gt;entity.getTotalSomething()&lt;/code&gt; and &lt;code&gt;getTotalSomething(entity)&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Looking at this example from a meta level, from a &lt;a href="https://ivanzakutnii.com/blog/Levels-of-reasoning-about-software" rel="noopener noreferrer"&gt;specification level&lt;/a&gt;, I see no difference, they do the same thing.&lt;/p&gt;

&lt;p&gt;So what are we abstracting by organizing long chains of method calls or spreading the class hierarchy across several modules?&lt;/p&gt;

&lt;p&gt;I feel this is not abstraction, but the stirring of implementational "water" in a mortar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Well, What About Interfaces?
&lt;/h2&gt;

&lt;p&gt;An interface, in the broad sense, is essentially a mechanism for grouping several implementations of one "function", allowing a call to that function to dispatch to any of the defined implementations.&lt;/p&gt;

&lt;p&gt;Type classes and parametric polymorphism/generics are intended for the same purpose.&lt;/p&gt;

&lt;p&gt;Well, Abstraction!&lt;/p&gt;

&lt;p&gt;Here we have an implementation hidden behind some "interface", we abstracted it! No? Yes? What's wrong!?&lt;/p&gt;

&lt;p&gt;I don't like it. I don't like that nothing forces all these different implementations to have &lt;em&gt;any relation to each other&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can create such a mess that one and the same interface will be implemented in completely different ways, or even make it so that two different implementations will perform completely opposite things.&lt;/p&gt;
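&lt;p&gt;Here is a tiny invented example of exactly this mess: two classes satisfying the same implicit interface while doing opposite things.&lt;/p&gt;

```python
# Nothing in the "interface" forces the implementations to agree:
# both classes satisfy the same duck-typed contract, yet do opposite things.

class Adder:
    def apply(self, x, y):
        return x + y

class Subtractor:
    def apply(self, x, y):
        return x - y  # same "interface", opposite behavior

def run(op, x, y):
    return op.apply(x, y)

print(run(Adder(), 5, 3), run(Subtractor(), 5, 3))  # prints "8 2"
```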

&lt;p&gt;And how can this be &lt;em&gt;Abstraction&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;At this stage of the search, I realized what I expect from Abstraction.&lt;/p&gt;

&lt;p&gt;It feels that it should be some sort of &lt;em&gt;specification&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can call it "abstract specification", yes, &lt;em&gt;abstract&lt;/em&gt;, but &lt;strong&gt;specification&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It seems that I have finally managed to at least approach the correct, albeit still semi-intuitive definition of abstraction.&lt;/p&gt;

&lt;p&gt;But I still can't find an example of what abstraction looks like in code: what can it be expressed by, how is it represented?&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentic Abstraction
&lt;/h2&gt;

&lt;p&gt;The starting point for the search was to find a concrete, formal definition of Abstraction.&lt;/p&gt;

&lt;p&gt;As I mentioned at the beginning, I tried to find it, but encountered completely opposite formulations in meaning.&lt;/p&gt;

&lt;p&gt;The fact is that I was just looking in the wrong places.&lt;/p&gt;

&lt;p&gt;Taking a broad step away from the mainstream towards Holy Computer Scientists of the Past, I found commonality in definitions, even in seemingly quite different areas - they did not contradict each other and carried a similar sense, which can be expressed as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Abstraction is the mapping of a specific subject area of the &lt;em&gt;dirty&lt;/em&gt;, contradictory, and confusing real world into &lt;em&gt;something pure&lt;/em&gt;, idealized, capable of being described mathematically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sounds complicated, but the following quote from &lt;em&gt;Edsger W. Dijkstra&lt;/em&gt; clarifies everything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Being abstract is something &lt;em&gt;profoundly&lt;/em&gt; different from being vague … The purpose of abstraction is not to be vague, but to create a &lt;em&gt;new semantic level&lt;/em&gt; in which one can be absolutely precise.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aha! Semantic Level!&lt;/p&gt;




&lt;p&gt;So, if we define abstraction from such a perspective, does it turn out that all the previously considered "mechanisms" of programming are not abstractions after all? Or are they?&lt;/p&gt;

&lt;p&gt;Yes, our OOP classes and even their hierarchy should somehow reflect the subject area, but the &lt;em&gt;purity&lt;/em&gt; (as a concept) of this reflection is highly questionable.&lt;/p&gt;

&lt;p&gt;The same applies to functions, especially &lt;em&gt;dirty&lt;/em&gt; imperative ones filled with side effects.&lt;/p&gt;

&lt;p&gt;The most pure and authentic abstraction that comes to mind is &lt;em&gt;numbers&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The number 108 can be represented in various forms, numeral systems, signs, as a mathematical set...&lt;/p&gt;

&lt;p&gt;Mathematical operations, like addition and multiplication, also have many different implementations, but we never think about them when using calculators.&lt;/p&gt;

&lt;p&gt;And how many interpretations occur around electrical signals, just to see a picture of a doggo on the screen?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It looks like, when working with good abstractions, we don't even realize that we are working with an abstraction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And it follows that abstraction, in its true and authentic sense, is represented in the code as...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It is not present at all.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Let me explain my thoughts here.&lt;/p&gt;

&lt;p&gt;Abstraction in principle cannot "be located" somewhere, because it is a &lt;em&gt;representation&lt;/em&gt; itself, in the sense of some &lt;em&gt;pattern&lt;/em&gt; that we overlay on the modeled world, not on specific entities of this world.&lt;/p&gt;

&lt;p&gt;It is incorrect to ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Is &lt;code&gt;A&lt;/code&gt; an abstraction of &lt;code&gt;B&lt;/code&gt;?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is more correct to ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Will the &lt;em&gt;mapping&lt;/em&gt; from &lt;code&gt;A&lt;/code&gt; to &lt;code&gt;B&lt;/code&gt; &lt;em&gt;be&lt;/em&gt; an abstraction?&lt;br&gt;
Are all the necessary operations of &lt;code&gt;B&lt;/code&gt; preserved in the abstraction of &lt;code&gt;A&lt;/code&gt;?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, stop. But we have already remembered that there is a "mapping" from each implementation of a function to its interface!&lt;/p&gt;

&lt;p&gt;And to each call of a function, from its implementation!&lt;/p&gt;

&lt;p&gt;Yes, that's true! But it seems we have found the reason why, looking at some interfaces in the code base, we understand nothing, and looking at others, we can make quite correct predictions about what will happen as a result of executing this code - &lt;em&gt;Poor abstraction VS. Good one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Observing and working with good abstractions, we are able to make quite accurate predictions about the outcomes of the considered code, because "Abstraction" &lt;em&gt;is represented&lt;/em&gt; precisely, expressively, forming that very "&lt;em&gt;semantic level&lt;/em&gt;" mentioned by Dijkstra.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Looking through such &lt;em&gt;quality&lt;/em&gt; representation, we are able to observe the meaning of the code easily.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;And yet, why does abstraction &lt;em&gt;not&lt;/em&gt; look like anything in code?&lt;/p&gt;

&lt;p&gt;Because it's a separate dimension altogether, and it's not that abstraction should be associated with code, but &lt;em&gt;code should be associated with abstraction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For example, suppose we have a method/function whose responsibility is to transfer money from one account to another.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;specifications&lt;/em&gt; of such a function can be described quite differently:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We can describe all the permissible variants of behavior - success or failure of the transaction, under what conditions it is performed or not performed;&lt;/li&gt;
&lt;li&gt;We can describe it as a mapping to the abstract domain to which this function belongs. Most likely such a function belongs to the domain of some "client", so the abstraction of this function can be described as a mapping to that domain - for example, we can say that this function should perform a transaction and return the set of transactions after attempting a transfer, including both successful and unsuccessful ones;&lt;/li&gt;
&lt;li&gt;Ultimately, we can simply literally describe the implementation of the function, its purpose and responsibility - "function N attempts to transfer X of something from there to there";&lt;/li&gt;
&lt;/ol&gt;
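
&lt;p&gt;As a sketch of this idea (all names here are hypothetical, invented for illustration), here is one concrete transfer function together with two separate specifications it can be mapped onto - neither of which lives "in" the code itself:&lt;/p&gt;

```python
# All names are hypothetical, invented for illustration.

def transfer(accounts, src, dst, amount):
    """Concrete implementation: move `amount` between two account balances."""
    if accounts[src] >= amount:
        accounts[src] -= amount
        accounts[dst] += amount
        return ("success", src, dst, amount)
    return ("failure", src, dst, amount)

# Specification 1: permissible behaviors - success exactly when funds suffice.
def spec_behavior(before, result, src, amount):
    expected = "success" if before[src] >= amount else "failure"
    return result[0] == expected

# Specification 2: mapping to the abstract domain - total money is conserved.
def spec_conservation(before, after):
    return sum(before.values()) == sum(after.values())

accounts = {"a": 100, "b": 0}
before = dict(accounts)
result = transfer(accounts, "a", "b", 30)
assert spec_behavior(before, result, "a", 30)
assert spec_conservation(before, accounts)
```

&lt;p&gt;Both checks pass for the same implementation, which is exactly the point: the specifications are abstractions &lt;em&gt;of&lt;/em&gt; the function, not parts of it.&lt;/p&gt;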

&lt;p&gt;As we can see, any of these specifications can be used. They are separated not only from the code but also from the domain; it would be absurd to claim that our function is a specification (or anything else, such as the class to which this function belongs as a method, and so on).&lt;/p&gt;

&lt;p&gt;Therefore, it turns out that the code is associated with &lt;em&gt;abstractions&lt;/em&gt;, not the other way around.&lt;/p&gt;

&lt;p&gt;Yes, with &lt;em&gt;a multitude&lt;/em&gt; of abstractions, because such formal descriptions, if you want - &lt;em&gt;ideas&lt;/em&gt;, can be infinite.&lt;/p&gt;

&lt;p&gt;All abstractions are precise in their own way, within the space of meaning they try to impose. Nevertheless, we are able to determine which abstraction is more "correct", more suitable.&lt;/p&gt;

&lt;p&gt;In general, good abstract states always contain less information than concrete ones.&lt;/p&gt;

&lt;p&gt;And this is less obvious than it seems.&lt;/p&gt;

&lt;p&gt;Trying to "hide" already existing "information" in the code behind some "false" abstraction, we engage in anti-unification and slide into boxing.&lt;/p&gt;

&lt;p&gt;When we focus on the meta level, reflecting on and designing code through clean, true abstractions, there is less chance of sliding into boxing, because we operate with representations and abstract domains.&lt;/p&gt;




&lt;p&gt;So, I don't know about you, but I think I've found it.&lt;/p&gt;

&lt;p&gt;Abstraction is a "pattern" that we overlay on the modeled external world, resulting in program code.&lt;/p&gt;

&lt;p&gt;Only a good abstraction allows us, looking at the chains of calls and structures in the code, to predict the meaning and purpose of these calls precisely enough.&lt;/p&gt;

&lt;p&gt;The goal of abstraction is to show whether specific code has some systemic property, whether it is an element of system design, a certain feature, and so forth.&lt;/p&gt;

&lt;p&gt;Abstractions are not expressed in the code by anything or in any way.&lt;/p&gt;

&lt;p&gt;They are very similar to abstract data types. We could even say something like - "Good abstract data types express good abstractions," but that &lt;em&gt;would not be correct&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Good abstractions are &lt;em&gt;completely separate&lt;/em&gt; from the code, and even from the abstract domain.&lt;/p&gt;

&lt;p&gt;Good abstraction is a supra-system entity, which is not always explicitly distinguished in classes/objects, moreover, &lt;em&gt;abstraction does not always need to be somehow distinguished.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>softwaredesign</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>I have tried to love Python</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Wed, 17 Apr 2024 15:03:35 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/i-have-tried-to-love-python-1img</link>
      <guid>https://forem.com/m0n0x41d/i-have-tried-to-love-python-1img</guid>
      <description>&lt;p&gt;This isn't clickbait, seriously. I'm not here to trash Python, just sharing my thoughts.&lt;/p&gt;

&lt;p&gt;For the past 9 months, I've been a Platform Engineer at a company that heavily utilizes Python, and it’s not confidential: we are hiring.&lt;/p&gt;

&lt;p&gt;This isn't my first rodeo with Python. In fact, I have quite a warm history with this programming language.&lt;/p&gt;

&lt;p&gt;It was essentially my first language: I learned to code with it by solving Leetcode-like problems and studying courses on algorithms and data structures, also in Python.&lt;/p&gt;

&lt;p&gt;That was a great time, no regrets.&lt;/p&gt;

&lt;p&gt;Despite Python being dynamically typed and not letting you encode an algorithm or structure "maturely" with allocs/mallocs like in C, its simplicity and expressiveness made it incredibly useful for jumping into CS, especially for beginners like I was.&lt;/p&gt;

&lt;p&gt;Because Python is good.&lt;/p&gt;

&lt;h2&gt;
  
  
  I loved Python
&lt;/h2&gt;

&lt;p&gt;Actually, the first programming language I learned and wrote something in wasn’t Python, but rather JS (let’s not talk about school years, because programming classes in schools in my country are a disaster).&lt;/p&gt;

&lt;p&gt;Nevertheless, I referred to Python as my first programming language because I really did program a lot in it—various academic/learning stuff, small applied projects, scripts for automating routine tasks on my Linux PC, and during my time as a DBRE.&lt;/p&gt;

&lt;p&gt;I dabbled in Django here and there, just a bit of everything.&lt;/p&gt;

&lt;p&gt;I continued learning CS, programming paradigms, OOP, and I kept loving Python.&lt;/p&gt;

&lt;p&gt;When I moved into DevOps, I was deepening my understanding of programming  paradigms and getting to grips with the foundational pillars of "how things work."&lt;/p&gt;

&lt;p&gt;That was the point in time when I embraced the holiness of &lt;em&gt;static type systems&lt;/em&gt;, and overall, it feels like at that particular point I formed the mindset and development path I should and want to follow.&lt;/p&gt;

&lt;p&gt;Naturally, as I continued working in DevOps, I came across Go and started learning it.&lt;/p&gt;

&lt;h2&gt;
  
  
  I forgot Python
&lt;/h2&gt;

&lt;p&gt;I started working as a DevOps engineer, not an SRE; mostly OPS.&lt;br&gt;
Unfortunately, there wasn’t much programming involved, and most of my work time was spent on CI/CD stuff and administering Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpx0c5i6yzny67zm1qgr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpx0c5i6yzny67zm1qgr.jpeg" alt="Bash Canceled" width="800" height="656"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Almost at the beginning of this period, my son was born, and I didn't have much free time for side projects or significant programming, but all the free time I had, I continued to invest in studying Computer Science and Golang.&lt;/p&gt;

&lt;p&gt;I fell in love with Go almost immediately, with its blend of simplicity and strictness.&lt;/p&gt;

&lt;p&gt;Over time, as I had more "free time," I programmed some educational projects in Go: a few web apps, a few CLIs, a small k8s listener for cleaning up review environments.&lt;/p&gt;

&lt;p&gt;I continued working as a Dev&lt;em&gt;OPS&lt;/em&gt; engineer, dreaming of moving to full-time, or at least "more"-time software development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hello old friend
&lt;/h2&gt;

&lt;p&gt;Eventually, I got the job I was looking for — a place where I could code more while still being in touch with infrastructure.&lt;/p&gt;

&lt;p&gt;Because I still love infrastructure and DevOps; yeah — Platform Engineering. And all in Python.&lt;/p&gt;

&lt;p&gt;I remember my first thoughts when I was getting ready for this job:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Damn, an interpreted language with a dynamic type system, do I... Want to write in Python?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wanted to code more so badly that I absolutely didn't care what language I would be programming in at my next job, a job where I wouldn’t lose income and could do what I love.&lt;/p&gt;

&lt;p&gt;I don't regret anything; these months I've learned a hell of a lot about Python compared to what I knew before.&lt;/p&gt;

&lt;h2&gt;
  
  
  I tried to love Python again
&lt;/h2&gt;

&lt;p&gt;Before coming to my current position, I didn't know about and didn't use type annotations in Python.&lt;/p&gt;

&lt;p&gt;I didn't know about the marvelous Pydantic; there was a lot I didn't know simply because I had never written anything really mature and big in Python, as I mentioned earlier, using it to learn programming, fundamental things in software design and other CS concepts.&lt;/p&gt;

&lt;p&gt;I learned all this thanks to one person, and I'm grateful to him for everything. This is my onboarding buddy Stanislav Zmiev, and he &lt;em&gt;really&lt;/em&gt; is good at Python.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Just stalk his &lt;a href="https://www.linkedin.com/in/zmievsa/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, of course, I can't now see Python without type annotations, I mean for big projects, and even medium ones — anything that's more than a script and has 2+ modules and is not covered by type annotations is &lt;em&gt;abominable&lt;/em&gt; to me; I'm sorry, it just is.&lt;/p&gt;




&lt;p&gt;This doesn't mean that I despise Python or hate it, or don't want to program in it at work, but I can't &lt;em&gt;sincerely love it&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I can't forget all the wonderful (and a few not-so-wonderful) experiences of programming in Golang. I can't close my eyes to Python's interpreted nature.&lt;/p&gt;

&lt;p&gt;Simply put, I am not ready and will never be ready to dedicate my life to specializing in Python.&lt;/p&gt;

&lt;p&gt;By dedicating my life, I mean programming only in Python, both at work and in side-projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things get really hot
&lt;/h2&gt;

&lt;p&gt;For the last few months, things have gotten really hot, so hot that I am now a Platform Team Leader. Hold my tea, I need to move some tickets.&lt;/p&gt;

&lt;p&gt;It just turned out that way; I never aspired to this. But it is what it is, and it automatically means - &lt;em&gt;I code less at work&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I still have free time for side projects, as I did before.&lt;/p&gt;

&lt;p&gt;And here comes a really interesting thing that hit me just a few days ago:&lt;/p&gt;

&lt;p&gt;For months, I racked my brain over the question: "Okay, what should I slap together in Python on GitHub, maybe this? Or that? Or start investing my time in this open-source project as a contributor? I want to code something interesting and useful as a side hustle..."&lt;/p&gt;

&lt;p&gt;But I was stalling. All this time, I was convinced that I was stalling because I didn't have enough free time, or because I was an inexperienced fool and couldn't figure out the Python codebase. Or because I am lazzy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is funny, lazzy, hehe...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hell, &lt;strong&gt;no&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I just don't want to invest my &lt;strong&gt;most&lt;/strong&gt; precious resource, which is  &lt;em&gt;free time&lt;/em&gt; — into programming in Python.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that's unlikely to change.&lt;/p&gt;

&lt;p&gt;I'm still okay with writing in it for money; there, the language doesn't really matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  My dear blue gopher, are you alive?
&lt;/h2&gt;

&lt;p&gt;For almost a year, I didn't even look at what was happening with Golang, wrote nothing in it, didn't even read much news — I just didn't have time and was &lt;em&gt;forcing&lt;/em&gt; myself into other stuff.&lt;/p&gt;

&lt;p&gt;The language and its ecosystem seem really alive, and I'm sincerely glad about that.&lt;/p&gt;

&lt;p&gt;I have a bunch of good ideas for side projects that I want to implement in Go, starting with simple and smaller ones, just to refresh my memory of the language (though is there much to remember?), and to get better at it.&lt;/p&gt;




&lt;p&gt;Stas, I hope you're not mad at me if you're reading this, but knowing how you love Python and how you can be tempted by clickbaits — chances are there.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't smack me up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Besides Stas, I'm surrounded by a whole bunch of incredibly talented and wonderful Python programmers. Python itself is remarkable in what it's remarkable for, and people coding in Python are just remarkable people.&lt;/p&gt;

&lt;p&gt;I &lt;em&gt;enjoy&lt;/em&gt; writing code in Python at work, and I repeat — I am not a Python hater, at least as long as there are type annotations...&lt;/p&gt;

&lt;p&gt;But it's just not something that can penetrate my strictly typed, tough heart.&lt;/p&gt;

&lt;p&gt;And unfortunately, I can't find in Python anything that could melt it: things that aren't in Go and that I would need.&lt;/p&gt;

&lt;p&gt;Pure OOP? No, that's not one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  But... Rust?
&lt;/h2&gt;

&lt;p&gt;Ok, keep your claws away.&lt;/p&gt;

&lt;p&gt;I've tried, and I don't want to spend my spare time here. At least &lt;em&gt;for now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I admire all the beauties of the Rust PL, but I am still quite disgusted by a few things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I do know&lt;/strong&gt; that with time you can live with it and get used to it.&lt;/p&gt;

&lt;p&gt;Humans are really flexible things, but for now... I don't want to, and really can't see any big tradeoffs for me just to rush into Rust and spend a lot of time trying to climb this huge, steep learning curve.&lt;/p&gt;

&lt;p&gt;I want to code things NOW in a statically typed compiled programming language that I already know well, let me code for God's sake.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gopher is &lt;em&gt;furry enough&lt;/em&gt; for me.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Message
&lt;/h2&gt;

&lt;p&gt;The message is simple, and it is not about Python or Go:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;be kind to yourself and listen to yourself carefully.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a mad rush to make only the "right" choices, it is really easy to lose the light inside.&lt;/p&gt;

&lt;p&gt;It is easy to forget about impermanence and that there is never just one right choice.&lt;/p&gt;

&lt;p&gt;Things may seem right and profitable at first, second, and even third glance...&lt;br&gt;
Yet they still might not be right or truly profitable.&lt;/p&gt;

&lt;p&gt;Sometimes, it might &lt;em&gt;just&lt;/em&gt; not be your thing — without any analytical reasoning.&lt;/p&gt;

&lt;p&gt;Python is no longer the winner of my heart, in both senses — based on my honest introspective experience, and from a &lt;em&gt;just&lt;/em&gt; perspective.&lt;/p&gt;

&lt;p&gt;Do what you love, live your life with joy, and love yourself.&lt;/p&gt;

&lt;p&gt;Peace.&lt;/p&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>HighLoad Saga. Part Three: Transaction Processing and Analytics</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Tue, 05 Mar 2024 17:22:25 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/highload-saga-part-three-transaction-processing-and-analytics-2m1c</link>
      <guid>https://forem.com/m0n0x41d/highload-saga-part-three-transaction-processing-and-analytics-2m1c</guid>
      <description>&lt;p&gt;When we say "transaction," we refer to a group of operations (read/write in various combinations) that reflect a single logical operation for data handling, executed atomically - either all changes are applied, or the database state is rolled back to its state at the beginning of the transaction.&lt;/p&gt;

&lt;p&gt;Formally, a transaction is expected to possess &lt;em&gt;ACID&lt;/em&gt; properties, which stand for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomicity, as already mentioned;&lt;/li&gt;
&lt;li&gt;Consistency, meaning each transaction brings the database from one logically consistent state to another,&lt;/li&gt;
&lt;li&gt;Isolation, meaning the ability of one transaction to operate independently from other transactions,&lt;/li&gt;
&lt;li&gt;Durability, implying that if a transaction is successful, its results are permanently fixed in the database, regardless of database management system failures, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, in practice, transaction processing essentially means the ability for clients to perform read and write operations with low latency and the assurance that in case of failure, data corruption or breach of its logical integrity will not occur.&lt;/p&gt;

&lt;p&gt;A basic case of such a "user" transaction is an application using an index to find a small number of records by a specific key. Based on user-provided data, new records are added or existing ones are updated. Most such applications are interactive, leading to the access pattern being termed &lt;em&gt;online transaction processing&lt;/em&gt; (OLTP).&lt;/p&gt;
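
&lt;p&gt;As a minimal sketch of these guarantees (table and account names are illustrative), here is how SQLite rolls back a failed transfer atomically, so no partial update ever becomes visible:&lt;/p&gt;

```python
import sqlite3

# Illustrative table and account names; a toy OLTP-style transfer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(src, dst, amount):
    # `with conn` wraps the two updates in one transaction:
    # commit on success, rollback on any exception.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer("alice", "bob", 60)       # succeeds atomically
try:
    transfer("alice", "bob", 60)   # would overdraw: CHECK fails, rolled back
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
assert balances == {"alice": 40, "bob": 60}   # no partial update survived
```

&lt;p&gt;The second transfer fails as a whole: neither the debit nor the credit is applied, which is exactly the atomicity half of ACID.&lt;/p&gt;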




&lt;p&gt;Database management systems are increasingly used for analytical data processing and in data science, where access patterns and requirements differ significantly.&lt;/p&gt;

&lt;p&gt;A typical analytical query involves selecting a vast number of records, reading only a few columns in each, and calculating aggregate statistical indicators instead of returning raw data to the user.&lt;/p&gt;

&lt;p&gt;To distinguish this database application pattern from transaction processing, it has been named &lt;em&gt;online analytical processing&lt;/em&gt; (OLAP).&lt;/p&gt;

&lt;p&gt;Thus, OLTP implies arbitrary access to data and database write operations with low latency based on client data. Here, the database always reflects the current data state at the moment of the transaction, and the size of such databases can reach terabytes.&lt;/p&gt;

&lt;p&gt;OLAP, on the other hand, usually involves either a long-term group data import, with possible transformation, known as &lt;em&gt;ETL - Extract, Transform, Load&lt;/em&gt;, or processing a continuous stream of certain events. Typically, such a database stores not only current data but also the history of their changes, and its size is measured in petabytes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Warehouses
&lt;/h2&gt;

&lt;p&gt;Interestingly, SQL has proven to be quite flexible; it performs just as well in OLAP as in OLTP. Despite this, not so long ago, a trend emerged for creating special &lt;em&gt;data warehouses&lt;/em&gt;, with large companies starting to move analytics away from OLTP systems to specialized databases.&lt;/p&gt;

&lt;p&gt;A large enterprise has many transaction processing subsystems, each typically complex, supported by a separate team of engineers, and almost always operated independently of others.&lt;/p&gt;

&lt;p&gt;From OLTP systems, we expect high availability and rapid transaction processing because these systems are often critical to business operations.&lt;/p&gt;

&lt;p&gt;Due to these requirements/expectations, we are reluctant to allow business analysts to run analytical queries on these databases simply because these queries are almost always quite complex and resource-intensive, as they involve viewing large data sets, which naturally can adversely affect the performance of other transactions being executed in parallel with such analytical queries.&lt;/p&gt;

&lt;p&gt;A data warehouse, however, is a separate database that analysts can query as they wish, without disrupting business operations.&lt;/p&gt;

&lt;p&gt;The warehouse contains a read-only copy of data from all of the company's OLTP systems. Data is extracted from OLTP databases through periodic data dumps or a continuous stream of data updates, transformed into an analysis-friendly format, cleaned, and then loaded into the warehouse.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the ETL process.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Data warehouses are specifically optimized for analytical queries.&lt;/p&gt;

&lt;p&gt;The indexing algorithms that work well for OLTP are not as effective for responding to analytical queries.&lt;/p&gt;

&lt;p&gt;The data warehouse model is often relational, as SQL generally suits analytical queries well.&lt;/p&gt;

&lt;p&gt;Thus, data warehouses and relational OLTP databases look similar (since both have an SQL interface for queries).&lt;/p&gt;

&lt;p&gt;However, the internal structure of these systems differs.&lt;/p&gt;

&lt;h2&gt;
  
  
  OLAP Data Models
&lt;/h2&gt;

&lt;p&gt;In OLTP, a wide variety of data models cater to different needs.&lt;/p&gt;

&lt;p&gt;However, in analytics, the variety of models is much smaller.&lt;/p&gt;

&lt;p&gt;Many data warehouses operate on a standardized pattern known as the &lt;em&gt;star schema&lt;/em&gt;, also known as &lt;em&gt;dimensional modeling&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this model, the &lt;em&gt;fact table&lt;/em&gt; is at the center. Each row of this table reflects an event that occurred at a specific point in time.&lt;/p&gt;

&lt;p&gt;Depending on the nature and objectives of the business, each row of this table may reflect a page view, a user action in the system, a product purchase, and so on.&lt;/p&gt;

&lt;p&gt;Facts usually enter the warehouse as separate events, which is very convenient, but we must always be aware of the potential for growth in the size of this table and be prepared for it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In large corporations, data warehouses store petabytes of transaction history, with fact tables making up a large portion of this history.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If a row in the fact table reflects an &lt;em&gt;event&lt;/em&gt;, then &lt;em&gt;dimensions&lt;/em&gt; correspond to the characteristics of "who," "what," "where," "when," "how," and "why" related to this event.&lt;/p&gt;

&lt;p&gt;In short, part of the columns in the fact table are the event's attributes, and the rest are foreign keys to dimension tables.&lt;/p&gt;

&lt;p&gt;That's why this schema is called a star.&lt;/p&gt;
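
&lt;p&gt;A toy sketch of a star schema (all table and column names are invented for illustration) might look like this in SQLite, together with a typical aggregate query joining the fact table to a single dimension:&lt;/p&gt;

```python
import sqlite3

# Invented toy schema: one fact table referencing two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    price      INTEGER
);
""")
db.executemany("INSERT INTO dim_date VALUES (?, ?)",
               [(1, "2024-03-01"), (2, "2024-03-02")])
db.executemany("INSERT INTO dim_product VALUES (?, ?)",
               [(1, "tea"), (2, "coffee")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
               [(1, 1, 3, 5), (1, 2, 1, 7), (2, 1, 2, 5)])

# "Revenue per product": scan the facts, join one dimension, aggregate.
rows = db.execute("""
    SELECT p.name, SUM(f.quantity * f.price)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
assert rows == [("coffee", 7), ("tea", 25)]
```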




&lt;p&gt;In an alternative version of the "star" pattern, known as the "snowflake" schema, data is further divided into sublevels.&lt;/p&gt;

&lt;p&gt;Instead of storing all dimension information in one table, the "snowflake" schema uses separate tables, sub-dimensions, for more detailed categorization of data.&lt;/p&gt;

&lt;p&gt;The "snowflake" schema provides a higher degree of data normalization compared to the "star" schema, making data more structured and orderly. However, the "star" pattern is more commonly used in business analytics for its convenience and simplicity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Columnar Storage
&lt;/h2&gt;

&lt;p&gt;In a data warehouse, tables can be extremely wide. This applies to both the fact table, which may have hundreds of columns, and dimension tables, which can become very wide due to the many metadata columns required for analytics.&lt;/p&gt;

&lt;p&gt;If fact tables contain trillions of rows and petabytes of data, storing and querying them efficiently becomes a challenging task.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dimension tables are usually much smaller, so we will focus on fact storage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although fact tables often exceed 100 columns in width, a typical query to the warehouse only accesses a few columns!&lt;/p&gt;

&lt;p&gt;In most OLTP databases, storage is organized row-wise: all values from one table row are stored next to each other. Document-oriented databases are similarly structured: the entire document is usually stored as a continuous byte sequence.&lt;/p&gt;

&lt;p&gt;If OLAP databases followed a similar row-wise implementation, the storage subsystem would need to load all rows (each with 100 or more columns) into memory, parse them syntactically, and filter the result based on the conditions specified in the query.&lt;/p&gt;

&lt;p&gt;This is highly inefficient.&lt;/p&gt;

&lt;p&gt;The idea of &lt;em&gt;columnar storage&lt;/em&gt; emerged as an optimization: storing values from the same column together, rather than from the same row.&lt;/p&gt;

&lt;p&gt;If each column is stored, for example, in a separate file, a query only needs to read and parse the required columns.&lt;/p&gt;

&lt;p&gt;However, column-based data storage requires that the files of all columns contain rows in the same order. Consequently, to assemble a whole row, e.g., the 128th row, one must take the 128th element from all the column files and compile them together.&lt;/p&gt;
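
&lt;p&gt;A tiny sketch of this invariant (column names and data invented): each column is stored as its own sequence, and row &lt;code&gt;n&lt;/code&gt; is reassembled by taking element &lt;code&gt;n&lt;/code&gt; from every column:&lt;/p&gt;

```python
# Column names and data are invented for illustration.
columns = {
    "user_id": [7, 8, 9, 7],
    "country": ["DE", "US", "DE", "FR"],
    "amount":  [10, 25, 10, 40],
}

def read_row(n):
    # works only because every column keeps rows in the same order
    return {name: values[n] for name, values in columns.items()}

# an analytical query reads just the columns it needs, not whole rows
total_amount = sum(columns["amount"])

assert read_row(2) == {"user_id": 9, "country": "DE", "amount": 10}
assert total_amount == 85
```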

&lt;p&gt;Besides loading only the columns needed for a query from the disk, it's also possible to further reduce disk bandwidth requirements by compressing the data.&lt;/p&gt;

&lt;p&gt;Since columns often contain similar and semantically close values, they compress very well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Depending on the data contained in a column, various compression methods are applied. One of the methods, particularly effective for data warehouses, is bitmap encoding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Sorting Order in Columnar Storage
&lt;/h3&gt;

&lt;p&gt;In columnar storage, the order of row storage, at first glance, seems to have little effect.&lt;/p&gt;

&lt;p&gt;In a previous post, we discussed the simplest write operation - appending to the end of a file. And this looks like a good option for such storage.&lt;/p&gt;

&lt;p&gt;But we can set the write order, just like in SSTables, and use it as an indexing mechanism.&lt;/p&gt;

&lt;p&gt;Obviously, sorting each column separately makes no sense, as it would then be unclear which column elements belong to the needed row.&lt;/p&gt;

&lt;p&gt;Remember, in columnar storage, we can assemble rows only because we know: the nth element of one column and the nth element of another column always correspond to the same nth row.&lt;/p&gt;

&lt;p&gt;So, if we want to sort something here, we need to sort the rows as a whole, despite the fact that data is stored by columns.&lt;/p&gt;

&lt;p&gt;Such sorting is useful if specific columns can be selected for sorting the table, based on knowledge of the most frequently executed queries.&lt;/p&gt;

&lt;p&gt;Sorting also helps to compress columns effectively. If the main sorting column has a small number of different values, long sequences of repeating identical values will appear after sorting. Simple encoding, for example, using bitmap schemes, can compress such a column to just a few kilobytes even with billions of rows in the table. Compression works best for the first sorting key. The second and third keys will be more mixed, hence, they won't have such long sequences of repeating values.&lt;/p&gt;
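
&lt;p&gt;A small sketch of this effect (data invented): after sorting, a low-cardinality column collapses into a handful of (value, run length) pairs under simple run-length encoding:&lt;/p&gt;

```python
from itertools import groupby

# Invented data: a column already sorted by the leading sort key.
country_sorted = ["DE"] * 4 + ["FR"] * 2 + ["US"] * 3

def rle_encode(values):
    # one (value, run_length) pair per run of identical values
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

encoded = rle_encode(country_sorted)
assert encoded == [("DE", 4), ("FR", 2), ("US", 3)]   # 9 values, 3 pairs
assert rle_decode(encoded) == country_sorted
```

&lt;p&gt;With billions of rows and only a few distinct values, the same scheme shrinks the leading sort column dramatically; later sort keys are more mixed and compress worse, as noted above.&lt;/p&gt;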

&lt;p&gt;For different queries, different sorting orders are better, so why not store differently sorted copies of the data?&lt;/p&gt;

&lt;p&gt;This serves as both data replication and optimization of typical queries.&lt;/p&gt;

&lt;p&gt;Having multiple sorting orders in columnar storage is akin to a group of secondary indexes in a traditional row-based storage.&lt;/p&gt;

&lt;p&gt;But an important difference is that in row-based storage, each row is stored in one place - in an unordered file or a clustered index, and secondary indexes only contain pointers to the corresponding rows.&lt;/p&gt;

&lt;p&gt;In columnar storage, there are usually no pointers to data - only columns containing values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing to Columnar Storage
&lt;/h3&gt;

&lt;p&gt;Data warehouses allow for various forms of optimization, as most of the load falls on the voluminous read-only queries performed by analysts. Columnar storage, compression, and sorting significantly speed up the execution of such queries. However, these warehouses have a serious drawback in the form of complicating write operations.&lt;/p&gt;

&lt;p&gt;The approach of updating data in place, used by B-trees, is impossible in the case of compressed columns. If it's necessary to insert rows in the middle of a sorted table, most likely all column files will need to be rewritten. Insertion must update all columns in a coordinated manner, since rows are defined by their position within the column.&lt;/p&gt;

&lt;p&gt;Fortunately, we've already explored a good solution to this problem: LSM-trees. Everything is first written to storage in RAM, where data is added to a sorted structure and prepared for disk writing.&lt;/p&gt;

&lt;p&gt;It doesn't matter whether the storage in RAM is columnar or row-based. Once enough data has accumulated, it is merged with the disk's column files and written in blocks to new files.&lt;/p&gt;
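
&lt;p&gt;A very rough sketch of this write path (an in-memory sorted buffer merged into a single sorted "disk" run; real LSM-trees manage many runs, compaction, and durability):&lt;/p&gt;

```python
import bisect
import heapq

# Invented miniature state: one sorted run "on disk" plus an in-memory buffer.
disk_run = [(1, "a"), (5, "e"), (9, "i")]
memtable = []

def write(key, value):
    # writes land in the sorted in-memory buffer first
    bisect.insort(memtable, (key, value))

def flush():
    # merge the two sorted sequences into a new sorted run, then reset
    global disk_run, memtable
    disk_run = list(heapq.merge(disk_run, memtable))
    memtable = []

write(3, "c")
write(7, "g")
flush()
assert disk_run == [(1, "a"), (3, "c"), (5, "e"), (7, "g"), (9, "i")]
```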

&lt;h2&gt;
  
  
  Memory Bandwidth and Vectorized Processing
&lt;/h2&gt;

&lt;p&gt;A significant bottleneck for data warehouse queries is that they have to scan millions or billions of rows, which becomes the bandwidth limitation for moving data from disk to memory.&lt;/p&gt;

&lt;p&gt;Moreover, for analytical databases, the efficient use of the data transfer rate from RAM to the CPU cache, avoiding various kinds of branch prediction errors and "bubbles" in the CPU instruction processing pipeline, as well as using the vector instructions (SIMD) of modern processors, becomes a problem.&lt;/p&gt;

&lt;p&gt;Besides reducing the volumes of data loaded from the disk, columnar storage schemes can also efficiently use CPU cycles.&lt;/p&gt;

&lt;p&gt;For example, the query processing subsystem may take a portion of data that fits well into the L1 cache of the processor and pass through it in a continuous loop.&lt;/p&gt;

&lt;p&gt;The processor can execute such a loop much faster than code containing many function calls and conditions for each processed record.&lt;/p&gt;

&lt;p&gt;Column compression allows more rows for one column to fit in the same volume of L1 cache.&lt;/p&gt;

&lt;p&gt;For working directly with such portions of compressed columnar data, classic bitwise OR/AND operations can be used. This method is known as vectorized processing.&lt;/p&gt;
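
&lt;p&gt;A sketch of the idea (data invented): each distinct column value gets a bitmap, here modeled as a Python integer, and a two-condition filter becomes a single bitwise AND over two bitmaps:&lt;/p&gt;

```python
import operator

# Invented data for two columns of the same table.
rows    = ["DE", "US", "DE", "FR", "US", "DE"]
segment = ["web", "web", "app", "web", "app", "web"]

def bitmap(values, wanted):
    # bit n of the result (weight 2 ** n) is set when row n equals `wanted`
    return sum(2 ** n for n, v in enumerate(values) if v == wanted)

de  = bitmap(rows, "DE")        # rows 0, 2, 5
web = bitmap(segment, "web")    # rows 0, 1, 3, 5

# "country = DE AND segment = web" is one bitwise AND over the two bitmaps
match = operator.and_(de, web)

# decode the set bits back into row numbers
matching_rows = [n for n in range(len(rows)) if (match // 2 ** n) % 2]
assert matching_rows == [0, 5]
```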

&lt;h2&gt;
  
  
  Aggregation: Data Cubes and Materialized Views
&lt;/h2&gt;

&lt;p&gt;Not all data warehouses for OLAP tasks are necessarily columnar: row-based DBs and several other architectures are also used. However, columnar warehouses perform much faster for arbitrary analytical queries, so their popularity is growing rapidly.&lt;/p&gt;

&lt;p&gt;Another important feature of data warehouses is materialized aggregate indicators. Queries to warehouses often include aggregation functions, such as COUNT, SUM, AVG, MIN, or MAX in SQL.&lt;/p&gt;

&lt;p&gt;If the same aggregation functions are used in a variety of different queries, it would be wasteful to recalculate raw data from scratch every time.&lt;/p&gt;

&lt;p&gt;Why not cache some of the aggregate indicators most frequently used in queries if the data hasn't changed?&lt;/p&gt;

&lt;p&gt;One way to create such a cache is a &lt;em&gt;materialized view&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the relational data model, it is often described similarly to a regular view: it is a table-like object whose content is the result of some query.&lt;/p&gt;

&lt;p&gt;The difference is that a materialized view is an actual copy of the query results, written to disk, while a virtual view is just a shorthand form of writing for queries.&lt;/p&gt;
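
&lt;p&gt;SQLite has no native materialized views, so this sketch emulates one by writing the query result into a real table; the contrast with the virtual view shows the staleness trade-off:&lt;/p&gt;

```python
import sqlite3

# Invented toy data; SQLite lacks MATERIALIZED VIEW, so we emulate it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("tea", 5), ("tea", 7), ("coffee", 4)])

# virtual view: re-executes the underlying query on every read
db.execute("CREATE VIEW v_totals AS "
           "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")

# "materialized" view: the query result itself is stored as a table
db.execute("CREATE TABLE m_totals AS "
           "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")

db.execute("INSERT INTO sales VALUES ('tea', 100)")
virtual = dict(db.execute("SELECT * FROM v_totals"))
materialized = dict(db.execute("SELECT * FROM m_totals"))

assert virtual == {"tea": 112, "coffee": 4}       # always current, recomputed
assert materialized == {"tea": 12, "coffee": 4}   # stale until refreshed
```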

&lt;p&gt;When reading from a virtual view, the SQL engine dynamically unfolds it into the underlying query, and then fully executes this unfolded query.&lt;/p&gt;

&lt;p&gt;In the case of changes to the data used in the aggregated query, it's necessary to update the materialized view, as it represents a denormalized copy of these data.&lt;/p&gt;

&lt;p&gt;The DBMS can perform updates automatically, but such manipulations increase the cost of write operations, so materialized views are rarely used in OLTP databases.&lt;/p&gt;

&lt;p&gt;However, in warehouses, where the main load falls on read operations, it makes sense to actively use them. Whether they actually improve the performance of read operations depends on the specific case.&lt;/p&gt;

&lt;p&gt;A common case of a materialized view is a data cube, or OLAP cube. It represents a grid of aggregate indicators, grouped by different dimensions.&lt;/p&gt;

&lt;p&gt;For example, suppose each fact in the fact table has foreign keys to only two dimension tables - say, date and product.&lt;/p&gt;

&lt;p&gt;We can build a two-dimensional table with products on one axis and dates on the other. Each cell contains an aggregate indicator of an attribute of all facts with such a combination of date and product.&lt;/p&gt;

&lt;p&gt;Then, we can apply the same aggregating function along each row or column and get totals, reduced by one dimension.&lt;/p&gt;
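&lt;p&gt;Here is a minimal sketch of the idea in Python (the fact rows are invented for illustration): the cube keeps one pre-aggregated cell per (date, product) pair, and collapsing either dimension yields the totals.&lt;/p&gt;

```python
# Sketch: a 2-D data cube - SUM of sales grouped by (date, product),
# plus marginal totals along each dimension.
from collections import defaultdict

facts = [  # (date, product, amount) rows from a fact table
    ("2024-01-01", "apple", 10),
    ("2024-01-01", "pear",   5),
    ("2024-01-02", "apple",  7),
]

cube = defaultdict(int)
for date, product, amount in facts:
    cube[(date, product)] += amount          # one cell per combination

by_date = defaultdict(int)
by_product = defaultdict(int)
for (date, product), total in cube.items():  # collapse one dimension
    by_date[date] += total
    by_product[product] += total

print(by_product["apple"])  # answered without rescanning raw facts
```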

&lt;p&gt;In general, facts often have more than two dimensions.&lt;/p&gt;

&lt;p&gt;Visualizing a five-dimensional hypercube is much harder, but the idea remains the same: each cell contains sales for the corresponding combination of date, product, store, advertising campaign, and customer, and these values can then be grouped successively along each dimension.&lt;/p&gt;

&lt;p&gt;The advantage of materialized cubes is that certain queries will be executed very quickly because the data for them were essentially pre-calculated.&lt;/p&gt;

&lt;p&gt;For example, if we need the total sales volume over the last few days for each store, we can simply read the totals along the corresponding dimensions - there's no need to analyze millions of rows.&lt;/p&gt;

&lt;p&gt;The drawback of this approach is that data cubes don't have the same flexibility as queries to raw data.&lt;/p&gt;

&lt;h2&gt;
  
  
  OLAP vs OLTP
&lt;/h2&gt;

&lt;p&gt;In practice, outside of analytics tasks, it is mostly OLTP systems that are used, and they handle a potentially huge number of queries.&lt;/p&gt;

&lt;p&gt;Developers optimize for this load by trying to touch only a limited number of rows in each query.&lt;/p&gt;

&lt;p&gt;The client program requests records using a specific key, and the storage subsystem employs an index to find data with the corresponding key.&lt;/p&gt;

&lt;p&gt;The bottleneck here is usually the seek time - moving to the required position on the disk.&lt;/p&gt;

&lt;p&gt;Data warehouses are mainly used by business analysts and handle much fewer queries than OLTP systems.&lt;/p&gt;

&lt;p&gt;However, almost all OLAP queries are usually very resource-intensive and require viewing millions or even billions of rows in a short time.&lt;/p&gt;

&lt;p&gt;The bottleneck here is usually disk bandwidth. For this reason, columnar storage is gaining popularity for this type of task.&lt;/p&gt;

&lt;p&gt;In OLTP, there are two main approaches to building data storage subsystems.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The log-structured approach, which only allows appending data to files and deleting outdated files, never updating a file once written. Examples include SSTables, LSM-trees, HBase, Lucene, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The in-place update approach, which views the disk as a set of fixed-size pages that can be rewritten. The main representatives of this philosophy are B-trees, used in all major relational databases as well as many non-relational ones.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Log-structured storage systems are the relatively newer of the two. Their main idea is to systematically turn every disk write into a sequential one, which yields higher write throughput given the performance characteristics of both hard drives and solid-state drives.&lt;/p&gt;

&lt;p&gt;These technical points are exactly why analytical OLAP tasks differ so significantly from OLTP tasks: when queries require sequentially scanning a large number of rows, indexes play no special role. Instead, it becomes far more important to encode the data very compactly, minimizing the volume read from disk.&lt;/p&gt;




&lt;p&gt;But we will talk about data encoding formats and a few other things next time :)&lt;br&gt;
Thanks for reading.&lt;/p&gt;

</description>
      <category>systemdesign</category>
    </item>
    <item>
      <title>HighLoad Saga. Part Two, Chapter 2: Data Storage and Processing Subsystems</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Sun, 25 Feb 2024 06:46:41 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/highload-saga-part-two-chapter-2-data-storage-and-processing-subsystems-5gh5</link>
      <guid>https://forem.com/m0n0x41d/highload-saga-part-two-chapter-2-data-storage-and-processing-subsystems-5gh5</guid>
      <description>&lt;p&gt;So, databases are data storage systems. Essentially, they should solve just two tasks: save the data received from the client and provide data in the future as a response to specific requests.&lt;/p&gt;

&lt;p&gt;A good question is - should an application developer know how the DBMS is structured internally? Specifically, how is data stored and how does the search for this data work?&lt;/p&gt;

&lt;p&gt;It sounds like a closed question, but I believe the correct answer goes something like this: by default, a developer &lt;em&gt;should not&lt;/em&gt; need to know how to implement a data storage subsystem from scratch (we are talking about us - JSON bricklayers - not system developers hired onto a project to implement a new ultra-fast DBMS, of course). However, the developer should understand the practical significance of different storage subsystems, their pros and cons, in order to choose the one that suits the project being developed.&lt;/p&gt;

&lt;p&gt;Well, in general, the short answer to this closed question then is a yes :)&lt;/p&gt;

&lt;p&gt;We need to have at least a rough, conceptual idea of the storage mechanisms' structure to select and optimally configure the DBMS for our load.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Data Structures in DBMS: Key-Value
&lt;/h2&gt;

&lt;p&gt;Typically, a database stores unique keys, each corresponding to some value. In general, the types of these values can vary.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;simplest&lt;/em&gt; form of implementing the key-value model is to write data in a file as lines that contain pairs of keys and values serialized into a text/string format.&lt;/p&gt;

&lt;p&gt;The performance of adding a new key-value pair in this case is high if keys are always generated as uniquely new (current timestamp, for example) and updates to existing records are not allowed.&lt;/p&gt;

&lt;p&gt;Under such conditions, we simply append new lines at the end of the file.&lt;/p&gt;
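&lt;p&gt;A toy version of such a "database" fits in a few lines of Python (the file path and record format here are arbitrary choices for the sketch):&lt;/p&gt;

```python
# Sketch: the simplest key-value "database" - a file we only append to.
# db_set appends; db_get scans the whole file and keeps the last match,
# so writes are O(1) but reads are O(n).
import os, tempfile

DB = os.path.join(tempfile.gettempdir(), "kv.log")  # illustrative path
open(DB, "w").close()  # start with an empty journal

def db_set(key, value):
    with open(DB, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    result = None
    with open(DB) as f:
        for line in f:                 # full scan: last write wins
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("user:1", "Alice")
db_set("user:1", "Bob")      # an "update" is just another append
print(db_get("user:1"))      # Bob
```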

&lt;p&gt;Many DBMSs use an essentially &lt;em&gt;similar&lt;/em&gt; mechanism (mainly for specific purposes) called &lt;em&gt;journaling&lt;/em&gt;: there is a file, called a journal (or log, or whatever), intended only for appending data at its end.&lt;/p&gt;

&lt;p&gt;Obviously, in practice, such logging is implemented in a more complex manner, as it requires managing concurrent access to the journal, controlling disk space, creating backups, handling errors, and correcting corrupted records. Nonetheless, the fundamental principle remains the same.&lt;/p&gt;

&lt;p&gt;To optimize the use of disk space, journals are usually divided into segments of a certain size. Once a segment reaches this size, its file is "closed", and subsequent data are recorded in a new segment.&lt;/p&gt;

&lt;p&gt;In addition to this, there often exists so-called &lt;em&gt;compaction&lt;/em&gt; technology for existing segments, which involves discarding duplicate keys from the journal and keeping only the latest version for each key.&lt;/p&gt;

&lt;p&gt;Many logging/journaling technologies implement similar mechanics of segmentation.&lt;/p&gt;
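&lt;p&gt;Compaction itself is conceptually tiny - a sketch in Python, representing a segment as a list of key-value pairs:&lt;/p&gt;

```python
# Sketch: compacting a log segment - drop duplicate keys, keep only the
# most recent value for each.
def compact(segment):
    latest = {}
    for key, value in segment:   # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())

segment = [("cat", 1), ("dog", 2), ("cat", 3), ("dog", 4), ("cat", 5)]
print(compact(segment))  # [('cat', 5), ('dog', 4)]
```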

&lt;h2&gt;
  
  
  Indexes
&lt;/h2&gt;

&lt;p&gt;So, we manage to write to the journal efficiently, but reading a value by key would work terribly inefficiently in the case of a large number of records in our journal.&lt;/p&gt;

&lt;p&gt;We end up with linear complexity &lt;em&gt;O(n)&lt;/em&gt;, because each time we would have to search through our entire "database" from start to finish looking for the required key.&lt;/p&gt;

&lt;p&gt;To ensure efficient searching for any key in our storage system, we need another data structure - an &lt;em&gt;index&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The main essence of an index is that it stores additional metadata, a kind of "pointers", which help to find the necessary records much faster.&lt;/p&gt;

&lt;p&gt;Typically, the ability to add and remove indexes in most DBMSs does not affect the content (actual data) of the database, but modifying indexes affects the &lt;em&gt;performance of queries&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Maintaining indexes leads to additional costs, especially when the database writes new values to disk, because indexes need to be updated each time to keep them up to date with the new data.&lt;/p&gt;

&lt;p&gt;In such a scenario, it is practically impossible to outperform the efficiency of simply appending data to the end of a file (since this is the simplest of all possible disk write operations).&lt;/p&gt;

&lt;p&gt;This is an important compromise in data storage systems: well-chosen indexes speed up read queries but slow down writing.&lt;/p&gt;

&lt;p&gt;This is why DBMSs do not index everything by default but offer developers the chance to choose indexes manually, based on their knowledge of query patterns typical for the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hash Indexes
&lt;/h2&gt;

&lt;p&gt;The key-value data model is somewhat "similar" to the dictionary (map, etc.) data structure.&lt;/p&gt;

&lt;p&gt;More precisely, dictionaries are usually implemented as hash maps (hash tables).&lt;/p&gt;

&lt;p&gt;Since hash maps are used for in-memory data structures, and such structures are extremely convenient, how about using hash maps for indexing data on disk?&lt;/p&gt;

&lt;p&gt;Imagine we are still dealing with our simplest storage, which operates solely by appending data to the end of a journal file.&lt;/p&gt;

&lt;p&gt;We start to build and keep a hash map in memory, where each key corresponds to a physical offset (relative address) in the data file - essentially a pointer to the specific location where the value is located.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;File operations in every OS allow positioning the cursor within a file with byte accuracy - the seek operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a result, when adding a new "key-value" pair to the file, our hash map is updated to reflect the address of the data just written. This implementation works efficiently both when inserting new keys and when updating the values of existing ones.&lt;/p&gt;
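&lt;p&gt;A minimal sketch of such a hash index in Python (using an in-memory buffer to stand in for the segment file):&lt;/p&gt;

```python
# Sketch: an in-memory hash index over the append-only log - each key
# maps to the byte offset where its latest value starts.
import io

log = io.BytesIO()     # stands in for the on-disk segment file
index = {}             # key -> byte offset of the latest record

def db_set(key, value):
    log.seek(0, io.SEEK_END)           # always append at the end
    index[key] = log.tell()            # remember where this record begins
    log.write(f"{key},{value}\n".encode())

def db_get(key):
    if key not in index:
        return None
    log.seek(index[key])               # jump straight to the record
    k, _, v = log.readline().decode().rstrip("\n").partition(",")
    return v

db_set("user:1", "Alice")
db_set("user:2", "Bob")
db_set("user:1", "Carol")              # index now points at the new record
print(db_get("user:1"))  # Carol
```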

&lt;p&gt;Just in case - hash maps can serve not only for indexing pointers to stored values but also for indexing any metric of interest.&lt;/p&gt;

&lt;p&gt;For example, a video URL might serve as a key, and the value could be the number of views, which increases each time someone requests playback.&lt;/p&gt;

&lt;p&gt;With this type of load, the number of write operations will be high, but there will not be too many unique keys. Therefore, in such a scenario, it is perfectly acceptable to store all keys and values in memory.&lt;/p&gt;

&lt;p&gt;This concept is implemented by in-memory databases, which entirely (or in large blocks) fit into RAM, achieving very high performance.&lt;/p&gt;

&lt;p&gt;To ensure their reliability, data from memory is regularly dumped as snapshots to a fast (hopefully) disk.&lt;/p&gt;

&lt;p&gt;So, we run into the fact that the hash table must fit into memory, and if there are &lt;em&gt;really&lt;/em&gt; a lot of keys in the database - we're in trouble.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSTables and LSM Trees
&lt;/h2&gt;

&lt;p&gt;Developing our scenario further, let's try to change the format of &lt;em&gt;segment&lt;/em&gt; files (remember the journal segments?).&lt;/p&gt;

&lt;p&gt;The new requirement is to sort the sequences of "key-value" pairs by key.&lt;/p&gt;

&lt;p&gt;The first thing that comes to mind is that this requirement will prevent us from writing sequentially to the file, and it seems that other problems related to working with the journal may arise.&lt;/p&gt;

&lt;p&gt;A valid concern, but first, let's consider the advantages we get.&lt;/p&gt;

&lt;p&gt;The new format we are describing is called &lt;em&gt;SS-table&lt;/em&gt;, or &lt;em&gt;"sorted string table."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The implementation of SS-tables also requires ensuring that each key is unique in the combined segment. We already talked about this at the beginning, mentioning that key uniqueness is achieved through a compaction process.&lt;/p&gt;

&lt;p&gt;In an SS-table, it's no longer necessary to keep an index of all keys in memory to find a specific key: because the keys are sorted, a sparse index with just one key every few kilobytes is enough - a short scan from the nearest indexed key finds the target quickly.&lt;/p&gt;

&lt;p&gt;Moreover, the compaction of segments is performed efficiently and simply, even if the file sizes exceed the available memory space (The approach is similar to that used in the merge sort algorithm).&lt;/p&gt;
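&lt;p&gt;A sketch of that merge in Python (segment contents invented): entries stream out in key order, and on duplicate keys the value from the newer segment wins.&lt;/p&gt;

```python
# Sketch: merging sorted segments, merge-sort style. Each entry is tagged
# with its segment's age so that, on equal keys, the newer value sorts
# last and overwrites the older one.
import heapq

older = [("apple", 1), ("cherry", 2), ("plum", 3)]
newer = [("apple", 9), ("banana", 8)]

def merge_segments(*segments):
    tagged = [[(k, age, v) for k, v in seg]
              for age, seg in enumerate(segments)]
    merged = {}
    for key, _, value in heapq.merge(*tagged):  # streams in key order
        merged[key] = value            # later (newer) duplicate overwrites
    return list(merged.items())

print(merge_segments(older, newer))
# [('apple', 9), ('banana', 8), ('cherry', 2), ('plum', 3)]
```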

&lt;p&gt;Sorted data structures can be stored on disk, but it's much easier to do this in memory - using tree-like data structures such as red-black or balanced binary trees.&lt;/p&gt;

&lt;p&gt;Using such structures, we can very quickly add keys in any sequence and then read them in the desired order.&lt;/p&gt;

&lt;p&gt;One significant and obvious problem remains - in the event of a fatal database failure, recently written data, which has not yet been written to disk, is &lt;em&gt;lost&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A common practice to combat this problem is to maintain a separate journal on disk, into which all new data being written is &lt;em&gt;added immediately&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Such a journal does not need to be sorted in some specific way, because its main and only purpose is to restore the data structure in memory when the database starts up after a failure.&lt;/p&gt;




&lt;p&gt;Storage subsystems based on the principle of merging and compacting sorted files are often called &lt;em&gt;LSM storage subsystems&lt;/em&gt;, derived from Log-Structured Merge-Tree, or &lt;em&gt;LSM-Tree&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;An LSM-tree-based algorithm may be slow when searching for missing keys: it must first scan the in-memory structure, then every segment back to the oldest, reading each of them from disk, before it can confirm that the key is absent.&lt;/p&gt;

&lt;p&gt;To optimize such access, storage subsystems often employ a &lt;em&gt;Bloom filter&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A Bloom filter is a probabilistic data structure that provides an efficient way to test whether an element is a member of a set. It can quickly tell if a key is definitely not in the database or might be, with a small chance of false positives. The beauty of a Bloom filter lies in its ability to perform these checks in nearly constant time, significantly reducing the need for disk reads when searching for keys that do not exist in the database.&lt;/p&gt;
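&lt;p&gt;A toy Bloom filter is easy to sketch in Python (the size and number of hash functions here are arbitrary; real implementations tune them to the expected number of keys):&lt;/p&gt;

```python
# Sketch: a tiny Bloom filter - k hash functions set k bits per key.
# "Maybe present" can be a false positive; "absent" is always correct.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0                  # bit array packed into one int

    def _positions(self, key):
        for i in range(self.hashes):   # derive k positions per key
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:1")
print(bf.might_contain("user:1"))   # True
print(bf.might_contain("user:2"))   # almost certainly False
```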

&lt;h2&gt;
  
  
  B-Trees
&lt;/h2&gt;

&lt;p&gt;Although journaled indexes are a popular technology, they are not the most common type of indexes.&lt;/p&gt;

&lt;p&gt;The most widely used index structure is the &lt;em&gt;B-tree&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;B-trees were introduced 50 years ago and remain the default implementation of indexes in nearly all relational database management systems to this day.&lt;/p&gt;

&lt;p&gt;Many non-relational databases also use B-trees for indexing.&lt;/p&gt;

&lt;p&gt;B-trees have one thing in common with SSTables: they store key-value pairs sorted by key, allowing for efficient key lookups and range queries.&lt;/p&gt;

&lt;p&gt;While journaled indexes divide the database into segments of variable size and always write them to disk sequentially, B-trees divide the DB into blocks or pages of a fixed size, typically 4 kilobytes, reading or writing one page at a time.&lt;/p&gt;

&lt;p&gt;This structure is better suited to low-level hardware since disks in the file system are also divided into blocks of a fixed size.&lt;/p&gt;

&lt;p&gt;Each page has an address, allowing pages to refer to other pages.&lt;/p&gt;

&lt;p&gt;These references form a tree of pages. One of the pages is designated as the root of the B-tree - it is the starting point for any key search in the index.&lt;/p&gt;

&lt;p&gt;The root page contains several keys and links to child pages. Each of the child pages corresponds to a continuous range of keys, and special keys, located between pointers, indicate the boundaries of these ranges.&lt;/p&gt;
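&lt;p&gt;The lookup walk can be sketched in Python (a simplified two-level tree; real pages hold hundreds of keys and live on disk):&lt;/p&gt;

```python
# Sketch: key lookup in a B-tree. Boundary keys on an internal page pick
# the child whose range covers the key; leaf pages hold the values.
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # boundary keys (internal) or stored keys (leaf)
        self.children = children  # len(keys) + 1 child pages, or None for a leaf
        self.values = values

def search(page, key):
    while page.children is not None:           # descend internal pages
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)     # binary search in the leaf
    if i != len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page([100, 150], values=["a", "b"])
leaf2 = Page([200, 300], values=["c", "d"])
root = Page([200], children=[leaf1, leaf2])    # keys below 200 go left

print(search(root, 150))  # b
print(search(root, 250))  # None
```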

&lt;p&gt;The number of links to child pages on a specific B-tree page is called the &lt;em&gt;branching factor&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The algorithm ensures that the tree remains balanced - the depth of a B-tree with &lt;em&gt;n&lt;/em&gt; keys will be &lt;em&gt;O(log n)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For most databases, trees of three or four levels deep are sufficient, so the DBMS does not have to follow many page references to find the required one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 terabytes of information.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;The basic write operation of a B-tree involves rewriting its page on disk with new data.&lt;/p&gt;

&lt;p&gt;It is assumed that such rewriting does not change the page's location in the tree, meaning all references to it remain unchanged.&lt;/p&gt;

&lt;p&gt;This is a significant difference from journaled indexes, for example, from LSM-trees, where files are only appended to, and outdated files are gradually deleted in the compaction process, but existing files are not modified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Challenges
&lt;/h2&gt;

&lt;p&gt;In-place page updating involves certain complexities - when multiple threads access a B-tree simultaneously, careful management of concurrent access is required; otherwise, a thread may traverse the tree in an inconsistent state.&lt;/p&gt;

&lt;p&gt;The mechanism for correct concurrent access is usually implemented using &lt;em&gt;latches&lt;/em&gt; - a lightweight type of lock.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key difference between latches and locks lies in their scope and duration. &lt;strong&gt;Locks&lt;/strong&gt; are typically used to manage access to database data at a higher level, such as rows or tables, and can be held for longer periods, such as the duration of a transaction. They are primarily concerned with ensuring data integrity across concurrent transactions. &lt;strong&gt;Latches&lt;/strong&gt;, on the other hand, are used at a lower level to protect specific data structures in memory and are held for very short durations. Their main purpose is to ensure the consistency and integrity of these structures during concurrent access, rather than managing broader transactional data access.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The journaled approach is simpler in this sense, as merging occurs in the background, without interfering with incoming requests, and with periodic atomic replacement of old segments with new ones.&lt;/p&gt;

&lt;p&gt;But we are not always dealing with the rewriting of only one page.&lt;/p&gt;

&lt;p&gt;As pages are limited to a small size, values exceeding this size are split across multiple pages. Consequently, when updating such a value (or inserting a new, large value), we need to update/write more than one page, as well as rewrite their parent page to update links to all child pages.&lt;/p&gt;

&lt;p&gt;In the event of a database failure at the moment when only part of the page is written, we end up with a corrupted index. A typical reason is the appearance of an orphan page, which has no parent because the process of updating links was not completed during the failure.&lt;/p&gt;

&lt;p&gt;To ensure DB resilience, implementations usually include an additional data structure on disk - a &lt;em&gt;write-ahead log&lt;/em&gt;, also known as a redo log.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Journals, here we go again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This log is a file or series of files intended only for appending information, where all modifications to B-trees (and modifications to &lt;em&gt;other&lt;/em&gt; DBMS pages that have not yet been written to disk - i.e., stored in the DBMS buffers) are recorded before being applied to the pages themselves.&lt;/p&gt;

&lt;p&gt;When the database recovers from a failure, this log is used to restore and bring the B-tree to a consistent state.&lt;/p&gt;
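&lt;p&gt;The principle can be sketched in a few lines of Python (the pages and records here are placeholders): log first, apply second, replay on recovery.&lt;/p&gt;

```python
# Sketch: a write-ahead log - every change is appended to the log before
# being applied to pages; after a crash, replaying the log rebuilds state.
def apply(pages, record):
    page_id, data = record
    pages[page_id] = data

wal = []
pages = {}

def write(page_id, data):
    wal.append((page_id, data))    # 1. log the change first
    apply(pages, (page_id, data))  # 2. then modify the page in place

write(1, "root")
write(2, "leaf")

# Simulated crash: in-memory / partially flushed pages are lost...
pages = {}
for record in wal:                 # ...but replaying the log restores them
    apply(pages, record)

print(pages)  # {1: 'root', 2: 'leaf'}
```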

&lt;p&gt;Some DBMSs also apply a copy-on-write scheme: the modified page is written to a different location, and new versions of the parent pages are created pointing to this new location. This approach is also useful for managing concurrent access.&lt;/p&gt;

&lt;p&gt;For optimization purposes, space on pages is sometimes saved by storing not the entire key but only its shortened version.&lt;/p&gt;

&lt;p&gt;Additional pointers can also be added to the tree. For example, each leaf page may link to its left and right pages at the same level, allowing for an orderly traversal of keys without returning to parent pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros and Cons of Different Approaches
&lt;/h2&gt;

&lt;p&gt;Generally, LSM-trees usually perform better in writing, while B-trees excel in reading.&lt;/p&gt;

&lt;p&gt;Reading in LSM-trees is slower because it involves scanning several different data structures and SSTables at various stages of compaction.&lt;/p&gt;

&lt;p&gt;A B-tree-based index must write each piece of data several times, at least twice - once in the write-ahead log and then onto the tree page itself.&lt;/p&gt;

&lt;p&gt;Let's not forget about page splitting...&lt;/p&gt;

&lt;p&gt;Journaled indexes also rewrite data several times due to repeated compaction and merging of SSTables, but clearly, these operations do not occur with every new entry write.&lt;/p&gt;

&lt;p&gt;The effect where one database write operation leads to multiple disk write operations is known as &lt;em&gt;write amplification&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we have our server in a closet, this factor is very important to consider, especially when using SSDs, which can only rewrite blocks a limited number of times before they wear out :)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In applications requiring large volumes of writing, the bottleneck may be the speed at which the database writes data to disk. In this case, write amplification directly impacts performance - the more the storage subsystem writes to disk, the fewer write operations per second it can perform within the available disk bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Index Structures
&lt;/h2&gt;

&lt;p&gt;Key-value indexes are very similar to the primary key index in the relational model.&lt;/p&gt;

&lt;p&gt;A primary key uniquely identifies a single row in a relational table, one document in a document-oriented database, or one vertex in a graph database.&lt;/p&gt;

&lt;p&gt;In relational databases, it's possible to create multiple secondary indexes on one table using the &lt;strong&gt;CREATE INDEX&lt;/strong&gt; command. Such indexes are often critically important for the efficient execution of &lt;strong&gt;JOIN&lt;/strong&gt;s.&lt;/p&gt;

&lt;p&gt;The main difference from the primary key is that secondary keys are &lt;em&gt;not unique by default&lt;/em&gt; - meaning that several rows can share the same key.&lt;/p&gt;

&lt;p&gt;This is easily solved by making every key unique - for example, by appending a row identifier to it.&lt;/p&gt;

&lt;p&gt;Both B-trees and journaled indexes can be used as secondary indexes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Composite Indexes
&lt;/h3&gt;

&lt;p&gt;We previously discussed indexes that map a single key to a value.&lt;/p&gt;

&lt;p&gt;However, they are not sufficient in cases where a query involves multiple table columns or document fields simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Composite indexes&lt;/em&gt; are used to optimize such queries.&lt;/p&gt;

&lt;p&gt;The most common type of composite index is the concatenated index, which simply combines several fields into one key by attaching one column to another.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When creating an index, its description specifies the order in which the fields are concatenated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose we have a Books table containing the &lt;em&gt;Author&lt;/em&gt;, &lt;em&gt;Title&lt;/em&gt;, and &lt;em&gt;Year&lt;/em&gt; columns. We want to optimize queries that filter books by both author and publication year simultaneously.&lt;/p&gt;

&lt;p&gt;In this case, we can create a composite index that concatenates the &lt;em&gt;Author&lt;/em&gt; and &lt;em&gt;Year&lt;/em&gt; fields into one key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_author_year&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;Books&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Year&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When creating such an index, we specify that the &lt;em&gt;Author&lt;/em&gt; field comes first, followed by the &lt;em&gt;Year&lt;/em&gt;. This index will be effective for queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Books&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Stephen King'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;Year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1994&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if we want to find all books published in 1837 without specifying the author, our index will be less useful because the search starts with the &lt;em&gt;Author&lt;/em&gt; field, which is not specified in this case.&lt;/p&gt;

&lt;p&gt;In such cases, the DBMS query planner might choose another index or even opt for scanning the entire table as the search algorithm, which could be less efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multidimensional Indexes
&lt;/h3&gt;

&lt;p&gt;Multidimensional indexes are another way to query multiple columns at once, especially important for working with geospatial data.&lt;/p&gt;

&lt;p&gt;For example, a mapping service might use a database with the latitude and longitude coordinates of all stored objects (e.g., restaurants, barbershops, schools, etc.).&lt;/p&gt;

&lt;p&gt;When processing a query, it needs to find all objects located within a rectangular area on the map (the frontend determines which area of the map the user has selected).&lt;/p&gt;

&lt;p&gt;Then, two-dimensional queries, for example, for searching schools, might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;schools&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7128&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7484&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
      &lt;span class="n"&gt;longitude&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;74&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0060&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9732&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard indexes based on B-trees or LSM-trees cannot provide efficient execution of such a query. They would efficiently return either all schools within a certain latitude range with arbitrary longitude or all schools within a certain longitude range but at any point between the North and South poles - and cannot do both simultaneously.&lt;/p&gt;

&lt;p&gt;To support geospatial queries, specialized spatial indexes are used, such as &lt;em&gt;R-trees&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Multidimensional indexes can be used not only for geographical coordinates but also for optimizing any multidimensional queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Searching for products by color (RGB): to find products within a specific range of colors;&lt;/li&gt;
&lt;li&gt;Movie or book recommendation systems: Multidimensional indexes can be used to optimize queries in recommendation systems, where each item (e.g., movie or book) is rated on several criteria, such as genre, release year, and rating;&lt;/li&gt;
&lt;li&gt;Warehouse inventory management: In databases tracking warehouse inventories, multidimensional indexes can be used to optimize queries on multiple attributes simultaneously, such as product type, batch size, and expiration date.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full-Text Search and Fuzzy Indexes
&lt;/h3&gt;

&lt;p&gt;The indexes we've discussed are excellent for searching for precise values within specific ranges.&lt;/p&gt;

&lt;p&gt;However, they lack the capability to search for keys that may contain errors or resemble each other — for instance, words with spelling errors. To manage such fuzzy queries, alternative approaches are necessary.&lt;/p&gt;

&lt;p&gt;Full-text search systems typically offer the ability to extend a word search with synonyms, disregard grammatical forms of a word, search for words located near each other in the text, and support other functions based on linguistic analysis. They employ a range of techniques, from edit-distance algorithms such as Levenshtein distance, for finding words within a certain number of edits, to machine learning methods.&lt;/p&gt;
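&lt;p&gt;For a flavor of the edit-distance part, here is the classic dynamic-programming computation of Levenshtein distance in Python:&lt;/p&gt;

```python
# Sketch: Levenshtein edit distance - the number of single-character
# insertions, deletions, and substitutions separating two words. Fuzzy
# indexes use it to match misspelled terms within some maximum distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,     # delete from a
                           cur[j - 1] + 1,  # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

print(levenshtein("recieve", "receive"))  # 2: the swapped pair costs two edits
```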

&lt;h2&gt;
  
  
  In-Memory Databases
&lt;/h2&gt;

&lt;p&gt;In many scenarios, data sets are relatively small, making it feasible to store them entirely in memory.&lt;/p&gt;

&lt;p&gt;This concept led to the development of in-memory databases, aptly named for their storage method.&lt;/p&gt;

&lt;p&gt;Some key-value stores, like Memcached, are intended for caching only, where data loss on a machine reboot is acceptable.&lt;/p&gt;

&lt;p&gt;However, other in-memory databases aim to ensure full data persistence through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing change logs to disk,&lt;/li&gt;
&lt;li&gt;Periodically saving state snapshots to disk,&lt;/li&gt;
&lt;li&gt;Replicating the memory state to other machines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, employing uninterruptible power supplies or NVDIMM memory, which retains data through server restarts, is encouraged.&lt;/p&gt;

&lt;p&gt;Paradoxically, the fact that entirely in-memory-operated databases do not need to read data from disk &lt;strong&gt;does not always give them a significant performance edge&lt;/strong&gt; over traditional database management systems.&lt;/p&gt;

&lt;p&gt;Modern data storage systems and operating systems are capable of caching recently used data in memory, thereby eliminating the need to read it from disk if the server's memory capacity is sufficiently large for such a DBMS and if most operations involve reading data.&lt;/p&gt;

&lt;p&gt;In-memory databases may offer increased speed by reducing the overhead associated with serializing data structures for disk storage.&lt;/p&gt;

&lt;p&gt;Moreover, in-memory DBs present unique advantages in terms of implementing data models that are challenging or inefficient to execute using disk-based indexes.&lt;/p&gt;

&lt;p&gt;For instance, Redis provides interfaces for various data structures, like priority queues and sets, making it resemble a database. All information is kept in memory, streamlining the implementation of such interfaces.&lt;/p&gt;

&lt;p&gt;The architecture of in-memory databases can also scale relatively easily to support data sets larger than the available memory, avoiding the overhead typical of architectures that &lt;em&gt;actively&lt;/em&gt; rely on disk writing.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;anti-caching&lt;/em&gt; technique permits the removal of infrequently used data from memory to disk, with the possibility of reloading it when needed. This method is similar to how operating systems manage virtual memory and swap files, but with more precise control at the level of individual records, allowing databases to manage memory more efficiently than the operating system. Nevertheless, for this strategy to be effective, it's essential for the indexes to fit entirely in memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming Up Next
&lt;/h2&gt;

&lt;p&gt;In this post, we've explored various index structures, delving into their essential nature and the impact of their inner workings. This knowledge should help us better understand data storage options, enabling us to select and optimize storage solutions that bring the most benefit to our projects. However, our exploration doesn't end here. Stay tuned, as next time we will take a look at transaction processing and analytics.&lt;/p&gt;




&lt;p&gt;Cover photo by &lt;a href="https://www.pexels.com/photo/magnifying-glass-on-book-4494642/" rel="noopener noreferrer"&gt;Nothing Ahead&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
    </item>
    <item>
      <title>HighLoad Saga. Part Two, Chapter 1: Storing the Data</title>
      <dc:creator>Ivan Zakutnii</dc:creator>
      <pubDate>Fri, 09 Feb 2024 17:53:52 +0000</pubDate>
      <link>https://forem.com/m0n0x41d/highload-saga-part-two-chapter-1-storing-the-data-hf2</link>
      <guid>https://forem.com/m0n0x41d/highload-saga-part-two-chapter-1-storing-the-data-hf2</guid>
      <description>&lt;p&gt;Modern tools and technologies are no longer fit in classical DBMS categoris.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A DBMS is a database management system: software like PostgreSQL and so on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These technologies, optimized for broad scenarios, continue to blur traditional boundaries.&lt;/p&gt;

&lt;p&gt;For instance, in-memory key-value systems like Redis can surprisingly serve as message queues, while systems like Kafka offer message queues with added storage reliability.&lt;/p&gt;

&lt;p&gt;Addressing complex System Design challenges involves decomposing them into smaller problems and applying a wide range of tools.&lt;/p&gt;

&lt;p&gt;This includes using additional layers like Memcached for caching or Elasticsearch and Sphinx for full-text searching, which operate separately from DBMS.&lt;/p&gt;

&lt;p&gt;Synchronizing these caches is crucial to ensure data consistency and non-contradictory results for users, and making it so is the responsibility of our software.&lt;/p&gt;




&lt;p&gt;Modern apps often build on a layered data model approach, each layer presenting its data to be correctly understood by the next.&lt;/p&gt;

&lt;p&gt;This complexity increases as systems scale, despite abundant resources on relational data modeling.&lt;/p&gt;

&lt;p&gt;Since the choice of data model greatly influences the functionality of the software built on it, it is extremely important to choose a model that is suitable for your specific task.&lt;/p&gt;

&lt;p&gt;Yet, the industry frequently defaults to relational models and SQL, and I wonder: is this approach always appropriate?&lt;/p&gt;

&lt;p&gt;Let's take a look at some approaches focused on data storage and query execution; perhaps we will find the answer to this question.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL data model
&lt;/h2&gt;

&lt;p&gt;The well-known SQL data model, created in 1970 by Edgar Codd, organizes data into tables.&lt;/p&gt;

&lt;p&gt;However, it faces an &lt;em&gt;object-relational mismatch&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Almost every mainstream framework, especially in web development, offers an object-oriented approach, which implies mapping the structure of objects to the fields of relational database tables.&lt;/p&gt;

&lt;p&gt;As a result, we have a clumsy intermediate layer between objects and the database model which is pretty hard to formalize well enough.&lt;/p&gt;

&lt;p&gt;This disjunction between fundamentally different computational models is also referred to as &lt;em&gt;impedance mismatch.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Impedance mismatch is only partially addressed by ORM tools.&lt;br&gt;
They try to eliminate the differences between the two models but cannot fully do so, introducing more complexity and inefficiency instead.&lt;/p&gt;
&lt;h2&gt;
  
  
  Oh, hello there - NoSQL!
&lt;/h2&gt;

&lt;p&gt;That is the point where NoSQL databases come onto the stage.&lt;/p&gt;

&lt;p&gt;What is NoSQL? Well, these are database management systems that store unstructured &lt;em&gt;documents&lt;/em&gt;. That is why such databases are called document-oriented.&lt;/p&gt;

&lt;p&gt;JSON representations have better &lt;em&gt;locality&lt;/em&gt; compared to a multi-table schema.&lt;/p&gt;

&lt;p&gt;Locality means that all related data for some record is stored in the same document; it is all in one place.&lt;/p&gt;

&lt;p&gt;Whereas, to fetch all the needed data about some entity from a multi-table schema, we would probably be forced to make several queries or come up with complicated "all-side" joins of that entity's table with all related tables.&lt;/p&gt;



&lt;p&gt;For example, "one-to-many" relations of a user profile with all its other related, detailed characteristics imply a &lt;em&gt;tree-like&lt;/em&gt; data structure, and the JSON representation makes this structure explicit and clear.&lt;/p&gt;

&lt;p&gt;While storing all the related data in one JSON document might be handy, there is a significant drawback: &lt;em&gt;data duplication&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In a JSON document-oriented database, you might store each employee's information in a single document like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"employeeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"E123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pepe the Frog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Platform Engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Backend Developer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pepe.feels-good@froggie.croack"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"555-1234-666-99"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"projects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"projectId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"P1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"projectName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Project Alpha"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"projectId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"P2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"projectName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Project Beta"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results in massive data duplication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The details of "Project Alpha" and "Project Beta" are replicated in full for each employee involved, rather than being stored once and referenced many times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If project details change (e.g., the name or deadline), they must be updated in every employee document that includes that project, increasing the risk of inconsistent data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This duplication can lead to increased storage requirements and slower queries as the amount of redundant information grows, especially in larger organizations with many employees working on the same projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The idea of eliminating duplication lies at the heart of the database &lt;em&gt;normalization concept&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But data normalization often requires organizing &lt;em&gt;"many-to-one"&lt;/em&gt; relationships, which fit poorly into the document model.&lt;/p&gt;

&lt;p&gt;In relational databases, it is considered normal to refer to rows in other tables by a unique identifier, since performing joins is not a problem.&lt;/p&gt;

&lt;p&gt;In document-oriented databases, tree-like structures do not need joins, and if support for joins exists, it is often very weak.&lt;/p&gt;

&lt;p&gt;When the database management system does not support joins, they must be performed in the application code through a set of database queries.&lt;/p&gt;
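&lt;p&gt;An application-side join might look like this sketch in Python (the data mirrors the hypothetical employee/project example above; two separate "queries" return plain documents, and the join happens in our own code):&lt;/p&gt;

```python
# Results of two separate queries against a join-less document store.
employees = [
    {"employeeId": "E123", "name": "Pepe the Frog", "projectIds": ["P1", "P2"]},
]
projects = [
    {"projectId": "P1", "projectName": "Project Alpha"},
    {"projectId": "P2", "projectName": "Project Beta"},
]

# Build a lookup table once, then resolve every reference by hand.
by_id = {p["projectId"]: p for p in projects}
for emp in employees:
    emp["projects"] = [by_id[pid] for pid in emp["projectIds"]]

print(employees[0]["projects"][0]["projectName"])  # Project Alpha
```

&lt;p&gt;Every such hand-written join is logic a relational DBMS would otherwise execute (and optimize) for us.&lt;/p&gt;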




&lt;p&gt;There is another relation type: &lt;em&gt;"many-to-many"&lt;/em&gt;. Such relations are a regular case in relational databases, but in NoSQL there has been an endless discussion on how best to represent "many-to-many" relationships.&lt;/p&gt;

&lt;p&gt;Essentially, in the scope of this question we are going back to the very first days of digital database systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  It is CO-DA-SYL, Harry!
&lt;/h2&gt;

&lt;p&gt;Back in the 1960s there was a hierarchical DBMS called &lt;em&gt;IBM IMS&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It was pretty good with "one-to-many" relationships, just like modern document-oriented DBs, but "many-to-many" was not an option for IBM IMS at all, and it did not support joins either.&lt;/p&gt;

&lt;p&gt;There was a huge volume of denormalized, duplicated data, and developers spent a lot of time trying to keep that data up to date.&lt;/p&gt;

&lt;p&gt;To get rid of the hierarchical model's limitations, a few solutions were proposed. The two best known are the relational model (later formed into SQL) and the &lt;em&gt;network model&lt;/em&gt;, CODASYL.&lt;/p&gt;

&lt;p&gt;The CODASYL model became a generalization of the hierarchical model, in whose tree-like structure each record had exactly one parent record.&lt;/p&gt;

&lt;p&gt;In the network model, each record could have multiple parents. This made it possible to model "many-to-one" and "many-to-many" relationships.&lt;/p&gt;

&lt;p&gt;The "links" between records in the network model was not like relational model's foriegn keys, but more something more like a pointers in programming languges.&lt;/p&gt;

&lt;p&gt;The only way to query a record was to traverse from the root record all the way through these links.&lt;/p&gt;

&lt;p&gt;In the trivial scenario this was the same as traversing a linked list: with O(N) complexity we go through the nodes until we find the one we are looking for.&lt;/p&gt;

&lt;p&gt;But CODASYL was designed for more than just simple scenarios. As mentioned, it accommodated "many-to-many" relationships, meaning there were multiple "routes" leading to a single record.&lt;/p&gt;

&lt;p&gt;Developers working with the network model had to be mindful of these complex access paths, as modifying the database's schema could easily disrupt them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst83w17kdeyubl1rxp5h.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst83w17kdeyubl1rxp5h.gif" alt="mindblow" width="220" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Eventually, members of the CODASYL committee acknowledged that navigating through an n-dimensional data space has a very high algorithmic complexity.&lt;/p&gt;

&lt;p&gt;The relational model, on the other hand, exposed all the data as it was: a table is just a set of tuples, with no complex "labyrinths" to pass through in order to access the required data.&lt;/p&gt;

&lt;h2&gt;
  
  
  But here we go again...
&lt;/h2&gt;

&lt;p&gt;Despite historical challenges with hierarchical and network models, NoSQL databases are gaining more and more popularity.&lt;/p&gt;

&lt;p&gt;Why is that?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We need to scale, and scale &lt;em&gt;fast!&lt;/em&gt; Much quicker than relational databases are capable of. We need to process very large data sets or have very high write throughput;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We might need to make specialized queries that are poorly supported by the relational model;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We are disappointed by the limitations of relational schemas and we want more dynamic and expressive data models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Document-oriented databases are now going back to the hierarchical model, primarily in terms of storing nested records: "one-to-many" relations are stored in the &lt;em&gt;parent&lt;/em&gt; record, without organizing any separate "table".&lt;/p&gt;

&lt;p&gt;But if we want to model "many-to-many" or "many-to-one" relations, well... relational and document-oriented databases &lt;em&gt;do not differ significantly&lt;/em&gt; in their approaches.&lt;/p&gt;

&lt;p&gt;In both of these scenarios, the query for the required element is performed with the help of a unique identifier, which is a &lt;em&gt;"foreign key"&lt;/em&gt; in the relational model and a &lt;em&gt;"document reference"&lt;/em&gt; in the document model.&lt;/p&gt;

&lt;p&gt;This identifier is resolved at read time through joins or additional queries.&lt;/p&gt;

&lt;p&gt;Document-oriented databases are not following the path laid down by CODASYL, and &lt;em&gt;generally&lt;/em&gt;, I guess that is fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document versus Relation
&lt;/h2&gt;

&lt;p&gt;If our application's data structure is document-like (i.e., it represents a tree of "one-to-many" relations, usually loaded all at once), using a document model is likely a good choice.&lt;/p&gt;

&lt;p&gt;The relational model enhances productivity for "many-to-one" and "many-to-many" relationships, albeit with some theoretical limitations. For instance, it requires shredding—breaking down the document-like structure into multiple tables.&lt;/p&gt;

&lt;p&gt;When choosing a document-oriented database, consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Direct references to nested elements within a document are not possible in the document model. This requires encoding paths in a way reminiscent of the hierarchical model, which usually isn't a problem if document nesting is limited to a few levels (2-3).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lack of robust support for joins in document-oriented databases may or may not be an issue, depending on the project. For instance, "many-to-many" relationships may not be necessary for a content management system where articles are stored with embedded comments. But this limitation becomes significant for something like social networking apps, where complex "many-to-many" relationships between users and groups are essential.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reducing the number of necessary joins through denormalization is possible, but it requires extra maintenance work to keep the denormalized data consistent.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Database Schemas
&lt;/h2&gt;

&lt;p&gt;Most document-oriented databases, as well as relational databases in scenarios using JSON or XML records, do not enforce a specific schema to define a unified data structure within documents, unlike the strict structure defined for tables in the classical relational data model.&lt;/p&gt;

&lt;p&gt;The absence of a schema allows us to insert any keys and values into a document, but when reading these documents, we have no guarantees or annotations explaining what fields are present in them.&lt;/p&gt;

&lt;p&gt;Thus, document-oriented databases are often referred to as unstructured, schemaless, among other terms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3i7f0h8wu7uxyc3wj6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3i7f0h8wu7uxyc3wj6i.png" alt="Joking" width="800" height="1022"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such schema-less approach is called &lt;em&gt;schema-on-read&lt;/em&gt;. It implies that the data structure is implicit, and data interpretation occurs when records are read from the database.&lt;/p&gt;

&lt;p&gt;Conversely, &lt;em&gt;schema-on-write&lt;/em&gt; is the traditional approach of relational databases, always having a clearly defined, guaranteed data schema that all written data must match.&lt;/p&gt;

&lt;p&gt;Schema-on-read is similar to dynamic type checking in programming languages, whereas schema-on-write resembles static type checking.&lt;/p&gt;
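&lt;p&gt;In code, the analogy looks like this sketch (the field names are made up for illustration): the store accepts any document, and the reader imposes its expectations only at read time:&lt;/p&gt;

```python
# Schema-on-read: the database happily stored both documents;
# only the reader has a "schema", like dynamic type checking.
def read_user(doc: dict) -> dict:
    # Interpretation happens at read time: missing fields get defaults,
    # unknown fields are simply ignored.
    return {"name": doc.get("name", "anonymous"), "email": doc.get("email")}

old_doc = {"name": "Pepe"}  # written before the "email" field existed
new_doc = {"name": "Pepe", "email": "pepe@froggie.croack", "extra": 42}

print(read_user(old_doc))  # {'name': 'Pepe', 'email': None}
print(read_user(new_doc))  # {'name': 'Pepe', 'email': 'pepe@froggie.croack'}
```

&lt;p&gt;With schema-on-write, the equivalent change would instead be an explicit migration (an &lt;code&gt;ALTER TABLE&lt;/code&gt;) before any new-style row could be written.&lt;/p&gt;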

&lt;p&gt;There's no definitive answer to this "holy war" question of which approach is better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Locality and Queries
&lt;/h2&gt;

&lt;p&gt;Documents are stored in the database as continuous strings, serialized into JSON, XML, or their binary variants (BSON in MongoDB, or JSONB in PostgreSQL).&lt;/p&gt;

&lt;p&gt;If an application frequently needs access to the entire document, data locality offers advantages: a single query can retrieve the whole hierarchical structure.&lt;/p&gt;

&lt;p&gt;If data is spread across multiple tables, multiple queries and index searches are needed to fully extract them, requiring more disk operations and taking more time.&lt;/p&gt;

&lt;p&gt;Data locality is only advantageous when large parts of a document are needed at once because the database has to load the entire document, even if only a part of this document is needed.&lt;/p&gt;

&lt;p&gt;This is inefficient.&lt;/p&gt;

&lt;p&gt;Updating a document almost always requires rewriting it entirely.&lt;/p&gt;

&lt;p&gt;Therefore, it's generally recommended to minimize document size and avoid write operations that increase size when working with document-oriented databases.&lt;/p&gt;

&lt;p&gt;These performance limitations significantly narrow the scenarios where document-oriented databases are more useful than relational ones.&lt;/p&gt;

&lt;p&gt;Because of this, relational and document-oriented databases are becoming more similar.&lt;/p&gt;

&lt;p&gt;It seems the future lies in hybrids of relational and document models.&lt;/p&gt;

&lt;p&gt;PostgreSQL, in particular, handles JSON brilliantly &lt;em&gt;already&lt;/em&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Methods - SQL
&lt;/h2&gt;

&lt;p&gt;SQL is a declarative language that was a breakthrough for the relational model, while IMS and CODASYL used imperative code for their queries.&lt;/p&gt;

&lt;p&gt;In declarative query languages, a "pattern" of expected data must be described. We set conditions for the resulting data and, possibly, conditions on how this data must be transformed (sorted, for example).&lt;/p&gt;

&lt;p&gt;SQL does not care in the slightest how the results are achieved.&lt;/p&gt;

&lt;p&gt;The decision on which indexes and join methods to use, and in what specific order to execute parsed parts of a query, is made only by the database management system's &lt;em&gt;query optimizer&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Thus, declarative languages are well-suited for parallel execution as they define only a results pattern, and not the algorithm for obtaining them.&lt;/p&gt;
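&lt;p&gt;This is easy to see with &lt;code&gt;sqlite3&lt;/code&gt; from Python's standard library (the table and data here are invented for the example): the query states only the pattern of the result, and the planner alone decides whether and how to use the index:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "alice", 30), (2, "bob", 25), (3, "carol", 35)])
conn.execute("CREATE INDEX idx_age ON users(age)")

# Declarative: we describe WHAT we want (a pattern plus a sort order),
# not HOW to find it; the query optimizer picks the access path.
rows = conn.execute(
    "SELECT name FROM users WHERE age > 26 ORDER BY age DESC"
).fetchall()
print(rows)  # [('carol',), ('alice',)]

# The chosen plan can be inspected, but we never wrote it ourselves:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE age > 26"
).fetchall()
```

&lt;p&gt;Compare this with CODASYL, where the programmer hand-coded the traversal itself, and a schema change could invalidate the whole access path.&lt;/p&gt;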

&lt;h2&gt;
  
  
  Graph Data Models
&lt;/h2&gt;

&lt;p&gt;Through our exploration, we've seen that hierarchical models don't fit well with "many-to-many" relationships, and relational models only handle the basics of these interactions. As data connections become more complex, using a graph to model this information feels more intuitive and efficient.&lt;/p&gt;

&lt;p&gt;In a graph, there are two primary elements: vertices (also known as nodes or entities) and edges (also referred to as relationships or arcs).&lt;/p&gt;

&lt;p&gt;Graphs can shape various data types, not just those that are similar. They are powerful in combining different kinds of object data within one storage solution.&lt;/p&gt;

&lt;p&gt;Take Meta as an example, which uses a unified graph structure filled with diverse vertices and edges. Vertices symbolize individuals, places, occasions, system entries, and comments made by users. Edges reveal connections like friendships, specific login locations, comment authorship, event attendance, and more.&lt;/p&gt;

&lt;p&gt;Think of a graph database as composed of two relational tables—one for vertices and another for edges. Properties of vertices and edges might be stored using data types like json. Information on the starting (head_vertex) and ending (tail_vertex) points of edges is kept, making it possible to swiftly access all incoming or outgoing edges related to a particular vertex by querying the respective fields in the edges table.&lt;/p&gt;

&lt;p&gt;Key features of this model include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Any vertex can link to another through an edge without restrictions on the types of connections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's possible to identify both incoming and outgoing edges for any vertex, allowing for navigation through the graph in both directions. This is why indexes are placed on both the tail_vertex and head_vertex columns in the edges table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Utilizing varied labels for different relationship types enables the storage of diverse information in a single graph, keeping the model's structure clean and organized.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
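&lt;p&gt;The two-table layout described above can be sketched directly in Python (the vertices, names, and labels are made up for illustration):&lt;/p&gt;

```python
# One "table" of vertices, one "table" of edges with head/tail references.
vertices = {
    1: {"type": "person", "name": "Lucy"},
    2: {"type": "place", "name": "London"},
    3: {"type": "person", "name": "Alain"},
}
edges = [
    {"tail_vertex": 1, "head_vertex": 2, "label": "lives_in"},
    {"tail_vertex": 3, "head_vertex": 2, "label": "lives_in"},
    {"tail_vertex": 1, "head_vertex": 3, "label": "friend_of"},
]

def outgoing(v):  # in a real store: an index scan on tail_vertex
    return [e for e in edges if e["tail_vertex"] == v]

def incoming(v):  # in a real store: an index scan on head_vertex
    return [e for e in edges if e["head_vertex"] == v]

# Who lives in London? Follow the incoming "lives_in" edges of vertex 2.
residents = [vertices[e["tail_vertex"]]["name"]
             for e in incoming(2) if e["label"] == "lives_in"]
print(residents)  # ['Lucy', 'Alain']
```

&lt;p&gt;A graph query language such as Cypher expresses the same traversal declaratively, but underneath it is still this pair of vertex and edge lookups.&lt;/p&gt;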

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;We've taken a good look at how we keep and find our digital information. We've seen that old ways like SQL databases aren't always enough anymore, and that's why we're seeing more use of graph databases. These new methods are like a fresh coat of paint in a world that's always changing and growing, just like how Meta is doing things now.&lt;/p&gt;

&lt;p&gt;And we're not done yet! Next time, we'll dig into the nitty-gritty of how data is stored, talk about what "key-value" means, and explore all sorts of indexes. We'll also tackle big concepts like SSTables, LSM-trees, and B-trees, and we’ll try to make sense of how to make everything run faster and smoother. So, stay with us as we keep on exploring this HighLoad adventure. There's a lot more to come!&lt;/p&gt;




&lt;p&gt;Cover photo by &lt;a href="https://pixabay.com/photos/files-paper-office-paperwork-stack-1614223/" rel="noopener noreferrer"&gt;Ag Ku&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>highload</category>
    </item>
  </channel>
</rss>
