Forem: di(nara) critskaya

When Is Monitoring Enough? A Practical Guide to Database Observability

di(nara) critskaya — Thu, 21 Aug 2025 11:51:34 +0000

The Question That Started It All

It all began with a simple LinkedIn post titled "When is Monitoring Enough?" while I was building a Grafana dashboard for Apache Ignite. The question hit me like a lightning bolt: When is enough, enough?

We live in a world where databases demand robust monitoring, and countless tools promise to make our lives easier. Yet in practice, these tools often become part of the problem. They are excessive for our needs, miss critical features, or simply add to the noise rather than provide clarity.

Meanwhile, business requirements push us toward ever-increasing system complexity. We end up in one of two extremes: drowning in monitoring data we can't use, or flying blind without the observability we desperately need.
So the riddle remains: How much monitoring is truly enough?

Monitoring Isn't as Simple as It Looks

You might hand it over to your infrastructure engineers or DBREs, thinking they will sort it all out and the case is closed. But here's the problem: If you don't understand what you're looking at, how can you find the root cause of performance issues?

Installing hundreds of dashboards doesn't magically solve your problems. Without understanding, monitoring becomes a maze rather than a map.

We've encountered a dozen scenarios like this one:

Critical issue strikes
Alerts fire everywhere - but which one matters?
Dashboards show data - but what does it mean?
Hours pass while you hunt for the real problem
Knowledge gaps turn a 5-minute fix into a nightmare

The harsh truth: An inadequate monitoring system can be worse than no monitoring at all. It gives false confidence while hiding real problems, and so it leads us away from the root cause.

When it comes to databases, monitoring isn't just about collecting metrics - it's about understanding what to monitor. Your ops engineers might integrate thousands of dashboards and hundreds of alerts, but if these don't match your needs, they just turn the system into a burden.

The real question becomes: "How do we build a monitoring system that actually helps?"

The Abundance Trap

Having too many options can paralyze decision-making. When everything seems important, nothing is. You face critical questions:

Which metrics actually matter for YOUR use case?
How do you interpret these metrics correctly?
What thresholds indicate real problems vs. normal fluctuations?

The abundance that should empower you instead creates confusion during critical moments - exactly when you need clarity most.
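One way to approach the third question, real problems vs. normal fluctuations, is to compare a new reading with the metric's own recent baseline instead of a fixed number. The sketch below is only an illustration under stated assumptions: the function name `is_anomalous`, the latency samples, and the choice of three standard deviations are all hypothetical, and the right rule depends on your workload.

```python
from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """Flag `current` only if it sits more than k standard deviations
    away from the mean of the recent `history` samples."""
    if len(history) < 2:
        return False  # not enough data to know what "normal" looks like
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > k * spread

# Hypothetical query latency samples in milliseconds.
recent_latency_ms = [12, 14, 13, 15, 12, 13, 14, 13]

print(is_anomalous(recent_latency_ms, 16))  # small wobble -> False
print(is_anomalous(recent_latency_ms, 40))  # clear outlier -> True
```

A rule like this alerts on "unusual for this system" rather than "above a number someone picked once", which is closer to the question the section asks.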

Separating Business Needs from Technical Noise

I once worked with a team of developers who had imported pre-built Grafana dashboards. They didn't understand half of what they were looking at, or where to look. The dashboards were impressive, I must say, but at second glance I realized they had not been built for this team and didn't help the developers solve their problems. When problems hit, the team spent more time decoding the dashboards than fixing the issues.

Through painful experience I learned that most monitoring fails not because it shows too little, but because it shows too much. Every additional metric, graph, or alert adds cognitive weight. A radical idea occurred to me: Start by removing, not adding.

The dashboard (and metrics) must achieve three simple things:

Simplicity - we can understand what we are looking at;
Clarity - the message is unmistakable yet plain, like "Low memory", not "OOM: Out of memory exception";
Actionability - it tells you what to do next and prompts a clear response.
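The three properties above can be made concrete in how an alert is defined. The following is a minimal sketch, not a real alerting API: the `Alert` class, its fields, and the example texts are hypothetical and only show the idea of pairing a plain message with a next step.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    title: str      # plain wording, e.g. "Low memory"
    meaning: str    # one sentence on what is happening
    next_step: str  # what the person on call should do first

    def render(self) -> str:
        return f"{self.title}: {self.meaning} Next step: {self.next_step}"

def is_actionable(alert: Alert) -> bool:
    """An alert without a title or a next step only adds noise."""
    return bool(alert.title.strip() and alert.next_step.strip())

low_memory = Alert(
    title="Low memory",
    meaning="The database node has less than 10% of its memory free.",
    next_step="Check the largest running queries and recent load changes.",
)
cryptic = Alert(title="OOM: Out of memory exception", meaning="", next_step="")

print(low_memory.render())
print(is_actionable(low_memory), is_actionable(cryptic))
```

A check like `is_actionable` could run in review whenever someone adds a new alert, so that every alert ships with its response.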

By the end of the refactoring I had reached one conclusion: Simplicity isn't dumbing down - it's smartening up.

Manage Monitoring Complexity And Build Your Own Philosophy

Research in software engineering shows that our brains can only handle limited complexity. Cognitive load has been widely studied to help understand human performance. When monitoring becomes too complex, it stops helping and starts hurting.

Consider these commonly cited cognitive limits:

Working memory: Can hold 5-9 items at once
Attention span: Drops significantly after 20 minutes
Context switching: Each switch costs 15-25 minutes of productivity

You also shouldn't ignore the value of "why" questions for the business. By asking questions such as "why do we need it?" or "why monitor performance?", you can work out flexible principles that you can apply as your own philosophy:

Monitor for decisions, not data - prefer metrics that lead to a potential action;
Respect human limitations - keep the obvious in plain sight;
Embrace imperfection - good-enough monitoring that people use beats perfect monitoring they don't;
Iterate relentlessly - start simple and refine based on actual incidents.

Conclusion

After building monitoring systems for various databases and learning from actual incidents, I've come to see that the question isn't "How much can we monitor?" but rather "What monitoring serves our actual needs?"

Like a car dashboard that shows speed, fuel, and warnings - not every detail of engine operation - your database monitoring should surface what matters for safe operation, not everything that's technically possible.

Remember: The goal isn't to monitor everything. It's to understand your system well enough to monitor the right things.

Start small. Stay focused. Iterate based on reality, not anxiety.

And most importantly, always ask yourself: "If this metric changes, what will I do differently?"

If the answer is "nothing," you've found a metric you don't need.
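That closing question can be turned into a simple audit. The sketch below assumes you keep an inventory that maps each metric to the action you would take when it changes; the metric names and actions are hypothetical examples, not recommendations.

```python
# Hypothetical inventory: metric name -> what we do when it changes.
metric_actions = {
    "replication_lag_seconds": "Route reads to the primary and inspect the replica",
    "cache_hit_ratio": "Review memory settings and recent query changes",
    "checkpoint_count_total": "",   # nobody could name an action
    "jvm_threads_daemon": None,     # imported with a pre-built dashboard
}

def metrics_without_action(actions):
    """Return the metrics for which the answer is 'nothing'."""
    return sorted(
        name for name, action in actions.items()
        if not (action or "").strip()
    )

print(metrics_without_action(metric_actions))
# ['checkpoint_count_total', 'jvm_threads_daemon']
```

Whatever ends up in that list is a candidate for removal from dashboards and alerts.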

Databases: Why Does It Matter What We Choose?

di(nara) critskaya — Tue, 14 Mar 2023 16:46:25 +0000

As we all know, the database is a vital part of a project, but how well it serves us depends on the decisions we make throughout the whole development cycle. Databases entered our lives with the 'relational' and 'hierarchical' models. Since then dozens of databases have appeared, each promising the solution we need, yet in the end our projects start 'suffering' from the database we chose and the approaches we took.

Traditional DBMS, NoSQL, NewSQL: we have no shortage of options. What we lack is the right choice, and we sometimes make the wrong one, ending up with network or query performance issues, not to mention unmet business needs. You could try to solve all of this by hiring a DBA, but a DBA is not a silver bullet; they can fix your project's database issues only as far as that is possible.

So the question "How do we choose and design a database for the project's needs without getting stuck with it?" should concern us in the first place. Still, the path to an answer shouldn't be as hard as it seems at first. That is why this article tries to explain (rather than prescribe a single correct path) how you could build a pattern for dealing with your database problems.

What problems can we have?

When it comes to 'problems', three main factors can be identified: technology, humans, and economy.

Technologies

Here the trouble depends on the project's current stage of development: whether it's a PoC, an MVP, or production-ready. A PoC leaves room to maneuver but lacks certainty about its data. An MVP already has real data in a database chosen during the first iteration, but it can face economic uncertainty that leads to tech debt later.

Meanwhile, a production-ready project has everything we need: a database, data, code, and an integrated -ops culture that helps it improve. Nevertheless, it can't rule out serious issues with its databases or with their replica sets across multiple servers.

Humans

Alongside technology, the human factor is one of a project's critical parts. Misconceptions about the data, caused by the subtleties of the final product or by misunderstandings between client and team, can lead to tech debt or performance issues in production and, as a consequence, to mistrust among team members.

In addition, a chaotic management process and a poor engineering culture on the project can accelerate database issues.

Economy

More often than not, economics isn't a project's strong side. Sometimes the customer doesn't understand how much it will cost to make the desired data available, and the team doesn't know how much they're going to spend on the database cluster. In some cases, the team or its management can't decide whether it's better and cheaper to support the database cluster with an in-house DBA or to buy external support.

OK, but what about examples?

So far no examples have been shown. Let's first answer a few questions:

When was the last time you ran into database problems on the technological side of the project?
When did you last argue about issues involving databases?
Were your calculations about database maintenance right?

If you could answer all the questions above, you can skip this chapter. If not, let's look at the examples, keeping in mind that the following situations may never happen to you.

Example 1

A team is not satisfied with its current database performance and wants to improve it by 30-40%, but they find out that speed is not their only concern: stability matters too. Their question is: "How can we improve performance so that we can be sure 99.99% of our queries meet our SLA goal?" After a desperate search they arrive at three options: sharding, a hardware upgrade, or a fix to the table itself.

Even so, every option is worrisome. If they pick sharding, they may run into network or read/write problems; with an upgrade, the team faces financial risk or a possible downgrade later. The last option, reworking the table or the database itself, costs time they don't have.
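The team's question in Example 1 can be made measurable before any option is chosen. The sketch below computes the share of queries that finish within a latency threshold and compares it with the 99.99% target from the example. The 100 ms threshold, the function names, and the sample data are assumptions for illustration only.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Share of queries that finished within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no latency samples to evaluate")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)

def meets_target(sli, target=0.9999):
    """True when the measured share reaches the 99.99% goal."""
    return sli >= target

# Hypothetical sample: 100,000 queries, 15 of them slow.
samples = [20.0] * 99_985 + [900.0] * 15
sli = latency_sli(samples, threshold_ms=100.0)

print(f"{sli:.5f}", meets_target(sli))
```

At 99.99%, only 10 out of 100,000 queries may miss the threshold, so 15 slow queries already break the goal. Seeing the number this way helps judge whether sharding, an upgrade, or a schema fix is proportionate to the gap.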

Example 2

Meet Marlene and Yevhen. Marlene is a database expert, whereas Yevhen is a developer. Despite their differences they work on the same team, and their teammates know that the two occasionally argue when it comes to the database. This time was no exception.

Marlene was concerned that Yevhen's solution to a task might cause trouble in the future: it could unexpectedly drive the database into an error such as OOM and leave it dead. Yevhen insisted on the opposite and tried to reassure the team that his solution would solve the client's problem.

How could R&D solve it?

As much as we'd like to satisfy both our own needs and our customers', we should understand the importance of engineering culture, and more precisely of its innovation part, which is often considered one of the factors in a company's survival. Admittedly, it can also become a burden, but that's not the point here.

For now, let's see how R&D could help us bring more transparency to the database. We can divide the research and development cycle into four steps: theorize; explore; design, develop and test; implement and improve.

Let's see below how this cycle can be applied to database problems.

Theorize (and synthesize)

When theorizing, we describe the object of research and the direction of exploration. It makes sense to define the research object here in terms of the well-known SRE concepts: SLA, SLO and SLI. We should also keep in mind the trade-offs between two points of view, the customer's and the team's. While the customer defines what kind of information they need to see at the final stage, the engineering team should show that it can support the client's demands, from the technological aspect through to the economic one.

Looking back at our examples, the team in Example 1 has mostly made a few assumptions (sharding, a hardware upgrade, and database schema refactoring), whereas Marlene and Yevhen disagree about the indicators and the agreement with the client: Marlene is sure that Yevhen's decision will affect the SLA, while Yevhen is thinking only about the SLI.

Explore (hypothesize and clarify)

Here we try to turn our theoretical beliefs into something real, confirming or rejecting whether the client's needs can be met. The desired result is a feasibility study together with a pilot study; these documents answer questions such as "Did we get what we wanted? Are there any risks during development or after implementing the client's needs?" In most cases this is our first attempt to implement features for the database and the client's data: it can be a database schema integration or adding capabilities to an already existing cluster.

It's okay to identify non-standard situations and experiment with them. The deeper we go into the details at this low-cost stage, the more confident we can be that the later design and implementation will be worth it.

In the examples above, the team from Example 1 will run experiments in a test environment to solve the puzzle, from tracing queries and reading explain plans to refactoring the database schema. In Example 2, Marlene and Yevhen, who suffer from a lack of communication, will sit down together and absorb each other's knowledge through 'pair programming'.
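An explain-plan experiment of the kind mentioned for Example 1 can be tried in a throwaway environment. The article does not name a database, so the sketch below assumes SQLite (via Python's built-in `sqlite3` module) purely because it needs no setup; the `orders` table, the index name, and the query are hypothetical.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the readable steps of SQLite's plan for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column holds the description

conn = sqlite3.connect(":memory:")  # isolated test database
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

sql = "SELECT * FROM orders WHERE customer_id = ?"
plan_before = query_plan(conn, sql, (42,))

# The experiment: add an index and compare the plan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan lines differs between SQLite versions, but the first plan should describe a scan of the whole table and the second a search that uses `idx_orders_customer`. Other databases expose the same idea through their own `EXPLAIN` variants.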

Design, develop and test

Nowadays domain knowledge is as necessary as tech skills. Each team member should also understand that, like it or not, they can be replaced at any time. The client won't follow you if you explain the technical aspects of a feature's implementation; they care about their business demands and expect you to understand, as they do, what they will see in the output based on the data they gave you.

Most of the R&D work happens here, in this step. We develop the feature and make sure it meets the SLA we defined while theorizing. Along the way we should contribute to the documentation and the project knowledge base, so that future team members can work alongside us or instead of us. Quality assurance should carry the same weight as development; moreover, QA makes sure that our database enhancement or data meets the business requirements.

Back to our examples. The team from Example 1 didn't expect the problem to be so simple: a troublesome index and some optimizations they had made a while ago led to their current problem. They decided to roll back some of their changes and wrote a post-mortem, in which they stated that the problem lay not only in the database but also in their poor knowledge of its internals. Marlene and Yevhen worked out a solution that met both their expectations; in addition, the client was excited about the modifications to the output and approved them.

Implement and improve

The outcome is a successful implementation and data delivered to the client. Improvements will still be made, not as fast as possible but as fast as suits us and our client. Of course, this step also covers minor bugs and fixes released after our database changes, from migrations to altering a value type (where that is possible).

Conclusion

If you build a system, remember that it's not only code and performance that matter. By adopting an R&D approach, you can improve your team's engineering culture and build a system in which the database doesn't turn into a catastrophe but stays under control, with data that meets the client's needs.