<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Juliano Moreno</title>
    <description>The latest articles on Forem by Juliano Moreno (@julianomoreno).</description>
    <link>https://forem.com/julianomoreno</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1192176%2Ffe9205ad-6cff-4543-9d86-c1f8013b999d.jpeg</url>
      <title>Forem: Juliano Moreno</title>
      <link>https://forem.com/julianomoreno</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/julianomoreno"/>
    <language>en</language>
    <item>
      <title>Is Regression Testing really necessary?</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:13:37 +0000</pubDate>
      <link>https://forem.com/julianomoreno/is-regression-testing-really-necessary-8e7</link>
      <guid>https://forem.com/julianomoreno/is-regression-testing-really-necessary-8e7</guid>
      <description>&lt;p&gt;Are we testing wrong? Or coding wrong? Or planning wrong? Worse: are we getting everything wrong?&lt;/p&gt;

&lt;p&gt;What's the real advantage of adopting a microservices architecture or a segmented system if every software change automatically triggers a regression test request? Is this due to a legitimate fear of system impact? If so, are we truly refining and analyzing these impacts correctly before implementing changes? Or has this simply become a convention we follow without questioning?&lt;/p&gt;

&lt;p&gt;Regardless of the reason, it's QA's role to question this. We must evaluate whether regression testing is genuinely necessary in every situation and propose more targeted alternatives that are more likely to ensure software quality. Regression testing should be the exception, not the rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost (and Risk) of Regression Testing
&lt;/h2&gt;

&lt;p&gt;Regression testing is costly, whether manual or automated.&lt;/p&gt;

&lt;p&gt;For manual tests, depending on the size and complexity of the software, we're talking about hours or even days of work. Yet, even then, there's no guarantee of adequate test coverage. Some parts of the system may remain untested, allowing defects to go unnoticed.&lt;/p&gt;

&lt;p&gt;With automated tests, the picture can be similar. It's common practice to focus on main flows and critical functionalities, prioritizing what seems most relevant. However, this can leave significant gaps, compromising the overall reliability of the tests.&lt;/p&gt;

&lt;p&gt;Furthermore, pursuing the idea of automating every flow, interaction, and exception in the software can lead to chaos. &lt;br&gt;
Imagine a scenario where the test codebase surpasses the application's codebase in size. Maintaining this level of automation could become more expensive than evolving the software itself. Testing should add value, not become a bottleneck.&lt;/p&gt;

&lt;p&gt;Requesting regression tests for every change is like using a hammer to fix every problem. It's not always the right tool for the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Reduce Dependency on Regression Testing
&lt;/h2&gt;

&lt;p&gt;Honestly, there's no single or definitive answer to this question. However, some practices, when applied correctly, can help reduce the excessive reliance on regression testing while increasing confidence in the software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Sound Architectural Practices&lt;/strong&gt;&lt;br&gt;
The system's architecture is the foundation. In modern architectures like microservices, it's crucial to clearly define the boundaries of each module and ensure their responsibilities are well-delimited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Effective Unit and Integration Tests&lt;/strong&gt;&lt;br&gt;
Unit tests are the first line of defense. They ensure that each code unit functions as expected.&lt;br&gt;
Integration tests, in turn, validate whether the microservice interacts correctly with real dependencies (RabbitMQ, databases, external APIs). This reduces the need for broad regression tests by confirming the service's behavior in scenarios close to production.&lt;/p&gt;
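&lt;p&gt;As a minimal sketch of that first line of defense, here is what a unit test can look like in practice. The function, its discount rules, and the thresholds are hypothetical examples, not taken from any real system:&lt;/p&gt;

```python
# Minimal sketch of a unit test as a "first line of defense" (pytest style).
# The function and its discount rules are hypothetical, not from the article.

def calculate_discount(order_total: float) -> float:
    """Return the discount rate applied to an order total."""
    if order_total >= 500:
        return 0.10
    if order_total >= 100:
        return 0.05
    return 0.0

def test_no_discount_below_threshold():
    assert calculate_discount(99.99) == 0.0

def test_mid_tier_discount():
    assert calculate_discount(100) == 0.05

def test_top_tier_discount():
    assert calculate_discount(500) == 0.10
```

&lt;p&gt;Each test pins one rule in place, so a change that breaks it fails fast at the unit level instead of surfacing later in a broad regression run.&lt;/p&gt;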

&lt;p&gt;&lt;strong&gt;3. Strategic Test Planning&lt;/strong&gt;&lt;br&gt;
QA needs to invest more time in impact analysis. Software changes rarely affect the entire system. Understanding where the real impacts are is essential for planning more focused and relevant tests.&lt;/p&gt;

&lt;p&gt;Two approaches can help structure this planning more strategically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Shift-Left Testing&lt;/strong&gt; &lt;br&gt;
This practice emphasizes prioritizing quality from the earliest stages of the development cycle. Unit, integration, and contract tests are executed as early as possible, in parallel with development, reducing the risk of accumulated issues that require extensive regression tests later. This also fosters a culture of collaboration between development and QA, enabling early identification of risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Risk-Based Testing&lt;/strong&gt;&lt;br&gt;
This strategy focuses on prioritizing areas of the system most likely to fail or with the greatest potential impact if something goes wrong. QA identifies critical functionalities, analyzes the risks associated with each change, and directs testing efforts to the areas that matter most. This allows for smarter coverage, minimizing the need to validate the entire system with regression tests.&lt;/p&gt;

&lt;p&gt;How These Practices Help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;With shift-left testing, many problems are identified and fixed before reaching the integration stage, reducing the need for late-stage validations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Risk-based testing ensures time and resources are spent where they truly make a difference, avoiding unnecessary effort in low-impact areas.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Automation Aligned with Business Goals&lt;/strong&gt;&lt;br&gt;
Automation should be strategic. Not everything needs to be automated, but what is automated should be carefully chosen. Prioritize critical scenarios and those that add real value to the business. Automating indiscriminately only increases costs and complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Monitoring in Production&lt;/strong&gt;&lt;br&gt;
Modern tools allow for monitoring software in production, identifying real usage problems that tests might not foresee. This continuous feedback complements testing and reduces the need for extensive regression suites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shifting Focus: Intelligent Quality
&lt;/h2&gt;

&lt;p&gt;Regression tests are often used as a safety net to compensate for deficiencies in other areas, such as planning, architecture, or development. They still have their place. In large-scale changes, such as extensive refactoring or critical updates, regression tests may be necessary to validate overall stability. However, they should be the exception, not the rule.&lt;/p&gt;

&lt;p&gt;Testing what truly matters is more important than attempting to test everything. This requires a joint effort between development, QA, and architecture to create more resilient systems and eliminate blind spots in the development cycle.&lt;/p&gt;

&lt;p&gt;In the end, quality is not just about testing. It's about strategically thinking ahead to prevent failures before they even exist.&lt;/p&gt;

</description>
      <category>smarttesting</category>
      <category>automationtesting</category>
      <category>regressionstrategy</category>
    </item>
    <item>
      <title>The system failed. Your log should explain why</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Wed, 08 Apr 2026 14:58:00 +0000</pubDate>
      <link>https://forem.com/julianomoreno/the-system-failed-your-log-should-explain-why-2326</link>
      <guid>https://forem.com/julianomoreno/the-system-failed-your-log-should-explain-why-2326</guid>
      <description>&lt;p&gt;In the previous article, I brought up a point that is rarely discussed: &lt;a href="https://dev.to/julianomoreno/bad-logs-are-as-dangerous-as-bugs-in-production-5ij"&gt;a bad log can be as dangerous as a bug in production&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;That leads to a natural question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If logs are so critical, what should we actually analyze?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice, when something fails, the first reaction of the team is to check the logs. That is where everyone expects to find answers.&lt;/p&gt;

&lt;p&gt;The problem is that, most of the time, logs don’t deliver.&lt;/p&gt;

&lt;p&gt;They are verbose, full of long stack traces generated by frameworks, generic messages, and very little useful context. The information is there, but it is not actionable.&lt;/p&gt;

&lt;p&gt;The result is predictable: time wasted filtering noise, difficulty finding the real point of failure, and often the need to reproduce the issue just to understand what happened.&lt;/p&gt;

&lt;p&gt;This is not a tooling problem. It is a quality problem.&lt;/p&gt;

&lt;p&gt;From a QA perspective, logs are not technical output. Logs are &lt;strong&gt;operational evidence&lt;/strong&gt;, and evidence must be reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Logs are not for developers. They are for the system to explain itself
&lt;/h2&gt;

&lt;p&gt;When an incident happens, no one cares how the code was written.&lt;/p&gt;

&lt;p&gt;The question is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happened?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the log doesn’t answer that, someone will have to investigate manually. And that costs time, trust, and money.&lt;/p&gt;




&lt;h2&gt;
  
  
  Standards already exist. The problem is not using them
&lt;/h2&gt;

&lt;p&gt;There is no lack of reference.&lt;/p&gt;

&lt;p&gt;The most common and widely adopted approaches are 5W + H (what, where, when, who, why, how) and Event + Context + Outcome. Standards like OpenTelemetry and Elastic Common Schema reinforce the same idea: logs must be structured, contextualized, and traceable.&lt;/p&gt;

&lt;p&gt;There is no complexity here. A good log describes an event with enough context, a clear outcome, and the ability to trace it.&lt;/p&gt;
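&lt;p&gt;As an illustration of the Event + Context + Outcome idea, here is a minimal structured-logging sketch in Python. The field names (&lt;code&gt;event&lt;/code&gt;, &lt;code&gt;outcome&lt;/code&gt;, &lt;code&gt;orderId&lt;/code&gt;) are illustrative assumptions, not a mandated schema:&lt;/p&gt;

```python
# Sketch: one structured log event following the Event + Context + Outcome idea.
# Field names (event, outcome, orderId, traceId) are illustrative, not a schema.
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")

def log_event(event: str, context: dict, outcome: str,
              level: int = logging.INFO) -> str:
    """Emit one structured log line: the event, its context, and its outcome."""
    record = {"event": event, "outcome": outcome, **context}
    line = json.dumps(record, sort_keys=True)
    logger.log(level, line)
    return line

line = log_event(
    "payment.charge_attempted",
    {"orderId": "o-123", "traceId": "t-abc"},
    outcome="declined",
    level=logging.WARNING,
)
```

&lt;p&gt;Because the line is machine-parseable, it can be queried and correlated later instead of being grepped as free text.&lt;/p&gt;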




&lt;h2&gt;
  
  
  What should be analyzed in logs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flow reconstruction
&lt;/h3&gt;

&lt;p&gt;A good log should allow you to understand the beginning, middle, and end of a flow. If that is not possible, there is an observability problem. Missing logs are blind spots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context
&lt;/h3&gt;

&lt;p&gt;Logs must clearly show which entity was affected. Without identifiers like &lt;code&gt;orderId&lt;/code&gt;, &lt;code&gt;paymentId&lt;/code&gt;, or &lt;code&gt;userId&lt;/code&gt;, there is no investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clarity
&lt;/h3&gt;

&lt;p&gt;Messages must be direct and unambiguous. “Error processing” explains nothing. If you need to read the code to understand the log, it failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Severity
&lt;/h3&gt;

&lt;p&gt;Severity must reflect impact. When everything is INFO or everything is ERROR, the signal is lost. Logs should distinguish normal behavior, controlled issues, and real failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traceability
&lt;/h3&gt;

&lt;p&gt;In distributed systems, logs must be connected. Without &lt;code&gt;traceId&lt;/code&gt; or &lt;code&gt;correlationId&lt;/code&gt;, each log becomes an isolated piece, and isolated pieces don’t explain complex flows.&lt;/p&gt;
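&lt;p&gt;One common way to keep logs connected is to bind a correlation id once, at the edge of the system, and attach it to every log line automatically. A minimal sketch, assuming Python's &lt;code&gt;contextvars&lt;/code&gt; and an illustrative id format:&lt;/p&gt;

```python
# Sketch: binding a correlationId once per request and attaching it to every
# log line. Names and the id format are illustrative assumptions.
import contextvars
import json
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="unset")

def start_request() -> str:
    """Generate and bind a correlation id at the edge of the system."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def log(event: str, **fields) -> str:
    """Every log line automatically carries the bound correlationId."""
    record = {"event": event, "correlationId": correlation_id.get(), **fields}
    return json.dumps(record)

cid = start_request()
line1 = log("order.received", orderId="o-1")
line2 = log("payment.charged", orderId="o-1")
```

&lt;p&gt;Both lines now carry the same id, so the whole flow can be pulled with a single query, and the same id can travel across services in request or message headers.&lt;/p&gt;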

&lt;h3&gt;
  
  
  Critical points
&lt;/h3&gt;

&lt;p&gt;Logs must exist where risk exists: external integrations, state changes, key decisions, retries, and fallbacks. If logs appear only at the final error, it is already too late.&lt;/p&gt;

&lt;h3&gt;
  
  
  System behavior
&lt;/h3&gt;

&lt;p&gt;Logs should explain what the system did after an event. Did it retry? Fallback? Abort? Without this, the diagnosis is incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact
&lt;/h3&gt;

&lt;p&gt;Knowing that something failed is not enough. Logs should show the impact: was the operation interrupted, was data affected, was the user impacted?&lt;/p&gt;

&lt;h3&gt;
  
  
  Noise
&lt;/h3&gt;

&lt;p&gt;More logs do not mean better logs. Too much information can be as harmful as too little.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sensitive data
&lt;/h3&gt;

&lt;p&gt;Logs must not expose sensitive information such as passwords, tokens, or personal data. This is also a quality concern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where QA should evaluate this
&lt;/h2&gt;

&lt;p&gt;Logs should not be evaluated only in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code review
&lt;/h3&gt;

&lt;p&gt;Code reviews are usually done by developers, but log quality criteria must be present. Critical points should be logged, context must be sufficient, and messages must be clear. The role of QA is not to perform the review, but to ensure that these criteria exist and are applied.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tests
&lt;/h3&gt;

&lt;p&gt;Logs should be validated mainly in development-level tests, such as integration tests and, when necessary, unit tests. It is important to verify if logs are generated in relevant scenarios, if the content is correct, and if there are unnecessary logs.&lt;/p&gt;
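&lt;p&gt;At that level, the log itself can be an assertion target. A sketch using the standard library's &lt;code&gt;assertLogs&lt;/code&gt;; the service function and its message format are hypothetical:&lt;/p&gt;

```python
# Sketch: treating a log as part of the expected behavior in a test,
# using the standard library's assertLogs. The service code is hypothetical.
import logging
import unittest

logger = logging.getLogger("orders")

def cancel_order(order_id: str) -> None:
    # Hypothetical service code: this log line is observable behavior.
    logger.warning("order.cancelled orderId=%s", order_id)

class CancelOrderLogs(unittest.TestCase):
    def test_cancel_emits_warning_with_context(self):
        with self.assertLogs("orders", level="WARNING") as captured:
            cancel_order("o-42")
        self.assertIn("orderId=o-42", captured.output[0])

suite = unittest.TestLoader().loadTestsFromTestCase(CancelOrderLogs)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

&lt;p&gt;If someone later deletes the log line or drops the identifier, this test fails, exactly like any other behavioral regression.&lt;/p&gt;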

&lt;p&gt;At higher levels, such as E2E and API tests, logs act as support for diagnosis. They should help explain system behavior, allow flow correlation, and reduce the need to reproduce issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incidents
&lt;/h3&gt;

&lt;p&gt;Logs must also be evaluated during incidents. Did they help or slow things down? Were there blind spots?&lt;/p&gt;




&lt;h2&gt;
  
  
  The real problem
&lt;/h2&gt;

&lt;p&gt;The problem is not the absence of logs. The problem is the absence of criteria.&lt;/p&gt;

&lt;p&gt;Without criteria, each developer logs in a different way, each service tells a different story, and each incident becomes a manual investigation.&lt;/p&gt;




&lt;h2&gt;
  
  
  A simple question
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Can you understand what happened without running the system again?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is no, there is a quality problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Logs are not a technical detail. They are not debug. They are not optional.&lt;/p&gt;

&lt;p&gt;Logs are part of the system, and they must be treated that way.&lt;/p&gt;

&lt;p&gt;Otherwise, when they are most needed, they will fail.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>softwarequality</category>
      <category>qa</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>The system went down. Your log should explain why.</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:01:27 +0000</pubDate>
      <link>https://forem.com/julianomoreno/o-sistema-caiu-seu-log-deveria-explicar-por-que-5041</link>
      <guid>https://forem.com/julianomoreno/o-sistema-caiu-seu-log-deveria-explicar-por-que-5041</guid>
      <description>&lt;p&gt;No artigo anterior, trouxe um ponto que pouca gente discute: &lt;a href="https://dev.to/julianomoreno/log-ruim-e-tao-perigoso-quanto-bug-em-producao-4040"&gt;&lt;strong&gt;um log ruim pode ser tão perigoso quanto um bug em produção&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Agora vem a pergunta natural:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Se log é tão crítico assim, o que exatamente deveria ser analisado?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Na prática, quando algo falha, o primeiro movimento do time é olhar para o log. É ali que todos esperam encontrar respostas.&lt;/p&gt;

&lt;p&gt;O problema é que, na maioria das vezes, o log não cumpre esse papel.&lt;/p&gt;

&lt;p&gt;Logs verbosos, com stacktraces extensos gerados pelo framework, mensagens genéricas e pouco contexto útil. Informação existe, mas não é acionável.&lt;/p&gt;

&lt;p&gt;O resultado é previsível: tempo perdido filtrando ruído, dificuldade para chegar no ponto exato da falha e, muitas vezes, necessidade de reproduzir o problema para entender o que aconteceu.&lt;/p&gt;

&lt;p&gt;Isso não é problema de ferramenta. É problema de qualidade.&lt;/p&gt;

&lt;p&gt;Na visão de QA, log não é saída técnica. Log é &lt;strong&gt;evidência operacional&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;E evidência precisa ser confiável.&lt;/p&gt;




&lt;h2&gt;
  
  
  Logs are not for the Dev. They are for the system to explain itself
&lt;/h2&gt;

&lt;p&gt;When an incident happens, no one wants to know how the code was written.&lt;/p&gt;

&lt;p&gt;The question is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happened?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the log doesn’t answer that, someone will have to investigate manually.&lt;/p&gt;

&lt;p&gt;And that costs time, trust, and money.&lt;/p&gt;




&lt;h2&gt;
  
  
  Standards already exist. The problem is not using them
&lt;/h2&gt;

&lt;p&gt;There is no lack of reference.&lt;/p&gt;

&lt;p&gt;The simplest and most widely adopted models are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5W + H (what, where, when, who, why, and how)&lt;/li&gt;
&lt;li&gt;Event + Context + Outcome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standards like OpenTelemetry and Elastic Common Schema reinforce the same idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;logs must be structured, contextualized, and traceable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is no complexity here.&lt;/p&gt;

&lt;p&gt;A good log describes an event, with enough context, a clear outcome, and the ability to correlate it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What should be analyzed in logs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Can the flow be reconstructed?
&lt;/h3&gt;

&lt;p&gt;Can you understand the beginning, middle, and end? If not, there is an observability problem. A gap in the logs is a blind spot.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Is there enough context?
&lt;/h3&gt;

&lt;p&gt;Can you tell which entity was affected? Without &lt;code&gt;orderId&lt;/code&gt;, &lt;code&gt;paymentId&lt;/code&gt;, or &lt;code&gt;userId&lt;/code&gt;, there is no investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Is the message clear?
&lt;/h3&gt;

&lt;p&gt;“Error processing” explains nothing. If you need to open the code to understand the log, it has failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Does the severity level make sense?
&lt;/h3&gt;

&lt;p&gt;Everything as INFO or everything as ERROR is a mistake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;INFO → relevant normal flow&lt;/li&gt;
&lt;li&gt;WARN → unexpected but controlled behavior&lt;/li&gt;
&lt;li&gt;ERROR → failure with impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wrong severity distorts how the system is read.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Is there traceability?
&lt;/h3&gt;

&lt;p&gt;Can you follow the flow across services? Without &lt;code&gt;traceId&lt;/code&gt; or &lt;code&gt;correlationId&lt;/code&gt;, each log becomes an isolated piece. Isolated pieces don’t explain a distributed system.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Are the critical points logged?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;external integrations&lt;/li&gt;
&lt;li&gt;state changes&lt;/li&gt;
&lt;li&gt;relevant decisions&lt;/li&gt;
&lt;li&gt;retries and fallbacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logging only the final error is not enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Does the system explain itself?
&lt;/h3&gt;

&lt;p&gt;Does the log show what the system did?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did it retry?&lt;/li&gt;
&lt;li&gt;did it fall back?&lt;/li&gt;
&lt;li&gt;did it abort?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this, the diagnosis is incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Is there a view of the impact?
&lt;/h3&gt;

&lt;p&gt;It failed. So what?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;was the operation interrupted?&lt;/li&gt;
&lt;li&gt;is there inconsistent data?&lt;/li&gt;
&lt;li&gt;was the user impacted?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A technical log without impact does not support decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Is there noise?
&lt;/h3&gt;

&lt;p&gt;More logs do not mean better logs. Excess hurts as much as absence.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Is sensitive data exposed?
&lt;/h3&gt;

&lt;p&gt;Passwords, tokens, CPF (Brazilian taxpayer ID) numbers, card data. If any of these appear in a log, there is a problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where QA should evaluate this
&lt;/h2&gt;

&lt;p&gt;Logs should not be evaluated only in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Review (system code)
&lt;/h3&gt;

&lt;p&gt;The review is done by the Devs. Even so, log quality criteria must be present:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;are the critical points logged?&lt;/li&gt;
&lt;li&gt;is there enough context?&lt;/li&gt;
&lt;li&gt;is the message clear?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;QA’s role is not to perform the review. It is to ensure that the criteria exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tests
&lt;/h3&gt;

&lt;p&gt;Mainly in development-level tests (integration and, when necessary, unit):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;are logs generated in the relevant scenarios?&lt;/li&gt;
&lt;li&gt;is the content correct?&lt;/li&gt;
&lt;li&gt;are there logs that should not be there?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, the log is part of the validation.&lt;/p&gt;

&lt;p&gt;At more integrated levels (E2E and APIs), the log becomes support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does it help explain the behavior?&lt;/li&gt;
&lt;li&gt;does it allow the flow to be correlated?&lt;/li&gt;
&lt;li&gt;does it avoid having to reproduce the problem?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Incidents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;did the log help or get in the way?&lt;/li&gt;
&lt;li&gt;were there blind spots?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The real problem
&lt;/h2&gt;

&lt;p&gt;It is not a lack of logs. It is a lack of criteria.&lt;/p&gt;

&lt;p&gt;Without criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each dev logs in their own way&lt;/li&gt;
&lt;li&gt;each service tells a different story&lt;/li&gt;
&lt;li&gt;each incident becomes a manual investigation&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A simple question
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Can you understand what happened without running the system again?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you can’t, there is a quality problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Logs are not a technical detail. They are not debug. They are not optional.&lt;/p&gt;

&lt;p&gt;Logs are part of the system. And they must be treated as such.&lt;/p&gt;

&lt;p&gt;Otherwise, at the moment they are needed, they will not do their job.&lt;/p&gt;

</description>
      <category>softwarequality</category>
      <category>observability</category>
      <category>logging</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Bad logs are as dangerous as bugs in production</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Mon, 06 Apr 2026 15:20:48 +0000</pubDate>
      <link>https://forem.com/julianomoreno/bad-logs-are-as-dangerous-as-bugs-in-production-5ij</link>
      <guid>https://forem.com/julianomoreno/bad-logs-are-as-dangerous-as-bugs-in-production-5ij</guid>
      <description>&lt;p&gt;A normal workday. The system has a critical failure. The whole team runs to check the logs, and at that moment, they see that the logs have too much unnecessary information. This makes it hard to find the real point of failure. In some cases, logs have too little information. This forces the Dev and/or QA to debug the system, recreating the same conditions where the error happened to try to find the cause. In worse situations, there is no log at that point in the system. Who has never experienced something like this?&lt;/p&gt;

&lt;p&gt;Logs are rarely treated as critical points during architecture, code review, and validation phases. They should be. Logs are like home insurance: you only care when something bad happens. In my more than 20 years of experience, I have rarely seen teams treat logs the way they should.&lt;/p&gt;

&lt;p&gt;From a QA perspective, analyzing logs often feels like being Indiana Jones, doing archaeology work. Most of the time, logs are not structured and not objective. It becomes a tiring task: unclear messages, confusing stack traces, unnecessary or missing information. A lot of time is lost on something that should be fast. This directly impacts response time to customers, planning, and delivery of the fix. That is why QA often feels discouraged from doing this work.&lt;/p&gt;

&lt;p&gt;There is usually no standard for logs in companies. Each product uses its own structure, naming, content, and data masking. On top of that, many Devs lack experience — or even technical knowledge — to use these patterns correctly, and mainly to know where in the code logs should be added.&lt;/p&gt;

&lt;p&gt;Now imagine this scenario in a critical system for a company. For example, a marketplace microservice responsible for charging credit cards goes down during Black Friday. What would be the team’s response time? How confident would they be that they are looking at the correct point that caused the failure?&lt;/p&gt;

&lt;p&gt;Logs should be planned during architecture, refined during development, and also used as a quality criterion in code review and tests.&lt;/p&gt;

&lt;p&gt;A good log should answer, at least, these questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What happened?&lt;/li&gt;
&lt;li&gt;Where did it happen?&lt;/li&gt;
&lt;li&gt;In which context (which entity was affected)?&lt;/li&gt;
&lt;li&gt;What was the result (success, failure, fallback)?&lt;/li&gt;
&lt;li&gt;What was the impact on the system or the business?&lt;/li&gt;
&lt;li&gt;Why did it happen (when possible)?&lt;/li&gt;
&lt;li&gt;What did the system do after that (retry, fallback, abort)?&lt;/li&gt;
&lt;li&gt;How can I trace this event (traceId, correlationId)?&lt;/li&gt;
&lt;li&gt;Is this expected or abnormal behavior?&lt;/li&gt;
&lt;/ol&gt;
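&lt;p&gt;To make this concrete, here is one possible log record that attempts to answer each question. Every field name and value is illustrative, not a required schema:&lt;/p&gt;

```python
# Sketch: one log record that tries to answer each of the questions above.
# All field names and values are illustrative, not a mandated schema.
import json

record = {
    "event": "payment.charge_failed",           # 1. what happened
    "service": "checkout-api",                  # 2. where it happened
    "orderId": "o-981", "userId": "u-17",       # 3. context / affected entity
    "outcome": "failure",                       # 4. result
    "impact": "order left in PENDING state",    # 5. system/business impact
    "reason": "gateway timeout after 3s",       # 6. why (when known)
    "action": "retry scheduled",                # 7. what the system did next
    "traceId": "t-5f2c",                        # 8. how to trace the event
    "expected": False,                          # 9. expected vs. abnormal
    "level": "ERROR",
}
line = json.dumps(record)
```

&lt;p&gt;A reader of this single line knows what broke, who was affected, what the system did about it, and how to find the rest of the flow.&lt;/p&gt;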

&lt;p&gt;&lt;strong&gt;Log is not debug.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Log is operational evidence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And bad evidence leads to wrong decisions.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
      <category>bugfix</category>
    </item>
    <item>
      <title>A bad log is as dangerous as a bug in production</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Mon, 06 Apr 2026 14:52:09 +0000</pubDate>
      <link>https://forem.com/julianomoreno/log-ruim-e-tao-perigoso-quanto-bug-em-producao-4040</link>
      <guid>https://forem.com/julianomoreno/log-ruim-e-tao-perigoso-quanto-bug-em-producao-4040</guid>
      <description>&lt;p&gt;Dia comum de trabalho. O sistema apresenta uma falha crítica. O time todo corre para ver o log e, nesse momento, descobre que ele possui muitas informações desnecessárias, o que dificulta chegar ao ponto que realmente falhou. Em alguns casos, o log tem poucas informações, o que força o Dev e/ou QA a debugarem o sistema, recriando as mesmas condições em que o erro ocorreu para tentar identificar a causa da falha. Em situações ainda piores, não há disparo de log nesse ponto do sistema. Quem nunca passou por algo semelhante?&lt;/p&gt;

&lt;p&gt;Logs are rarely treated as critical points during the architecture, code review, and validation phases. They should be. A log is like home insurance: you only value it at the moment of the catastrophe. In my more than 20 years on the road, I have rarely seen teams treat this topic the way it should be treated.&lt;/p&gt;

&lt;p&gt;From my QA perspective, analyzing logs often feels like being Indiana Jones, doing archaeology work. Most of the time they are not structured, much less objective. It is an exhausting task: unclear messages, confusing stack traces, unnecessary or missing information. A lot of time is lost on something that should be fast. This directly impacts response time to the customer, planning, and delivery of the fix. That is why QA often feels discouraged from performing this task.&lt;/p&gt;

&lt;p&gt;Companies usually lack a standard for logs. Each product adopts its own structure, naming, content, and data masking. Add to that the lack of experience, or even technical knowledge, of many Devs to use these patterns correctly and, above all, to know where in the code logs should be captured.&lt;/p&gt;

&lt;p&gt;Now imagine this whole scenario in software that is critical for a company. For example, a marketplace microservice responsible for charging credit cards going down in the middle of Black Friday. What would the team’s response time be? How confident would they be that they are analyzing the exact point that caused the failure?&lt;/p&gt;

&lt;p&gt;Logs should start being planned during architecture, be refined during development and, beyond that, serve as a quality gate in code review and in tests.&lt;/p&gt;

&lt;p&gt;A good log should answer, at a minimum, these questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What happened?&lt;/li&gt;
&lt;li&gt;Where did it happen?&lt;/li&gt;
&lt;li&gt;In which context (which entity was affected)?&lt;/li&gt;
&lt;li&gt;What was the result (success, failure, fallback)?&lt;/li&gt;
&lt;li&gt;What was the impact on the system or the business?&lt;/li&gt;
&lt;li&gt;Why did it happen (when it can be identified)?&lt;/li&gt;
&lt;li&gt;What did the system do afterwards (retry, fallback, abort)?&lt;/li&gt;
&lt;li&gt;How do I trace this event (traceId, correlationId)?&lt;/li&gt;
&lt;li&gt;Is this expected or abnormal behavior?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;A log is not debug output.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A log is operational evidence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And bad evidence leads to wrong decisions.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
      <category>logs</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why did testing become a commodity?</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Mon, 06 Apr 2026 14:48:06 +0000</pubDate>
      <link>https://forem.com/julianomoreno/why-testing-became-a-commodity-114</link>
      <guid>https://forem.com/julianomoreno/why-testing-became-a-commodity-114</guid>
      <description>&lt;p&gt;We are entering a new phase in software engineering. AI can generate code, create tests, execute scenarios, and validate behavior at scale.&lt;/p&gt;

&lt;p&gt;Execution is no longer the hard part.&lt;/p&gt;

&lt;p&gt;This creates an uncomfortable reality: testing, as an activity, is losing value.&lt;/p&gt;

&lt;p&gt;QA will not stop testing. Testing will still happen, but it will focus on critical parts of the system — where failures really matter. Most of the time, QA will act at a system level and at a strategic level, understanding risk and deciding what to do about it.&lt;/p&gt;

&lt;p&gt;The problem is not speed. The problem is decision. More tests, executed faster, do not mean better quality. They often create a false sense of control.&lt;/p&gt;

&lt;p&gt;This is the point that few people are discussing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structural problem in QA today
&lt;/h2&gt;

&lt;p&gt;In my view, software engineering has a structural problem. QA professionals are still treated as executors of an activity that can be replaced by AI. In practice, what is being replaced is not the QA professional, but the superficial model that the market accepted for a long time.&lt;/p&gt;

&lt;p&gt;AI can generate code, create tests, execute scenarios, and validate behavior. This leads to a direct conclusion: testing has become a commodity. Everything that becomes a commodity loses value over time.&lt;/p&gt;

&lt;p&gt;The problem was never testing. The problem has always been understanding the system well enough to know where it can fail and what to do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  System-level QA: understanding to see risk
&lt;/h2&gt;

&lt;p&gt;Important failures do not appear only in the interface. They do not appear when testing API endpoints or changing payloads in isolation. The failures that really matter usually come from architecture, data flow, consistency, messaging, integrations, and side effects.&lt;/p&gt;

&lt;p&gt;This type of problem is not visible by looking at the screen or only checking requests and responses. Understanding the system is essential. Without a system view, there is no real risk analysis — only superficial validation, which is exactly what AI can do very well.&lt;/p&gt;

&lt;p&gt;AI is not eliminating QA professionals. It is eliminating QA that never went beyond the surface. A new level of understanding is required: a QA professional who understands architecture, understands data flow, sees distributed behavior, anticipates side effects, and identifies real failure points.&lt;/p&gt;

&lt;p&gt;This is the system-level QA profile. It does not focus only on validating behavior. It understands how the system really works and anticipates where it can fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic QA: deciding about risk
&lt;/h2&gt;

&lt;p&gt;Understanding the system is not enough. A professional can deeply understand the architecture, identify complex risks, and still not generate real impact. Value is not only in seeing the problem, but in deciding what to do about it.&lt;/p&gt;

&lt;p&gt;QA at this level involves prioritization, context, understanding business impact, defining the right level of evidence, and making conscious decisions about risk. Without these elements, technical knowledge becomes just opinion — and opinion does not scale.&lt;/p&gt;

&lt;p&gt;Real value is in the combination. System-level QA sees the risk, and strategic QA decides what to do about it. Without system understanding, the problem is not seen. Without decision, the problem remains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the current AI scenario, execution is no longer a differentiator. Any team using AI can produce fast. The difference is in who can make decisions without breaking what supports the business.&lt;/p&gt;

&lt;p&gt;This is not a testing problem. This is a decision problem.&lt;/p&gt;

&lt;p&gt;AI is not replacing QA professionals. It is separating two profiles: those who validate behavior and those who understand the system and make decisions based on risk. One will disappear, and the other will become essential.&lt;/p&gt;

&lt;p&gt;The question is no longer “how can I test better?” It becomes more fundamental: do I understand the system well enough to see risk? Can I turn that into decisions that protect the business?&lt;/p&gt;

&lt;p&gt;Without this capability, moving faster only makes failures happen faster.&lt;/p&gt;

</description>
      <category>qualityassurance</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>qa</category>
    </item>
    <item>
      <title>Why did testing become a commodity?</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Mon, 06 Apr 2026 14:43:00 +0000</pubDate>
      <link>https://forem.com/julianomoreno/por-que-testar-virou-commodity-21jp</link>
      <guid>https://forem.com/julianomoreno/por-que-testar-virou-commodity-21jp</guid>
      <description>&lt;p&gt;We are entering one of the most contradictory moments in software engineering. We have never produced so much, never produced so fast, and yet we have never been so close to losing control over what we are producing. AI is accelerating everything: code, tests, and operational decisions. This scenario exposes an uncomfortable finding: testing, as an activity, is losing value.&lt;/p&gt;

&lt;p&gt;Producing more does not mean delivering better, just as producing faster does not mean having more control. This is the point that few people are discussing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structural mistake in today's QA
&lt;/h2&gt;

&lt;p&gt;In my opinion, there is a structural mistake happening inside software engineering. QA professionals are still treated as executors of an activity that AI can replace when, in practice, what is being replaced is not the person working in QA, but the superficial model the market grew used to accepting.&lt;/p&gt;

&lt;p&gt;AI's current ability to generate code, generate tests, execute scenarios, and validate behavior forces a direct conclusion: testing has become a commodity. Everything that becomes a commodity loses value over time.&lt;/p&gt;

&lt;p&gt;The problem was never testing. The problem has always been understanding the system well enough to know where it can fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  System-level QA: understanding in order to see risk
&lt;/h2&gt;

&lt;p&gt;Relevant failures do not appear only in the interface. They do not arise from testing API endpoints or varying payloads in isolation. The failures that really matter usually emerge in the architecture, the data flow, consistency, messaging, integrations, and side effects.&lt;/p&gt;

&lt;p&gt;This kind of problem is not visible by looking at the screen, nor by analyzing only requests and responses. Understanding the system becomes essential. Without a system-level view there is no real risk analysis, only superficial validation, exactly the kind of activity AI performs best.&lt;/p&gt;

&lt;p&gt;AI is not eliminating people who work in QA. It is eliminating the QA that never went beyond the surface. This creates the need for a new level of understanding: a QA professional who comprehends architecture, understands data flow, sees distributed behavior, anticipates side effects, and identifies real points of failure.&lt;/p&gt;

&lt;p&gt;This is the system-level QA profile. It does not limit itself to validating behavior; it understands how the system really behaves and anticipates where it can fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic QA: deciding about risk
&lt;/h2&gt;

&lt;p&gt;Understanding the system is not enough. A professional can know the architecture deeply, identify complex risks, and still not generate real impact. The value is not only in seeing the problem, but in deciding what to do with it.&lt;/p&gt;

&lt;p&gt;QA work at this level involves prioritization, context, understanding business impact, defining the necessary level of evidence and, above all, a conscious assessment of risk. Without these elements, technical knowledge is reduced to opinion, and opinion does not scale.&lt;/p&gt;

&lt;p&gt;The real value lies at the intersection. System-level QA sees the risk, while strategic QA decides what to do with it. Without a system-level view, the problem goes unnoticed. Without a strategic view, the problem remains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the current AI scenario, execution stops being a differentiator, and so does speed. Any team with AI can produce fast. The difference now lies in who can change fast without breaking what sustains the business. It is a decision problem, not a testing problem.&lt;/p&gt;

&lt;p&gt;AI is not replacing QA professionals. It is separating two profiles: those who validate behavior and those who understand the system and decide based on risk. One tends to disappear, while the other tends to become essential.&lt;/p&gt;

&lt;p&gt;The question stops being “how do I test better?” and becomes more fundamental: understanding the system well enough to see risk and turning that into decisions that protect the business.&lt;/p&gt;

&lt;p&gt;Without that capability, accelerating only makes things break faster.&lt;/p&gt;

</description>
      <category>qa</category>
      <category>ia</category>
      <category>ai</category>
      <category>qualidadedesoftware</category>
    </item>
    <item>
      <title>More Than a Bug: How Root Cause Analysis Helped Payroll</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Thu, 17 Jul 2025 23:54:00 +0000</pubDate>
      <link>https://forem.com/julianomoreno/more-than-a-bug-how-root-cause-analysis-helped-payroll-2fd5</link>
      <guid>https://forem.com/julianomoreno/more-than-a-bug-how-root-cause-analysis-helped-payroll-2fd5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“The payroll is wrong. Many workers got double night bonus.”&lt;/p&gt;

&lt;p&gt;I saw this message in the support channel on Monday morning. I thought it was just another bug in production. But when we dug deeper, we saw the problem was more than a single &lt;code&gt;if&lt;/code&gt; in the code.&lt;/p&gt;

&lt;p&gt;It was time to stop fixing only what we could see. We needed to find &lt;strong&gt;the real reason&lt;/strong&gt; for the problem. It was time to use &lt;strong&gt;Root Cause Analysis (RCA)&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Note: Examples in this post use Brazilian labor law for night shift calculations.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Start: Just Another Bug?
&lt;/h2&gt;

&lt;p&gt;The team that works on the &lt;strong&gt;payroll module&lt;/strong&gt; in our ERP got a message.&lt;br&gt;
Clients said the &lt;strong&gt;night bonus was wrong&lt;/strong&gt; in the previous month.&lt;/p&gt;

&lt;p&gt;One developer checked and saw that people who &lt;strong&gt;clocked in between 10pm and midnight&lt;/strong&gt; got the bonus &lt;strong&gt;twice or on the wrong day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The code was using the time from the database in UTC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nightBonusStart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dayjs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clockIn&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our server uses UTC, but the data comes from Brazil (BRT). So “10pm” in Brazil became “1am” the next day in the database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I just need to fix the time, I will change and deploy.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Done. Payroll fixed. HR happy.&lt;br&gt;
&lt;strong&gt;But the next month, the bug was back in another part of the code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Then we understood: we were fixing &lt;strong&gt;symptoms&lt;/strong&gt;, not &lt;strong&gt;causes&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why RCA Is Important
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Root Cause Analysis (RCA)&lt;/strong&gt; helps you find &lt;strong&gt;the real reason&lt;/strong&gt; for a problem, not only what you can see.&lt;/p&gt;

&lt;p&gt;People use RCA in software, DevOps, quality, and also in health and industry.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How can I fix this bug now?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;RCA asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Why did this happen, and how can I stop it forever?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How We Used RCA
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Method: &lt;strong&gt;5 Whys&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We started with the simplest method: asking “why?” repeatedly until we reached the root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example in our case:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why was the night bonus wrong?&lt;/strong&gt;
Because the calculation used the wrong time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why was the time wrong?&lt;/strong&gt;
Because the time was saved in UTC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why did we use UTC without converting?&lt;/strong&gt;
Because the code used &lt;code&gt;dayjs&lt;/code&gt; without timezone configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why was the timezone not set?&lt;/strong&gt;
Because the new library (&lt;code&gt;day.js&lt;/code&gt;) does not adjust the timezone on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why did we not catch this earlier?&lt;/strong&gt;
Because the tests &lt;strong&gt;did not cover timezones or the hours between 10pm and midnight.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
The problem was not only the code. It was also &lt;strong&gt;missing timezone handling&lt;/strong&gt; and &lt;strong&gt;a missing test for this case&lt;/strong&gt;.&lt;/p&gt;
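The timezone mismatch is easy to reproduce even without day.js. Below is a minimal sketch using only the built-in Intl API; hourInZone is a hypothetical helper written for this post, not our production code:

```javascript
// Wall-clock hour of a UTC instant in a given IANA timezone (hypothetical helper).
function hourInZone(utcIso, timeZone) {
  const fmt = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23', // 0-23, so midnight is 0 and 10pm is 22
  });
  return Number(fmt.format(new Date(utcIso)));
}

// A clock-in stored as 01:00 UTC is 22:00 of the previous day in Brazil (BRT, UTC-3).
const clockIn = '2025-06-10T01:00:00Z';
const isNightShift = hourInZone(clockIn, 'America/Sao_Paulo') >= 22; // true
const naiveCheck = hourInZone(clockIn, 'UTC') >= 22;                 // false: the bug
```

The naive check reads hour 1 and misses the night window entirely, which is what our original comparison was doing on a UTC server.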




&lt;h3&gt;
  
  
  2. Method: &lt;strong&gt;Change Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We checked &lt;strong&gt;what changed right before the problem&lt;/strong&gt; appeared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we saw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two days before, a PR replaced &lt;code&gt;moment.js&lt;/code&gt; with &lt;code&gt;day.js&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;We wanted better performance.&lt;/li&gt;
&lt;li&gt;But &lt;code&gt;day.js&lt;/code&gt; &lt;strong&gt;does not support timezones by default&lt;/strong&gt;; it needs plugins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A small change in the code caused a big problem because we did not check the business rules.&lt;/p&gt;
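For reference, the setup the migration PR missed is small. A sketch of the plugin registration day.js needs before dayjs.tz() exists (the timestamp below is an illustrative value, not real data):

```javascript
// day.js has no timezone support until both plugins are registered.
const dayjs = require('dayjs');
const utc = require('dayjs/plugin/utc');
const timezone = require('dayjs/plugin/timezone');

dayjs.extend(utc);
dayjs.extend(timezone);

// With the plugins in place, a UTC instant can be read in the worker's zone:
// 01:00 UTC is 22:00 of the previous day in America/Sao_Paulo.
const hour = dayjs('2025-06-10T01:00:00Z').tz('America/Sao_Paulo').hour();
```

Without the two extend calls, dayjs.tz and .tz() simply do not exist, so the migrated code silently kept computing in server time.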




&lt;h3&gt;
  
  
  3. Method: &lt;strong&gt;Barrier Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We checked which protections should have stopped the bug but did not work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated tests?&lt;/strong&gt;&lt;br&gt;
They existed, but there was &lt;strong&gt;no test for timezones or night shifts&lt;/strong&gt;. The data was always 8am–5pm, never at night. So the tests “passed” while the real problem was still there.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code review?&lt;/strong&gt;&lt;br&gt;
It happened, but the reviewer was a junior dev who does not know payroll. There was no meeting between the dev and product teams, and rules like the night bonus and timezones were not clear to everyone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Acceptance test?&lt;/strong&gt;&lt;br&gt;
The dev team used only simple data, with &lt;strong&gt;no test for the night shift&lt;/strong&gt;, and the product team did not check before production.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Many protections failed. The problem was in the process, not only in the code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  4. Complementary Technique: &lt;strong&gt;Ishikawa Diagram (Fishbone Diagram)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This technique helps &lt;strong&gt;visualize categories of causes&lt;/strong&gt; around the problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;People&lt;/td&gt;
&lt;td&gt;Reviewer with no payroll experience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;CI without timezone-related tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process&lt;/td&gt;
&lt;td&gt;Functional staging not required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment&lt;/td&gt;
&lt;td&gt;Server uses UTC, users are in BRT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;Seed does not cover night shift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;No coverage for calculations between 10pm and 5am&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This reinforces that &lt;strong&gt;the problem is multifactorial&lt;/strong&gt;, not just a simple isolated bug.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;See below how we represented this in an Ishikawa diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yb5pvp5bkhg6cajljff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yb5pvp5bkhg6cajljff.png" alt="Fishbone diagram showing the multifactorial causes of incorrect night shift calculation across People, Tools, Process, Environment, Data, and Tests." width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s worth noting that the diagram was also refined over time. Some causes were condensed to avoid redundancy (for example, “weak staging” and “no calculation review”), while others were highlighted in more detail, such as misalignment with the Product team. This exercise helps the team visualize causal relationships and make decisions on where to act first.&lt;/p&gt;

&lt;h4&gt;
  
  
  How we built the Ishikawa Diagram
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Ishikawa Diagram&lt;/strong&gt;, also called the &lt;strong&gt;Fishbone Diagram&lt;/strong&gt;, is a visual technique used to organize the causes of a problem into categories.&lt;/p&gt;

&lt;p&gt;In this case, the problem analyzed was:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Night shift allowance calculated incorrectly"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From there, we identified &lt;strong&gt;six major categories of causes&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;People&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Process&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tests&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For each category, we listed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Detail:&lt;/strong&gt; direct causes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-detail:&lt;/strong&gt; more specific causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, in the &lt;strong&gt;People&lt;/strong&gt; category:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Detail: Junior reviewer&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-detail: No experience in payroll&lt;/li&gt;
&lt;li&gt;Sub-detail: Undocumented rule&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This process was conducted collaboratively with the team, combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;5 Whys&lt;/strong&gt; technique&lt;/li&gt;
&lt;li&gt;Change analysis&lt;/li&gt;
&lt;li&gt;Code inspection&lt;/li&gt;
&lt;li&gt;Identification of failed safeguards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By visualizing the causes this way, it became clear that &lt;strong&gt;it wasn’t just a bug in the code&lt;/strong&gt;, but rather a &lt;strong&gt;chain of organizational, technical, and communication failures&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems and Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original Problem&lt;/th&gt;
&lt;th&gt;How We Fixed It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Product must review PRs&lt;/td&gt;
&lt;td&gt;Product does not review code, but it must check the rules. For sensitive code (payroll, finance), hold a meeting and check the rules together before the final approval.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Seed” not explained&lt;/td&gt;
&lt;td&gt;Now we explain it: a seed is fake data used for testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“DST ignored” could confuse&lt;/td&gt;
&lt;td&gt;Now we say it explicitly: there was no test for Daylight Saving Time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No test for 10pm–5am&lt;/td&gt;
&lt;td&gt;We kept this item and explained the night bonus in more detail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Simple Takeaway
&lt;/h2&gt;

&lt;p&gt;The diagram showed that the bug did not come from only one place, but from &lt;strong&gt;many parts of the system not working together&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quick fixes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fix the code to use timezone:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="nx"&gt;dayjs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clockIn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;America/Sao_Paulo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Redo payroll for all affected clients.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Prevention:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Add new tests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10pm, midnight, 5am&lt;/li&gt;
&lt;li&gt;Different timezones (BRT, UTC)&lt;/li&gt;
&lt;li&gt;Daylight Saving Time (DST)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Make a checklist for all PRs that change payroll, calculation, or finance rules.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Always set the timezone in the API, backend, and tests.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Add the product team to check sensitive PRs.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Tell This Story?
&lt;/h2&gt;

&lt;p&gt;We did not write a boring postmortem. We told a story.&lt;/p&gt;

&lt;p&gt;This helps to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the team’s attention&lt;/li&gt;
&lt;li&gt;Help non-technical people understand&lt;/li&gt;
&lt;li&gt;Make lessons easy to remember&lt;/li&gt;
&lt;li&gt;Build a team that wants to improve&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Lessons: RCA Is Necessary
&lt;/h2&gt;

&lt;p&gt;Fixing bugs without knowing the real reason is like &lt;strong&gt;mopping up the water without fixing the leak&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RCA helps to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop fixing the same bug again and again&lt;/li&gt;
&lt;li&gt;Learn something every time&lt;/li&gt;
&lt;li&gt;Build stronger systems&lt;/li&gt;
&lt;li&gt;Improve your work step by step&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  RCA Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Did we stop and think about the error?&lt;/li&gt;
&lt;li&gt;[ ] Did we check what changed before?&lt;/li&gt;
&lt;li&gt;[ ] Did we use the 5 Whys?&lt;/li&gt;
&lt;li&gt;[ ] Did we check which protections failed?&lt;/li&gt;
&lt;li&gt;[ ] Did we write a postmortem to share?&lt;/li&gt;
&lt;li&gt;[ ] Did we do prevention, not only a fix?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This “timezone bug” showed us problems in our code, our process, and our teamwork.&lt;/p&gt;

&lt;p&gt;RCA helped us &lt;strong&gt;work better, not just fix a bug&lt;/strong&gt;. This is the difference between teams that only fix and teams that grow.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you only fixing bugs, or fixing the real problem?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>softwarequality</category>
      <category>rootcause</category>
      <category>bugfix</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>More Than a Bug: How Root Cause Analysis Saved the Payroll</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Wed, 16 Jul 2025 17:05:34 +0000</pubDate>
      <link>https://forem.com/julianomoreno/muito-alem-do-bug-como-a-analise-de-causa-raiz-salvou-a-folha-de-pagamento-1gm4</link>
      <guid>https://forem.com/julianomoreno/muito-alem-do-bug-como-a-analise-de-causa-raiz-salvou-a-folha-de-pagamento-1gm4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“The payroll came out wrong. Several employees received double the night-shift allowance.”&lt;/p&gt;

&lt;p&gt;That sentence, which arrived as a message in the support channel on a Monday morning, looked like just another bug in production. But as the team dug into the investigation, we realized the problem went far beyond a misplaced &lt;code&gt;if&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It was time to stop patching holes and start understanding &lt;strong&gt;the real why&lt;/strong&gt; behind the failures. It was time to apply &lt;strong&gt;Root Cause Analysis&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The beginning: a bug like any other
&lt;/h2&gt;

&lt;p&gt;The team responsible for the &lt;strong&gt;payroll module&lt;/strong&gt; of our ERP received the notification:&lt;br&gt;
clients were reporting &lt;strong&gt;incorrect night-shift allowance amounts&lt;/strong&gt; for the previous month.&lt;/p&gt;

&lt;p&gt;After an initial analysis, the on-call dev identified that employees who &lt;strong&gt;clocked in between 10pm and midnight&lt;/strong&gt; were having the allowance &lt;strong&gt;duplicated or shifted to the next day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the code, the calculation logic was based on the UTC time from the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inicioAdicional&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dayjs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entrada&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the server ran in UTC and the data came in BRT, a time recorded as "10pm" in Brazil showed up as "1am" of the next day in the database.&lt;/p&gt;

&lt;p&gt;“Ah, it’s just a matter of adjusting the time conversion. I’ll fix it and deploy.”&lt;/p&gt;

&lt;p&gt;Done. The payroll was reprocessed. HR calmed down. Life went on.&lt;br&gt;
&lt;strong&gt;But the following month… the error came back, only in another part of the calculation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is when it clicked: we were treating &lt;strong&gt;effects&lt;/strong&gt;, not &lt;strong&gt;the cause&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why does Root Cause Analysis (RCA) matter so much?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Root Cause Analysis&lt;/strong&gt; is a technique used to identify the &lt;strong&gt;true origin&lt;/strong&gt; of a problem, not just its visible manifestation.&lt;/p&gt;

&lt;p&gt;It is widely used in software engineering, SRE, DevOps, and quality, and even in sectors such as healthcare, safety, and industry.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I fix this bug right now?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;RCA proposes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Why did this happen in the first place, and what can I change so it never happens again?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How we applied RCA in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Technique: &lt;strong&gt;5 Whys (Five Whys)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We started with the simplest technique: repeatedly asking “why?” until we reached the root cause.&lt;/p&gt;

&lt;h4&gt;
  
  
  Applying it to our case:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why were the night-shift allowances wrong?&lt;/strong&gt;&lt;br&gt;
Because they were being calculated for the wrong time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why did the calculation use the wrong time?&lt;/strong&gt;&lt;br&gt;
Because the clock-in time was recorded in UTC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why were we using UTC without converting?&lt;/strong&gt;&lt;br&gt;
Because the calculation function used &lt;code&gt;dayjs&lt;/code&gt; without timezone configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why was the timezone not configured correctly?&lt;/strong&gt;&lt;br&gt;
Because the new lib (&lt;code&gt;day.js&lt;/code&gt;) does not adjust the timezone automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why was this not detected earlier?&lt;/strong&gt;&lt;br&gt;
Because the automated tests &lt;strong&gt;did not cover timezones or critical hours such as 10pm or midnight.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; the problem was not merely technical. It was &lt;strong&gt;structural&lt;/strong&gt;: missing timezone configuration plus test gaps.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  2. Technique: &lt;strong&gt;Change Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here, we looked for &lt;strong&gt;what had changed recently&lt;/strong&gt; that could have caused the problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  Applying it:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Two days before the incident, a PR was merged that replaced &lt;code&gt;moment.js&lt;/code&gt; with &lt;code&gt;day.js&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The goal was to shrink the bundle and improve performance.&lt;/li&gt;
&lt;li&gt;However, &lt;code&gt;day.js&lt;/code&gt; &lt;strong&gt;has no native timezone support&lt;/strong&gt;: it requires the &lt;code&gt;utc&lt;/code&gt; and &lt;code&gt;timezone&lt;/code&gt; plugins.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;A small technical change ended up causing a critical functional impact because &lt;strong&gt;the business context was not considered&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  3. Technique: &lt;strong&gt;Barrier Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This technique helps us identify &lt;strong&gt;which protections failed&lt;/strong&gt; and let the bug escape.&lt;/p&gt;

&lt;h4&gt;
  
  
  Applying it:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated tests?&lt;/strong&gt;&lt;br&gt;
They existed, but &lt;strong&gt;did not cover scenarios with timezones or night shifts.&lt;/strong&gt; The simulated data used in development followed a fixed 8am–5pm schedule, which did not reflect real cases of night work or shifts crossing midnight. As a result, the automated tests did not cover scenarios with the night-shift allowance, date rollovers, or DST (daylight saving time). This created a false sense of safety, since the tests “passed” even with the wrong rule in the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code review?&lt;/strong&gt;&lt;br&gt;
It happened, but it was done by a junior reviewer with no payroll knowledge. Another important point was the lack of alignment between the technical and product teams. The absence of a shared understanding of the business rules (such as the night-shift allowance, shifts crossing dates, and timezone impact) made the failure harder to prevent. When knowledge is isolated or implicit within a single team, technical decisions end up overlooking fundamental functional aspects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Functional acceptance testing?&lt;/strong&gt;&lt;br&gt;
It was performed by the technical team with generic data, &lt;strong&gt;without simulating real cases such as a night shift.&lt;/strong&gt; The deploy went straight to production without validation by the product team.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Multiple barriers failed in sequence, revealing fragility in the process, not just in the code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  4. Complementary technique: &lt;strong&gt;Ishikawa Diagram (Fishbone)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This technique helps us &lt;strong&gt;visualize categories of causes&lt;/strong&gt; around the problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Categoria&lt;/th&gt;
&lt;th&gt;Causa&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pessoas&lt;/td&gt;
&lt;td&gt;Revisor sem experiência em folha de pagamento&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ferramentas&lt;/td&gt;
&lt;td&gt;CI sem testes com timezone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processo&lt;/td&gt;
&lt;td&gt;Homologação funcional não exigida&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ambiente&lt;/td&gt;
&lt;td&gt;Servidor usa UTC, usuários estão em BRT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dados&lt;/td&gt;
&lt;td&gt;Seed não contempla jornada noturna&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testes&lt;/td&gt;
&lt;td&gt;Sem cobertura de cálculos entre 22h e 05h&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This reinforces that &lt;strong&gt;the problem is multifactorial&lt;/strong&gt;, not a simple isolated bug.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;See below how we represented this in an Ishikawa diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzv21pjracm2lwxnj3ku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzv21pjracm2lwxnj3ku.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is worth noting that the diagram itself also went through refinements. Some causes were condensed to avoid redundancy (for example, “weak acceptance testing” and “no calculation review”), while others were given more emphasis. This exercise helps the team visualize cause relationships and decide where to act first.&lt;/p&gt;

&lt;h4&gt;
  
  
  How we built the Ishikawa Diagram
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Ishikawa Diagram&lt;/strong&gt;, also called the &lt;strong&gt;Fishbone Diagram&lt;/strong&gt;, is a visual technique used to organize the causes of a problem into categories.&lt;/p&gt;

&lt;p&gt;In this case, the problem analyzed was:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Adicional noturno calculado errado"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From it, we identified &lt;strong&gt;6 major categories of causes&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;People&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Process&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tests&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For each category, we listed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Detail&lt;/strong&gt;: direct causes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-detail&lt;/strong&gt;: more specific causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, in the &lt;strong&gt;People&lt;/strong&gt; category:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Detail: Junior reviewer&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-detail: No payroll experience&lt;/li&gt;
&lt;li&gt;Sub-detail: Rule not documented&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This process was done together with the team, combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;5 Whys&lt;/strong&gt; technique
&lt;/li&gt;
&lt;li&gt;Change analysis&lt;/li&gt;
&lt;li&gt;Code inspection&lt;/li&gt;
&lt;li&gt;Mapping the barriers that failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By visualizing the causes this way, it became clear that &lt;strong&gt;this was not just an error in a line of code&lt;/strong&gt;, but a &lt;strong&gt;chain of organizational, technical, and communication failures&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contradictions corrected:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problema Original&lt;/th&gt;
&lt;th&gt;Correção Aplicada&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Produto deveria revisar PR&lt;/td&gt;
&lt;td&gt;Embora o time de Produto não revise código diretamente, é fundamental que ele esteja envolvido na validação das regras de negócio implementadas. Para PRs que afetam processos sensíveis como folha ou financeiro, o time técnico pode incluir uma etapa de validação funcional conjunta com Produto antes da homologação final. Essa colaboração evita que regras críticas passem despercebidas por quem entende mais do código do que da lógica do negócio.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Termo “seed” não explicado&lt;/td&gt;
&lt;td&gt;Agora explicado como conjunto de dados simulados para testes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“DST ignorado” poderia confundir&lt;/td&gt;
&lt;td&gt;Explicado como ausência de testes em dias de ajuste de horário de verão&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Didactic conclusion
&lt;/h2&gt;

&lt;p&gt;At the end of the analysis, the diagram served as a &lt;strong&gt;visual map of the failure&lt;/strong&gt;. It showed that the error did not come from a single place, but from &lt;strong&gt;several parts of the system that were not talking to each other well&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we did differently this time
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Immediate corrective actions:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We fixed the calculation with explicit timezone support:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;dayjs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entrada&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;America/Sao_Paulo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
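&lt;p&gt;Note that the dayjs one-liner above requires the utc and timezone plugins to be loaded. For a dependency-free version of the same check, Node's built-in Intl API works too (a sketch; the function name and the 22:00 threshold are ours):&lt;/p&gt;

```javascript
// Wall-clock hour of an instant in a given IANA timezone, using only
// the built-in Intl API (no dayjs required).
function hourInZone(isoInstant, timeZone) {
  const hour = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: '2-digit',
    hourCycle: 'h23',
  }).format(new Date(isoInstant));
  return Number(hour);
}

// Same check as the dayjs snippet: is this clock-in at or after 22:00
// in the employee's timezone rather than the server's UTC?
const isNightEntry = (isoInstant) =>
  hourInZone(isoInstant, 'America/Sao_Paulo') >= 22;

// 01:00 UTC is 22:00 of the previous day in São Paulo (UTC-3):
isNightEntry('2025-06-11T01:00:00Z'); // true
```

&lt;p&gt;The key point is the same in either version: the hour must be taken in the user's timezone, never from the server clock running in UTC.&lt;/p&gt;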



&lt;ul&gt;
&lt;li&gt;We reprocessed the payroll runs of the affected customers.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Preventive actions (for real):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We added automated tests with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boundary times (22:00, 00:00, 05:00)&lt;/li&gt;
&lt;li&gt;Different timezones (BRT, UTC, etc.)&lt;/li&gt;
&lt;li&gt;Daylight saving time (DST) cases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;We created a &lt;strong&gt;mandatory checklist for PRs&lt;/strong&gt; that change sensitive business rules (payroll, calculations, finance).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;We standardized timezone handling across the API, backend, and test layers.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;We brought product analysts in to review the rules on sensitive PRs.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
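&lt;p&gt;The boundary times listed above map naturally to table-driven tests. A minimal sketch, assuming a night window of 22:00–05:00 local time (the helper below is illustrative; the real payroll rule has more inputs):&lt;/p&gt;

```javascript
// Illustrative rule: an hour counts as "night" if it falls in [22:00, 05:00).
const isNightHour = (h) => h >= 22 || h < 5;

// Table-driven boundary cases: exactly the hours the original suite missed.
const cases = [
  { hour: 21, night: false }, // just before the window opens
  { hour: 22, night: true },  // window opens (inclusive)
  { hour: 0,  night: true },  // midnight rollover
  { hour: 4,  night: true },  // last full night hour
  { hour: 5,  night: false }, // window closes (exclusive)
];

for (const { hour, night } of cases) {
  if (isNightHour(hour) !== night) {
    throw new Error(`night-window boundary failed at hour ${hour}`);
  }
}
```

&lt;p&gt;The same table style extends to timezone and DST cases: add rows carrying an instant plus a timezone, and assert on the converted hour.&lt;/p&gt;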




&lt;h2&gt;
  
  
  The value of a postmortem with storytelling
&lt;/h2&gt;

&lt;p&gt;Instead of documenting this incident in a dry, technical postmortem, we opted for a format with &lt;strong&gt;storytelling&lt;/strong&gt;, narrating what happened as a story.&lt;/p&gt;

&lt;p&gt;This helped to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engage the team&lt;/li&gt;
&lt;li&gt;Make it easier for non-technical people to understand&lt;/li&gt;
&lt;li&gt;Make the learning more memorable&lt;/li&gt;
&lt;li&gt;Feed an engineering culture focused on continuous improvement&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final lessons: RCA is not a luxury, it is survival
&lt;/h2&gt;

&lt;p&gt;Fixing bugs without understanding their cause is like &lt;strong&gt;mopping the floor without fixing the leaky pipe&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RCA helps to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop fighting the same fires over and over&lt;/li&gt;
&lt;li&gt;Learn from every failure&lt;/li&gt;
&lt;li&gt;Build a more robust system&lt;/li&gt;
&lt;li&gt;Evolve processes sustainably&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A checklist for applying RCA on your team
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Did we stop to reflect on the error?&lt;/li&gt;
&lt;li&gt;[ ] Did we investigate what changed in the system?&lt;/li&gt;
&lt;li&gt;[ ] Did we apply the “5 Whys”?&lt;/li&gt;
&lt;li&gt;[ ] Did we map the barriers that failed?&lt;/li&gt;
&lt;li&gt;[ ] Did we produce a clear, shareable postmortem?&lt;/li&gt;
&lt;li&gt;[ ] Did we take real preventive actions?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The bug that started as a “timezone problem” revealed &lt;strong&gt;technical gaps, process failures, and a lack of business alignment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;More than resolving the incident, we used RCA to &lt;strong&gt;improve the way we work&lt;/strong&gt;, and that is what separates teams that merely fix bugs from those that truly evolve.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And you? Are you putting out fires or solving the right problem?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>rootcause</category>
      <category>qa</category>
      <category>softwaretesting</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>What still prevents QA from being strategic? (And how to change the game)</title>
      <dc:creator>Juliano Moreno</dc:creator>
      <pubDate>Fri, 27 Jun 2025 15:39:35 +0000</pubDate>
      <link>https://forem.com/julianomoreno/what-still-prevents-qa-from-being-strategic-and-how-to-change-the-game-1kep</link>
      <guid>https://forem.com/julianomoreno/what-still-prevents-qa-from-being-strategic-and-how-to-change-the-game-1kep</guid>
      <description>&lt;p&gt;&lt;strong&gt;By Juliano Moreno&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;QA Specialist | Platform Engineering&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Everyone already understands (or pretends to understand) that quality is everyone's responsibility. That testing isn't just done at the end. That automation is essential. That QA isn't a phase, it's a flow.&lt;/p&gt;

&lt;p&gt;But the question remains relevant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the theory is so clear, why does nothing change in practice?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Squads continue to outsource quality to QA. Developers keep pushing code without coverage. PMs continue to prioritize speed and deliver technical debt. And QA is still called at the end with a simple: "Can you validate this for us?"&lt;/p&gt;

&lt;p&gt;The culture of quality remains just talk.&lt;/p&gt;




&lt;h2&gt;
  
  
  The biggest barrier isn't technical. It's a lazy mindset. And a lack of courage.
&lt;/h2&gt;

&lt;p&gt;What's missing isn't a tool. It's commitment.&lt;br&gt;
It's not more pipelines that will save us. It's a change in behavior.&lt;/p&gt;

&lt;p&gt;Companies say quality is a priority — but they keep QA isolated.&lt;br&gt;
Developers say they test — but they leave 80% of coverage for QA.&lt;br&gt;
PMs say they trust the team — but they don't involve QA in conception.&lt;br&gt;
And QAs themselves often wait to be called instead of prompting change.&lt;/p&gt;




&lt;h2&gt;
  
  
  The truth is simple (and uncomfortable):
&lt;/h2&gt;

&lt;p&gt;As long as QA is seen as the "delivery guarantor," engineering will continue to make mistakes with confidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three behaviors that still hinder evolution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. QA as a ready-code inspector
&lt;/h3&gt;

&lt;p&gt;If you call QA after everything has been delivered, you don't want quality.&lt;br&gt;
You want someone to clean up the mess.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Developer as a feature delivery person
&lt;/h3&gt;

&lt;p&gt;Code without tests, without rollback, without contract, without traceability.&lt;br&gt;
But the commit is there: "finished successfully."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. PM as a backlog dispatcher
&lt;/h3&gt;

&lt;p&gt;"Deliver first, then we'll see if it's good."&lt;br&gt;
The customer experience is left to fend for itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What each role needs to face head-on
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Developers
&lt;/h3&gt;

&lt;p&gt;If you think testing is QA's job, you're not an engineer.&lt;br&gt;
You're a code assembler hoping it works out.&lt;br&gt;
&lt;strong&gt;If your code has no coverage, it's not ready. Period.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Tech Leads
&lt;/h3&gt;

&lt;p&gt;If you don't demand tests, don't prioritize observability, and don't ensure testable architecture, you're leading the squad towards silent chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The responsibility for failures that only appear in production is yours.&lt;/strong&gt;&lt;br&gt;
Don't outsource it to QA.&lt;/p&gt;




&lt;h3&gt;
  
  
  Product Managers / POs
&lt;/h3&gt;

&lt;p&gt;If your measure of success is just on-time delivery, you're delivering wrong.&lt;br&gt;
A feature with a bug isn't delivery.&lt;br&gt;
It's cost, rework, and loss of trust disguised as value.&lt;/p&gt;




&lt;h3&gt;
  
  
  QAs
&lt;/h3&gt;

&lt;p&gt;If you're still waiting to be triggered to "validate," you're behind.&lt;br&gt;
If you don't master tools, automation, test architecture, CI/CD, and risk analysis, your relevance is on a countdown.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You're not here to validate.&lt;br&gt;
You're here to build quality together — from the start.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Do we really need a QA Coach?
&lt;/h2&gt;

&lt;p&gt;The concept of a QA Coach can be valuable in contexts where the quality culture is still in its early stages. However, it's important to reflect: if &lt;strong&gt;Quality Engineers and QA Managers&lt;/strong&gt; can operate at their full potential, fostering the culture and practices from conception, the role of a QA Coach can naturally integrate into their responsibilities. Perhaps the issue isn't creating a new function, but rather empowering and demanding more from existing ones. What we truly need is more &lt;strong&gt;courage to apply what we already know.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The right structure already exists — but needs to be taken seriously
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strategic Quality Engineer
&lt;/h3&gt;

&lt;p&gt;Translates quality policy into concrete technical decisions.&lt;br&gt;
They don't execute tests; they provoke, structure, and ensure the team delivers reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  QA Lead / Manager
&lt;/h3&gt;

&lt;p&gt;Defines strategy, governs practices, and measures the real impact of quality.&lt;br&gt;
If they're limited to tracking bugs per squad, they're underutilized.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real transformation only begins when everyone stops hiding behind QA
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A developer who doesn't test is scheduled rework.&lt;/li&gt;
&lt;li&gt;A PM who ignores QA in discovery is prioritizing failure.&lt;/li&gt;
&lt;li&gt;A Tech Lead who doesn't demand technical quality is just managing delivery, not leading.&lt;/li&gt;
&lt;li&gt;A QA who doesn't challenge the team is settling for the role of a filter.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Senior leadership is the pillar of quality
&lt;/h2&gt;

&lt;p&gt;Quality culture doesn't just sprout from the ground. It is &lt;strong&gt;sown and cultivated by leadership&lt;/strong&gt;. If directors, VPs, and CTOs don't embody quality, don't demand it, and don't reward it, all foundational efforts will be in vain. It is the role of senior management to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define &lt;strong&gt;quality as a strategic and non-negotiable priority&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Invest in &lt;strong&gt;training and upskilling&lt;/strong&gt; to raise the technical bar for all of engineering.&lt;/li&gt;
&lt;li&gt;Adjust &lt;strong&gt;incentive systems&lt;/strong&gt;: Reward teams for quality and stability, not just raw delivery speed. Costly production errors should lead to reflection and learning, not just temporary fixes.&lt;/li&gt;
&lt;li&gt;Break down silos: Encourage collaboration and shared responsibility, making it clear that quality is a &lt;strong&gt;success metric for the entire product&lt;/strong&gt;, not just a single team or individual.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Change hurts, but the cost of non-quality hurts more
&lt;/h2&gt;

&lt;p&gt;Thinking that "speed" means delivering anything quickly is an expensive illusion. A bug in production isn't just an inconvenience; it's &lt;strong&gt;direct damage&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rework: The time spent fixing post-launch bugs is far greater than preventing them early on.&lt;/li&gt;
&lt;li&gt;Revenue loss: Frustrated customers migrate to competitors.&lt;/li&gt;
&lt;li&gt;Brand damage: Your company's reputation is harmed.&lt;/li&gt;
&lt;li&gt;Operational cost: Incidents generate overtime, on-call teams, and can even lead to fines or compliance issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;View quality as a &lt;strong&gt;strategic investment&lt;/strong&gt;, not a cost. Every hour invested in prevention is a saving of days (or weeks!) of future headaches and losses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Who has the courage to build quality?
&lt;/h2&gt;

&lt;p&gt;QA won't disappear. But those who don't evolve will.&lt;br&gt;
The squad that still treats QA as the final stage has already lost. They just haven't realized it yet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The question isn't "who tests."&lt;br&gt;
The question is: &lt;strong&gt;who has the courage to ensure that quality is being built every day — from the first commit?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And, more importantly: &lt;strong&gt;who has the courage to be the voice of quality, even when it's unpopular or challenges the status quo?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>strategicqa</category>
      <category>sharedresponsibility</category>
      <category>softwarequality</category>
      <category>qualityengineering</category>
    </item>
  </channel>
</rss>
