<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aman Agrawal</title>
    <description>The latest articles on Forem by Aman Agrawal (@explorer14).</description>
    <link>https://forem.com/explorer14</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F443792%2Fc2bd3e89-8b88-45f5-baf2-5ec8df092368.jpeg</url>
      <title>Forem: Aman Agrawal</title>
      <link>https://forem.com/explorer14</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/explorer14"/>
    <language>en</language>
    <item>
      <title>How We Reorganised Engineering Teams at Coolblue for Better Ownership and Business Alignment</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 07 Feb 2024 20:28:24 +0000</pubDate>
      <link>https://forem.com/coolblue/how-we-reorganised-engineering-teams-at-coolblue-for-better-ownership-and-business-alignment-34gb</link>
      <guid>https://forem.com/coolblue/how-we-reorganised-engineering-teams-at-coolblue-for-better-ownership-and-business-alignment-34gb</guid>
      <description>&lt;p&gt;In this post, I will share my experiences leveraging Domain Driven Design strategies and Team Topologies to reorganise two product engineering teams in the Purchasing domain at &lt;a href="https://www.coolblue.nl/" rel="noopener noreferrer"&gt;Coolblue&lt;/a&gt; &lt;em&gt;(one of the largest e-commerce companies in the Netherlands)&lt;/em&gt;, along business capabilities to improve team autonomy, reduce cognitive load on teams and improve our architecture to better align with our business.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: I am not an expert in Team Topologies; I have only read the book twice and spoken to one member of the core Team Topologies team. I am always looking to learn more about effectively applying these ideas, and this post just describes one of the ways we applied them to our problem space. YMMV!🙂&lt;/p&gt;

&lt;h3&gt;
  
  
  Context
&lt;/h3&gt;

&lt;p&gt;The Purchasing domain is one of the largest at Coolblue in terms of the business capabilities we support and the number of engineering teams &lt;em&gt;(4 as of this writing, possibly growing in the future)&lt;/em&gt;, and it has one very critical goal: to ensure we have the right kind of stock available to sell in our central warehouse at all times, without over- or under-stocking, and to secure the most favourable vendor agreements to improve the profitability of our purchases. Our primary stakeholders are supply planners and buyers in the various product category teams that are responsible for the various categories of products we sell.&lt;/p&gt;

&lt;p&gt;We buy stock for tens of thousands of products to meet our growing customer demand, so it’s absolutely critical not only that we are able to make good buying decisions (which relies on a lot of data delivered in a timely manner from across the organisation), but also that we’re able to manage pending deliveries and delivered stock efficiently and effectively (which relies on timely and accurate communication with suppliers).&lt;/p&gt;

&lt;h3&gt;
  
  
  Growth of the Purchasing Domain
&lt;/h3&gt;

&lt;p&gt;Based on &lt;a href="https://vladikk.com/2018/01/26/revisiting-the-basics-of-ddd/" rel="noopener noreferrer"&gt;strategic Domain Driven Design terminology&lt;/a&gt;, Purchasing would be categorised as a supporting domain, i.e. Purchasing capabilities are not our core differentiator. The workings of the domain are completely opaque to end customers. Most organisations will have similar purchasing processes and often similar systems &lt;em&gt;(sometimes these systems are bought instead of being built).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, over the last 10 years the Purchasing domain has also increased in complexity as we have expanded our business capabilities: data science, EDI integration, supplier performance measurement, stock management, store replenishment, purchasing agreements and rebates, etc. We have come to rely on more accurate and timely data to make critical purchasing decisions. Being able to quickly adapt our purchasing strategies during COVID-19 helped us stay on track with our business goals. For the most part we have built our own software, due to the need to tackle this increased complexity, maintain agility in the face of global upset events and integrate with the rest of Coolblue more effectively and efficiently. The following sub-domain map shows a very high level composition of the Purchasing domain:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/02/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;High level break down of the Purchasing domain (simplified)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For this post, I will be focussing on the Supply sub-domain (shown in blue above) where we redesigned the engineering team organisation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain vs Sub-Domain vs Bounded Contexts vs Teams
&lt;/h3&gt;

&lt;p&gt;In DDD terminology, a &lt;strong&gt;sub-domain&lt;/strong&gt; is a part of the &lt;strong&gt;domain&lt;/strong&gt; with a specific, logically related subset of the overall business responsibilities, and contributes towards the overall success of the domain. A domain can have multiple sub-domains, as you can see in the visual above. A sub-domain is a part of the problem space.&lt;/p&gt;

&lt;p&gt;Sometimes it can be a bit difficult to differentiate between a domain and a sub-domain. From my pov, it’s all just domains. If a domain is large and complex enough, we tend to break it down into discrete areas of responsibility and capability called sub-domains. But I don’t think this is a hard and fast rule.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;&lt;a href="https://martinfowler.com/bliki/BoundedContext.html" rel="noopener noreferrer"&gt;bounded context&lt;/a&gt;&lt;/strong&gt; is the one and only place where the solution (often software) to a specific business problem lives, and where the terminology is consistent in its usage and meaning. It represents an area of applicability of a domain model. E.g. the &lt;em&gt;Supplier Price and Availability&lt;/em&gt; context will have software systems that know how to provide supplier prices and stock availability on a day to day basis. These terms have an unambiguous meaning in this context. The model that helps solve the problem of prices and stock availability is largely only applicable here and shouldn’t be copied into other bounded contexts, because that would duplicate knowledge in multiple places and introduce data inconsistencies, leading to expensive-to-fix bugs. Bounded contexts therefore provide a way to encapsulate the complexities of a business concept and only expose well defined interfaces for others to interact with.&lt;/p&gt;
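
&lt;p&gt;To make the idea of a well defined interface a little more concrete, here is a minimal, hypothetical C# sketch of the kind of contract the &lt;em&gt;Supplier Price and Availability&lt;/em&gt; context could expose to its consumers &lt;em&gt;(the names and shapes are illustrative only, not our actual API)&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System.Threading.Tasks;

// Hypothetical public contract of the Supplier Price and Availability context.
// Consumers only see this interface; the domain model behind it stays inside
// the bounded context and is free to change without breaking consumers.
public interface ISupplierPriceAndAvailability
{
    Task&amp;lt;SupplierQuote&amp;gt; GetQuoteAsync(string supplierId, string productId);
}

// A deliberately small read model that crosses the context boundary.
public record SupplierQuote(string ProductId, decimal Price, int AvailableQuantity);
&lt;/code&gt;&lt;/pre&gt;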

&lt;p&gt;In an ideal world each sub-domain will map to exactly one bounded context owned and supported by exactly one team, but in reality multiple bounded contexts can be assigned to a sub-domain and one team might be supporting multiple bounded contexts and often multiple software systems in those contexts.&lt;/p&gt;

&lt;p&gt;Here’s an illustration of this organisation &lt;em&gt;(names are for illustrative purposes only)&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-4.png%3Fw%3D877"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An illustration of relationship between domain, sub-domain and bounded contexts (assume one team per sub-domain)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I am not going to go into the depths of strategic DDD but &lt;a href="https://vladikk.com/2018/01/26/revisiting-the-basics-of-ddd/" rel="noopener noreferrer"&gt;here&lt;/a&gt; &lt;a href="https://github.com/ddd-crew/ddd-starter-modelling-process?tab=readme-ov-file#understand" rel="noopener noreferrer"&gt;are&lt;/a&gt; some &lt;a href="https://medium.com/nick-tune-tech-strategy-blog/domains-subdomain-problem-solution-space-in-ddd-clearly-defined-e0b49c7b586c" rel="noopener noreferrer"&gt;excellent&lt;/a&gt; places to &lt;a href="https://verraes.net/#blog" rel="noopener noreferrer"&gt;study&lt;/a&gt; it and understand it better. The strategic aspects of DDD are really quite crucial to understand in order to design software systems that align well with business expectations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Old Team Structure
&lt;/h3&gt;

&lt;p&gt;Simply put, the Supply sub-domain is primarily responsible for creating and sending appropriate purchase orders for the products we want to buy to our suppliers, and managing their lifecycle to completion. There are of course ancillary stock administration responsibilities that this sub-domain handles as well, but not all of those have been software-ified…yet.&lt;/p&gt;

&lt;p&gt;Historically, we had split the product engineering teams into two (the names of the teams should foreshadow the problems we would end up having):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stock Management 2&lt;/strong&gt;: responsible for generating automated replenishment proposals and maintaining pre-purchase settings, and&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stock Management 1&lt;/strong&gt;: responsible for everything to do with purchase orders; over time, the responsibilities of maintaining EDI integration and store replenishment also fell on this team.&lt;/p&gt;

&lt;p&gt;The two teams had separate backlogs but shared the same Product Owner, and the responsibilities allocated to them grew…“organically”; that is to say, the allocation wasn’t always based on a team’s expertise and responsibility area, but mostly on who had the bandwidth and space available in their backlog to build something. Purely efficiency focussed (&lt;em&gt;how do we parallelise to get the most work done&lt;/em&gt;), not effectiveness focussed (&lt;em&gt;how do we organise to increase autonomy and expertise, and deliver the best outcomes for the business&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Because of this mindset, over time Stock Management 2 also took on responsibilities that would have been a better fit for Stock Management 1, e.g. they built a recommendation system on top of the purchase orders, something they had very little knowledge of. They ended up duplicating a lot of purchase order knowledge in this system – they had to – in order to create good recommendations. This also required replicating purchase order data in a different system, which would later create data consistency problems.&lt;/p&gt;

&lt;p&gt;As a result, dependencies grew in unstructured and unwanted ways, e.g. a lot of database sharing between the two teams and complex inter-service dependencies with multi-service hops required to resolve all the data needed for a given use case. The system architecture also grew “organically”, with little to no alignment with the business processes it supported, and the accidental complexity increased. Looking at the team names, no one could really tell what either team was responsible for, because what they were responsible for was neither well documented nor stable.&lt;/p&gt;

&lt;p&gt;We ended up operating in this unstructured way until July 2023.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger for Review
&lt;/h3&gt;

&lt;p&gt;The trigger to review our team boundaries came in Q1 2023, when we nearly made the mistake of combining the two teams into one single large team with joint scrum ceremonies, along with a proposal to add more process (LeSS) to manage this large team. None of it had taken into account the business capabilities the teams supported or the desired state architecture we wanted. It was clear that no research had been done into how the industry solves this problem, and that it was being approached purely from a management-convenience point of view.&lt;/p&gt;

&lt;p&gt;Large teams, especially in a context that supports multiple business processes, are a bad idea in many ways (some of these are not unique to large teams):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large teams are expen$ive; you’d often need more seniors on a large team in order to keep the technical quality high and technical debt low&lt;/li&gt;
&lt;li&gt;No real ownership or expertise of anything and no clear boundaries&lt;/li&gt;
&lt;li&gt;Team members are treated as feature factories instead of problem solving partners&lt;/li&gt;
&lt;li&gt;Output is favoured over outcomes, business value delivered is equated to story points completed&lt;/li&gt;
&lt;li&gt;Cognitive load and coordination/communication overhead increases&lt;/li&gt;
&lt;li&gt;Meetings become less effective and people tend to tune out &lt;em&gt;(I tend to doodle geometric shapes, it’s fun!😉)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The product loses direction and vision; it’s all about cramming in more features, which fuels the need to make the team even bigger. Because of course, more people will make you go faster…NOT!&lt;/li&gt;
&lt;li&gt;Often more process is required to “manage” large teams, which kills team motivation and autonomy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This achieves the exact opposite of agility, and we saw these degrading results when, for a brief period, we experimented with the large team idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Joint sessions were becoming difficult and inefficient to participate in (not everyone could or would join on time)&lt;/li&gt;
&lt;li&gt;Team members often walked away with completely different understandings and mental models, which got put into code 😱.&lt;/li&gt;
&lt;li&gt;There was often confusion about who was doing what, which increased the coordination overhead&lt;/li&gt;
&lt;li&gt;Given that historically the two teams had been separate, with their own coding and PR standards, there was often friction in resolving these conflicts, which slowed down delivery and reduced inter-team trust.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Communication overhead grows as number of people in the group increases&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The worst part of all of this is learned helplessness! We become so desensitised to our conditions that we accept the sub-optimal state as our new reality.&lt;/p&gt;

&lt;p&gt;So combining teams and adding more process wasn’t going to be the solution here, and it most certainly shouldn’t have been applied without involving the people whose work lives were about to be impacted, i.e. the engineering teams.&lt;/p&gt;

&lt;p&gt;These reorganisations should also not be done devoid of any alignment with the business process, because you risk the system architecture either not being fit for purpose or being too complex for the team(s) to handle, since all sorts of assumptions get baked into the design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team Topologies and Domain Driven Design
&lt;/h3&gt;

&lt;p&gt;I had a feeling that we needed to take a different approach here, and by this time I had been hearing a lot about &lt;a href="https://teamtopologies.com/" rel="noopener noreferrer"&gt;Team Topologies&lt;/a&gt;, so I bought the &lt;a href="https://www.amazon.com/Team-Topologies-Organizing-Business-Technology/dp/1942788819/ref=sr_1_1?crid=2UBW1RHA4KIFI&amp;amp;keywords=Team+Topologies&amp;amp;qid=1703427691&amp;amp;sprefix=team+topologie%2Caps%2C160&amp;amp;sr=8-1" rel="noopener noreferrer"&gt;book&lt;/a&gt; (highly recommended) and read it cover to cover…twice…to understand the core ideas in it. A lot of people know about &lt;a href="https://martinfowler.com/bliki/ConwaysLaw.html" rel="noopener noreferrer"&gt;Conway’s Law&lt;/a&gt;, but Team Topologies really brings the double-edged nature of Conway’s Law into focus. Ignore it at your own peril!&lt;/p&gt;

&lt;p&gt;This Comic Agile strip sums up how that realisation dawned on me after reading the Team Topologies book:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/pasted-image-20230928210406.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fpasted-image-20230928210406.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Check out more hilarious strips &lt;a href="https://www.comicagile.net/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, team and domain organisation in most companies has been done by the business, far removed from the engineering teams, meaning a critical perspective is missing from those discussions: &lt;em&gt;that of the system architecture&lt;/em&gt;. And because team design influences software design, many companies end up shooting themselves in the foot with unwieldy and misaligned software that delivers the opposite of agility. This is exactly why it’s crucial to have representation from engineering in these reorganisations. Just because something works doesn’t mean it’s not broken!&lt;/p&gt;

&lt;p&gt;By this time we had also conducted several &lt;a href="https://www.eventstorming.com/" rel="noopener noreferrer"&gt;event storming&lt;/a&gt; sessions for the core Supply sub-domain (for the entire purchase ordering flow) to identify critical domain events, possible bounded contexts and what we want our future state to be. I cannot emphasise enough how important this kind of event storming can be in helping surface complexity, potential boundaries and opportunities to improve on the current state.&lt;/p&gt;

&lt;p&gt;Putting Team Topologies and strategic DDD together to create deliberate team boundaries was just a no-brainer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/core-purchase-ordering-event-storm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fcore-purchase-ordering-event-storm.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Don’t worry, you are not meant to read the text, the identified boundaries are more important&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s also worth bearing in mind that this wasn’t a greenfield operation; we had existing software systems that had to be mapped onto some of the bounded contexts, at least until we could determine their ultimate fate. Some of the bounded contexts had to be drawn around those existing systems to keep the complexity from leaking out to other contexts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Brainstorming on New Team Design
&lt;/h3&gt;

&lt;p&gt;In May 2023, I, our development lead and our domain manager got to brainstorming on how we could organise our teams not only for efficiency but, this time crucially, also &lt;strong&gt;for effectiveness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In these discussions I presented the ideas of Team Topologies and insights from the event storms we had been doing. According to Team Topologies, team organisations can essentially be reduced to the &lt;a href="https://teamtopologies.com/key-concepts" rel="noopener noreferrer"&gt;following 4 topologies&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-6.png%3Fw%3D710"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Four fundamental topologies&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Based on these and my formative understanding, I presented the following team design options:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-7.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-7.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The 2 team model&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This model makes the Purchase Ordering team (stream aligned) solely responsible for the full purchase order lifecycle, including replenishment proposals (an automated way to create purchase orders). The Pre Purchase Settings team (platform team) would provide supporting services to the PO team (e.g. supplier connectivity and price &amp;amp; availability services, purchase price administration services, various replenishment settings services, etc.).&lt;/p&gt;

&lt;p&gt;Another model was this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-8.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-8.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The 3 team model&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the 3 team model, I split the replenishment proposals part out of the Purchase Ordering team, added to it the new actionable products capability we were working on, and created another stream aligned team: the Replenishment Optimisation team. The platform team would now provide supporting services to both of these stream aligned teams, and the new optimisation team would essentially provide decision making insights to the Purchase Ordering team.&lt;/p&gt;

&lt;p&gt;In a perfect world, you want to assign one team per bounded context and, as is evident from the event storm, we had several contexts. But Team Topologies also warns us to make sure the complexity of the work warrants a dedicated team; otherwise, you risk losing people to low motivation while still bearing the cost of creating multiple teams.&lt;/p&gt;

&lt;p&gt;Nevertheless, after taking into account practical constraints like money, complexity and team motivation, but perhaps &lt;strong&gt;most importantly&lt;/strong&gt; the impact of each design on the overall system architecture and what we wanted our desired state architecture to look like, we settled on the following cut:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-9.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-9.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Final team split&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Basically, at its core, the Purchase Order Decisions team will own all components that factor into purchasing decision making:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replenishment recommendation generation&lt;/li&gt;
&lt;li&gt;Purchase order creation and verification&lt;/li&gt;
&lt;li&gt;Actionable product insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the Purchase Order Management team will own all components that factor into the management of the lifecycle of submitted purchase orders &lt;em&gt;(I know “management” is a bit of a weasel word, but I am hoping over time we will be able to find a better name)&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purchase order submission&lt;/li&gt;
&lt;li&gt;Purchase order lifecycle management/adjustments (manual and system generated)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The central idea behind this split is that purchase order verification is a pivotal event in our event storm, and once a purchase order is verified, it will always be submitted. Submission is a key pre-condition to managing the pending purchase order lifecycle, and it has sufficient complexity due to the communication involved with suppliers and our own warehouse management system, so it makes sense for Purchase Order Management to own everything from submission onwards. This also makes them the sole owner of the purchase order database, which breaks the shared database anti-pattern and relies on asynchronous, event driven communication between the bounded contexts owned by the teams. The benefit of this is that we can establish clearer communication contracts and expectations without knowing, or needing to know, the internals of another context.&lt;/p&gt;
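
&lt;p&gt;As a rough illustration of what such a contract could look like, here is a hedged C# sketch of a domain event that the Purchase Order Decisions context might publish once a purchase order is verified, for the Purchase Order Management context to consume asynchronously &lt;em&gt;(the event name and fields are hypothetical, not our production schema)&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Collections.Generic;

// Hypothetical integration event published by Purchase Order Decisions after
// verification and consumed by Purchase Order Management. Only the data the
// consumer needs crosses the boundary; the internals of either context stay hidden.
public record PurchaseOrderVerified(
    Guid PurchaseOrderId,
    string SupplierId,
    DateTime VerifiedAtUtc,
    IReadOnlyList&amp;lt;PurchaseOrderLine&amp;gt; Lines);

public record PurchaseOrderLine(string ProductId, int Quantity, decimal UnitPrice);
&lt;/code&gt;&lt;/pre&gt;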

&lt;p&gt;In addition to this, we also identified several supporting capabilities/bounded contexts for which the complexity just wasn’t high enough to warrant a separate team entirely, at least for now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supplier price and availability retrieval&lt;/li&gt;
&lt;li&gt;EDI connection management&lt;/li&gt;
&lt;li&gt;Despatch advice forwarding&lt;/li&gt;
&lt;li&gt;E-mail based supplier communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These capabilities still had to be allocated between the two teams, so, based on whether they belonged more to the decision making part or the management part, we created the following allocations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supplier price and availability retrieval &lt;em&gt;(Purchase Order Decisions, because it’s only used whilst creating replenishment recommendations and subsequent purchase orders)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;EDI connection management, Despatch advice forwarding &lt;em&gt;(Purchase Order Management because they already owned this and it definitely didn’t make sense as a part of decision making flows)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Email based supplier communication &lt;em&gt;(Purchase Order Management because purchase order submission can happen via EDI or via E-mail so it makes sense for them to own all aspects of submission)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This brought the final design of teams to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/02/image-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage-2.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Final team cut with bounded contexts owned by each&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It might seem a bit excessive to assign multiple bounded contexts to a single team and, like I said, in a perfect world I would have one team be responsible for only one bounded context. But considering the constraints I mentioned before (cognitive load, complexity of the challenge and the financial cost of setting up many teams), I think this is a pragmatic choice for now. The identified bounded contexts are also not set in stone, so it’s entirely possible we might combine some of them into a single bounded context based on conceptual and linguistic cohesion. We might even split them out into dedicated teams should some bounded contexts grow complex enough to warrant separate teams.&lt;/p&gt;

&lt;p&gt;NB: A bounded context might not always mean a single deployment unit (i.e. a service or an application). A single BC can map to one or more related services if the rates of change, fault tolerance requirements and deployment frequencies dictate as much. The single most important thing about BCs is that they encapsulate a single distinct business concept with a consistent business language and consistent meanings of terms, so it’s perfectly plausible that there are good drivers for splitting one BC into multiple deployment units.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/02/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage-1.png%3Fw%3D1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Some heuristics for determining bounded contexts&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Go Live!
&lt;/h3&gt;

&lt;p&gt;In June 2023 we presented this design to both teams and asked for feedback; both teams could see the value of the split because it created better ownership boundaries and better focus, and offered an opportunity to reduce the cognitive overhead of communicating within a large team. So in July 2023 we put the new team organisation live, made all the administrative changes (changing the team names in the HR systems and Slack channels, assigning the right teams to code repositories based on the allocations, etc.) and got to work in the new set-up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflection
&lt;/h3&gt;

&lt;p&gt;Whilst this team organisation is definitely the best we’ve ever had in terms of cleaner ownership boundaries, relatively appropriate allocation of cognitive load, and a better sense of purpose and autonomy, it’s by no means the best &lt;strong&gt;we will ever have&lt;/strong&gt;. The most important thing about agility is continuous improvement, and DDD tells us that there is no single best model, so it only makes sense that we revisit these designs regularly and seize any opportunities for improvement along any of those axes to stay aligned with the business and deliver value effectively. The organisation and the domain never stay the same; they grow in complexity, so it’s crucial for engineering teams to evolve along with them in order to stay efficient and effective themselves, and for the architecture to stay in alignment with the business. I loosely equate teams and organisations to living organisms that self-organise, a bit like cellular mitosis; it’s the natural order of things.&lt;/p&gt;

&lt;p&gt;Of course things are not perfect; both teams still have some degree of functional coupling, i.e. if the model of the purchase order changes fundamentally, or if we need to support new purchase order types, both teams will need to change their systems and coordinate to some extent. This is a trade-off of this team design option, but the teams are still largely autonomous and communicate asynchronously for the most part. Any propagation of model changes can still be limited by the use of appropriate anti-corruption layers on either side.&lt;/p&gt;
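
&lt;p&gt;To sketch that idea (with hypothetical names, reusing the &lt;code&gt;PurchaseOrderVerified&lt;/code&gt; event from the earlier sketch), an anti-corruption layer on the Purchase Order Management side could translate the published model into its own local model, so that upstream model changes stay contained at the boundary:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;

// Local model owned by Purchase Order Management (hypothetical).
public record PendingPurchaseOrder(Guid Id, string SupplierId);

// Hypothetical anti-corruption layer: it translates the upstream event into
// the local model, so a change to the Decisions model only affects this class.
public static class VerifiedPurchaseOrderTranslator
{
    public static PendingPurchaseOrder ToLocalModel(PurchaseOrderVerified upstream) =&amp;gt;
        new PendingPurchaseOrder(upstream.PurchaseOrderId, upstream.SupplierId);
}
&lt;/code&gt;&lt;/pre&gt;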

&lt;p&gt;One of the other significant benefits of this deliberate reorganisation is that in both teams we created a north star roadmap for the desired state architecture, because for a long time both teams had incurred unwarranted technical complexity in the form of arbitrarily created services with mixed programming paradigms, which were getting difficult for a small team to maintain. Contract coupling at multiple service integration points made the smallest of changes ripple out to multiple systems that had to be changed in a specific order to deploy safely (we’ve had outages in the past because we forgot to update the contracts consistently).&lt;/p&gt;

&lt;p&gt;As a part of our new engineering roadmap, we are now reviewing these services with a strategic DDD eye and asking, “what business capability does this service provide?”; if the answer is similar for two services and none of the benefits of &lt;a href="https://martinfowler.com/microservices/" rel="noopener noreferrer"&gt;microservices&lt;/a&gt; are to be gained, then those two services will be combined into a single modular monolith. Some services will not make sense in the new organisation, so they will be decommissioned and the communication pathways simplified. We project a potential reduction of 40% in the complexity of the overall system landscape because of these changes (and hopefully some money savings as well); at the very least, the complexity will be better contained. But perhaps most importantly, we aim to make the architectural complexity fit the cognitive bandwidth of the teams and to ensure a team can own its flow end to end.&lt;/p&gt;

&lt;p&gt;Another thing we will be working on next is strengthening our boundaries with dependent teams; historically the e-commerce database has been shared with all the teams in Coolblue, and this creates challenges (a subject for another post). Going forward we will be improving our web services and events portfolio so that dependents can use our service contracts to communicate with our systems instead of sharing databases. With a better sense of what we do and don’t own, I expect these interfaces to become crisper over time.&lt;/p&gt;

&lt;p&gt;These kinds of reorganisations can have a long maturity cycle before it becomes clear whether the decisions and team boundaries were the right ones, and organising teams is just the first, though a significant, step. The key is in keeping the discussion going and being deliberate about our system design decisions to ensure that business domains and system design stay in alignment. To that end we will continue investing in Domain Driven Design practices so that business and engineering can collaborate effectively to create systems that better reflect domain expectations whilst keeping the complexity low and maintaining acceptably high levels of fault tolerance and autonomy of value delivery.&lt;/p&gt;

</description>
      <category>teamtopologies</category>
      <category>domaindrivendesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>An Exercise in Domain Modeling Guided By Strategic Domain Driven Design – Part 2</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 10 Jan 2024 12:05:38 +0000</pubDate>
      <link>https://forem.com/explorer14/an-exercise-in-domain-modeling-guided-by-strategic-domain-driven-design-part-2-13e5</link>
      <guid>https://forem.com/explorer14/an-exercise-in-domain-modeling-guided-by-strategic-domain-driven-design-part-2-13e5</guid>
      <description>&lt;p&gt;In &lt;a href="https://amanagrawal.blog/2024/01/05/an-exercise-in-domain-modeling-guided-by-strategic-domain-driven-design-part-1/"&gt;part 1&lt;/a&gt; I showed how I leveraged event storming to map out the whole flow for my personal financial health maintenance “domain”, identified major sub-domains and mapped them to bounded contexts, drilled in deeper into process and design level event storming to enrich the storm with commands, read models, policies and systems.&lt;/p&gt;

&lt;p&gt;In this final part, I will review the current domain model for the Budgeting and Expense Tracking bounded context, explore alternatives and make some model improvements keeping in mind the outcome of the design level event storm. Finally I will end with some DDD takeaways that should be applicable generally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budgeting and Expense Tracking Domain Model with Aggregate
&lt;/h3&gt;

&lt;p&gt;From the process level event storm, three different &lt;em&gt;things&lt;/em&gt; emerged: budget, pot and expense. A budget is opened first, then pots are allocated in that budget, then expenses are recorded (or reversed) against those pots, and eventually the budget reaches its end date and is closed, all based on some constraints that must be met. Each budget has a lifecycle of about a month (called a “Period”) and at the end of that period no more changes are allowed to the budget, i.e. it, and all the pots and expenses in it, become immutable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Pot” is just the name I chose for expense categories because in ancient times money would often be &lt;a href="https://www.google.com/search?sca_esv=596027011&amp;amp;q=ancient+money+pot&amp;amp;tbm=isch&amp;amp;source=lnms&amp;amp;sa=X&amp;amp;ved=2ahUKEwj8i7mQhseDAxUr_bsIHRR8BMAQ0pQJegQICxAB&amp;amp;biw=1280&amp;amp;bih=551&amp;amp;dpr=1.5"&gt;stored in earthen pots and jars&lt;/a&gt; – just my ubiquitous language&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The flow is strongly unidirectional, i.e. open budget -&amp;gt; pot -&amp;gt; expense -&amp;gt; close budget, where each step depends on the step before it and the constraints in each step often come from the step before it. So even though the terminology switches from budget to pots to expenses, because of the strongly interdependent and related nature of the flow it makes sense to keep all of those steps in one bounded context.&lt;/p&gt;

&lt;p&gt;Based on this thinking I created an &lt;strong&gt;aggregate&lt;/strong&gt; that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-14.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhwhwqT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-14.png%3Fw%3D477" alt="" width="477" height="706"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Current domain model&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A Period can allocate one or more Pots, and one or more Expenses can be added or reversed against each Pot. Period is at the root of this aggregate, meaning that any and all changes to the state of the constituent entities can only be made via the &lt;strong&gt;aggregate root&lt;/strong&gt;. As an example:&lt;br&gt;
&lt;/p&gt;
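
&lt;p&gt;Here is a minimal, simplified C# sketch of that delegation &lt;em&gt;(member names are hypothetical and this is not the full implementation)&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

public class Period // the aggregate root (later renamed to Budget)
{
    private readonly List&amp;lt;Pot&amp;gt; pots = new();

    public void AllocatePot(Pot pot) =&amp;gt; pots.Add(pot);

    // The root is the only public entry point: it finds the right pot,
    // delegates the actual recording to it and can then run any post-conditions.
    public void AddExpense(Guid potId, decimal amount, DateTime expenseDate)
    {
        var pot = pots.SingleOrDefault(p =&amp;gt; p.Id == potId)
            ?? throw new InvalidOperationException("Pot is not allocated in this period");

        pot.RecordExpense(amount, expenseDate);
    }
}

public class Pot
{
    private readonly List&amp;lt;Expense&amp;gt; expenses = new();

    public Pot(Guid id) =&amp;gt; Id = id;

    public Guid Id { get; }

    // Marked internal so callers cannot bypass the aggregate root.
    internal void RecordExpense(decimal amount, DateTime expenseDate) =&amp;gt;
        expenses.Add(new Expense(amount, expenseDate));
}

public record Expense(decimal Amount, DateTime Date);
&lt;/code&gt;&lt;/pre&gt;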


&lt;h4&gt;
  
  
  Role of an Aggregate
&lt;/h4&gt;

&lt;p&gt;A &lt;a href="https://martinfowler.com/bliki/DDD_Aggregate.html"&gt;DDD aggregate&lt;/a&gt; is often a confusing and misinterpreted idea, and &lt;a href="https://www.google.com/search?q=ddd+aggregate&amp;amp;oq=DDD+agg&amp;amp;gs_lcrp=EgZjaHJvbWUqBwgAEAAYgAQyBwgAEAAYgAQyBwgBEAAYgAQyBggCEEUYOTIHCAMQABiABDIHCAQQABiABDIGCAUQRRg8MgYIBhBFGEEyBggHEEUYPKgCALACAA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8#ip=1"&gt;much&lt;/a&gt; has been said and written about it, so I am not going to redefine it, but I will attempt to describe how I understand it.&lt;/p&gt;

&lt;p&gt;An aggregate is a logically related cluster of objects that exposes certain domain behaviour(s) and enforces &lt;a href="https://brilliant.org/wiki/invariant-principle-definition/"&gt;invariants&lt;/a&gt; i.e. &lt;strong&gt;constraints that should hold true for all objects in a group, for all state changes done to them. These are checks that prevent invalid state transitions for a given object.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It also allows me to reduce the exposed API surface area of the aggregate and minimise out of band changes that could compromise the integrity or validity of the aggregate. The root object doesn’t have to do everything itself or know how everything will be done; it delegates to the child objects in its graph. For example, the Period object doesn’t know or care exactly &lt;strong&gt;how&lt;/strong&gt; an expense is added; it delegates that to the Pot object. The Pot object knows how to record the expense correctly and, to ensure no shortcuts around the Period object are possible, methods in the Pot object are marked as internal. Once an expense is successfully added, i.e. no exception gets thrown, the root object will then execute post-conditions, e.g. updating the remaining allocation on the pot or anything else it might need to do. Bypassing the root would compromise these checks and the integrity of the aggregate. &lt;/p&gt;

&lt;p&gt;Some of the most useful design heuristics/constraints for an aggregate that I have come to understand are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It should have invariants that need to be enforced for the whole aggregate, e.g. if I have one or more expenses recorded against a pot, I am not allowed to deallocate that pot or I am not allowed to add an expense if its date is outside the period date range. So these become invariants that I will enforce via the Period aggregate root because it is the only entry point into the aggregate.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another example of this could be a purchase order with many purchase order lines: say the overall value of a purchase order should be limited to some threshold based on what type of product is being ordered, and you want to enforce this limit. Then the PO aggregate will be retrieved each time you want to add or modify a line, to make sure the invariant of the maximum value limit &lt;em&gt;(price x quantity summed across all the lines)&lt;/em&gt; is enforced, and the change rejected if the invariant is broken. In this case we are trading off performance for correctness and reliability.  &lt;/p&gt;
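
&lt;p&gt;A hedged C# sketch of that purchase order invariant (the names and the threshold are made up) to show where the enforcement lives:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

public class PurchaseOrder
{
    private readonly List&amp;lt;OrderLine&amp;gt; lines = new();
    private readonly decimal maxOrderValue;

    public PurchaseOrder(decimal maxOrderValue) =&amp;gt; this.maxOrderValue = maxOrderValue;

    // Aggregate-spanning invariant: the total value across *all* lines must stay
    // within the threshold, so the whole aggregate is loaded and checked here.
    public void AddLine(string productId, int quantity, decimal unitPrice)
    {
        var newTotal = lines.Sum(l =&amp;gt; l.Quantity * l.UnitPrice) + quantity * unitPrice;
        if (newTotal &amp;gt; maxOrderValue)
            throw new InvalidOperationException("Purchase order value limit exceeded");

        lines.Add(new OrderLine(productId, quantity, unitPrice));
    }
}

public record OrderLine(string ProductId, int Quantity, decimal UnitPrice);
&lt;/code&gt;&lt;/pre&gt;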

&lt;p&gt;It’s also possible that there are no aggregate spanning invariants like the value one here, but only local invariants specific to an object in the aggregate, e.g. &lt;em&gt;price corrections on a given purchase order line are only allowed within the range of $10 to $100&lt;/em&gt;. In this case it might not make sense to pay the cost of retrieving a potentially large aggregate, so it might be more acceptable to retrieve just the target line and make changes to it. You still want to make sure that the line object is responsible for making changes to its own state based on these constraints; we are only foregoing the one-to-many nature of a purchase order for performance reasons, not foregoing encapsulation of state and behaviour on each object.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An aggregate must be transactionally consistent, i.e. all changes made to the aggregate in a single flow/use case MUST all be persisted successfully or all fail (atomicity), otherwise the aggregate will have data consistency issues. Using our purchase order aggregate example, if I add/modify multiple purchase order lines for a given purchase order, I expect all these changes to succeed or all of them to fail, so that the end state of the system is consistent. The key influencing factor for this requirement is preserving the symmetry between invariant enforcement and the ultimate persistence of valid changes. In other words, if all the invariants for the modified objects have been satisfied, then the changes should all be made durable successfully or all fail; the invariant enforcement is of little value if the end state of the system is still inconsistent such that subsequent invariant checks might fail again.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can always break apart the aggregate where transactional consistency and invariant enforcement are not needed, and accept eventual consistency instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Because the whole aggregate needs to be retrieved from persistence in order to enforce invariants, aggregates must be small enough not to impact data access performance negatively. Just because there is a one-to-many relation between objects/entities doesn’t always mean it makes a good aggregate, e.g. order and order lines have a natural one-to-many relation, but an order can have thousands of order lines, which could make it a poor aggregate from a performance pov. Pulling thousands of objects (or more) from the database every single time could hurt the performance of the system for no benefit to the consistency of the model. Yet, if there are aggregate wide invariants to be enforced we might take the performance hit; it’s a trade-off, though at that point we might also want to improve the design of the aggregate.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course persistence becomes trickier the more objects in the aggregate are modified in one transaction; you’d need to know what changed in order to commit the changes atomically and correctly. Most ORMs, like Entity Framework, can do this change tracking heavy lifting for you, which eases the persistence concerns, with the trade-off that you become dependent on an external framework for your domain model to work efficiently. This is also why small aggregates are preferred.&lt;/p&gt;
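
&lt;p&gt;As a rough illustration (assuming EF Core, with a hypothetical &lt;code&gt;BudgetDbContext&lt;/code&gt; and navigation properties), loading the whole aggregate, mutating it through the root and letting the ORM’s change tracking persist everything in one go might look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class ExpenseService
{
    private readonly BudgetDbContext db; // hypothetical EF Core DbContext

    public ExpenseService(BudgetDbContext db) =&amp;gt; this.db = db;

    public async Task AddExpenseAsync(Guid budgetId, Guid potId, decimal amount, DateTime date)
    {
        // Load the full aggregate so invariants can be checked against all of it.
        var budget = await db.Budgets
            .Include(b =&amp;gt; b.Pots)
            .ThenInclude(p =&amp;gt; p.Expenses)
            .SingleAsync(b =&amp;gt; b.Id == budgetId);

        budget.AddExpense(potId, amount, date);

        // EF Core's change tracker persists all modified objects atomically.
        await db.SaveChangesAsync();
    }
}
&lt;/code&gt;&lt;/pre&gt;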

&lt;ul&gt;
&lt;li&gt;An aggregate doesn’t necessarily have to mean one object containing an &lt;code&gt;IList&lt;/code&gt; of other objects, so to speak; it could also be a composition of smaller singular objects. The whole object is still an aggregate if it matches the above criteria.&lt;/li&gt;
&lt;li&gt;Two aggregates are independent of each other from a transactional consistency pov, i.e. each can fail independently. Each is immediately consistent with respect to itself but eventually consistent with respect to the other.&lt;/li&gt;
&lt;li&gt;An aggregate is only allowed to refer to another aggregate by the id of its root; in other words, it’s not allowed a direct reference to the internal objects of another aggregate. This is because if those internals change, that can leave multiple objects in the system broken or inconsistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reviewing the Period Aggregate
&lt;/h3&gt;

&lt;p&gt;The first thing that I avoided pointing out until now is the mismatch in the naming of the aggregate root between the event storm and the actual aggregate model. The event storm captured the ubiquitous language and used the term “budget”, whereas in code I used the very generic and obscure term “period”. So that’s &lt;strong&gt;Improvement number 1&lt;/strong&gt; that I made. It required quite some refactoring across the whole system as I made the names consistent everywhere, but this was an important change.&lt;/p&gt;

&lt;p&gt;From here on out I will use the ubiquitous term “budget” to indicate the finite time period within which I create pots and record expenses.&lt;/p&gt;

&lt;p&gt;Next, I checked this budget aggregate against the constraints that I laid out above:&lt;/p&gt;

&lt;h4&gt;
  
  
  Invariants
&lt;/h4&gt;

&lt;p&gt;There is a global invariant that applies to all state transitions within the aggregate: the budget should still be active (i.e. not closed) at the time of the desired state change. Then there are behaviour specific invariants (a couple of which are sketched in code after this list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Opening a budget&lt;/strong&gt;: dates must be in the future, not the past&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Closing a budget&lt;/strong&gt;: a budget can only be closed a day prior to its end date, and only if it’s not already closed. Like I said before, closed budgets are immutable, i.e. I cannot allocate pots or add expenses to them. They are done!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allocating a pot/deallocating a pot&lt;/strong&gt;: a pot cannot be deallocated if it has one or more expenses against it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recording expenses:&lt;/strong&gt; the expense date for all expenses should fall within the budget date range and the expense amount shouldn’t be negative. The remaining allocation for the pot should be reduced by the expense amount such that &lt;code&gt;remaining allocation + sum(expenses in this pot) = total allocation&lt;/code&gt; for that pot; this is also potentially an invariant because it indicates a relation, or a constraint at the pot level, that should hold true for all expenses in that pot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reversing expenses:&lt;/strong&gt; same as recording expenses from the pov of updating the remaining allocation, so an invariant at the Pot level.&lt;/li&gt;
&lt;/ol&gt;
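
&lt;p&gt;A hedged sketch of the first two of these invariants in C# &lt;em&gt;(simplified, hypothetical member names; the real model has more going on)&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;

public class Budget
{
    public DateTime StartDate { get; }
    public DateTime EndDate { get; }
    public bool IsClosed { get; private set; }

    // Invariant 1: a budget can only be opened for dates in the future.
    public Budget(DateTime startDate, DateTime endDate)
    {
        if (startDate &amp;lt;= DateTime.Today || endDate &amp;lt;= startDate)
            throw new InvalidOperationException("Budget dates must be in the future");

        StartDate = startDate;
        EndDate = endDate;
    }

    // Invariant 2: closing is only allowed from a day prior to the end date,
    // and only if the budget is not already closed.
    public void Close(DateTime today)
    {
        if (IsClosed)
            throw new InvalidOperationException("Budget is already closed");
        if (today &amp;lt; EndDate.AddDays(-1))
            throw new InvalidOperationException("Budget cannot be closed yet");

        IsClosed = true;
    }
}
&lt;/code&gt;&lt;/pre&gt;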

&lt;h4&gt;
  
  
  Transactional Consistency
&lt;/h4&gt;

&lt;p&gt;All the above state changes are persisted atomically to ensure consistency, but each state change typically only involves a single object in the aggregate. This is because the changes are not all made in one go: budget opening, pot allocation, expense recording etc. can’t happen together, because I may not know beforehand which pots I will allocate, and I will definitely not know beforehand what expenses I will make. These operations are spread out over a period of a month or more. So the scope of the transaction is often restricted to a single object at a time, which keeps transactions and database locks short and quick.&lt;/p&gt;

&lt;p&gt;These kinds of insights come from talking to the domain experts and understanding the business process. As developers we’ve been indoctrinated into ACID properties, so we are driven to force ACID on every little operation. If the business process is eventually consistent or can be eventually consistent without negative impact to the business, don’t force ACID on it. &lt;em&gt;If all you have is a hammer&lt;/em&gt;…and all that!&lt;/p&gt;

&lt;p&gt;As another example of applying appropriate technical solutions to a business problem instead of forcing out-of-band technical decisions on the business, my domain model has no concurrency handling mechanism implemented for the aggregate. This is because there are only 2 users, so it’s easy to ensure only one is using the system at a time to avoid contention. Granted, nobody else gives a shit about my application and it’s a tiny little pet application, but then that’s all the more reason for me to go crazy and implement all sorts of cool things as an engineer. I don’t!&lt;/p&gt;

&lt;h4&gt;
  
  
  Aggregate Size
&lt;/h4&gt;

&lt;p&gt;How about the size of the aggregate? Well, for my personal expense tracking application, a budget can realistically only contain less than a dozen or so pots (p) and each pot will typically contain fewer than a hundred expenses, often even fewer than 50 (e), so retrieving the whole aggregate means pulling roughly &lt;code&gt;500 &amp;lt;= p*e &amp;lt;= 1000&lt;/code&gt; objects. From experience of running the application for the past 5 years with this model, I can say that retrieval performance has not been a problem. Most queries complete in single digit milliseconds.&lt;/p&gt;

&lt;p&gt;Of course, there are no hard limits on these, so if “p” and/or “e” increase enough for performance to become a bottleneck, I will have to remodel. If read performance needs to be improved I might create dedicated read-only models, with the possibility that they will be eventually consistent. Once again it’s a trade-off between: &lt;em&gt;slow and immediately consistent vs fast and eventually consistent&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternate Models I Considered
&lt;/h3&gt;

&lt;p&gt;There is almost never just one true model that will always work for all kinds of problems; there are usually multiple models that we need to consider to identify which one fits our problem best.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  “Nothing is more dangerous than an idea when it is the only one you have.”
&lt;/h1&gt;

&lt;p&gt;&lt;cite&gt;&lt;strong&gt;Emile Chartier Alain&lt;/strong&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To that end, let me sketch out a couple of alternative models that I considered, and see how the dynamics of invariants and transactional consistency look for each:&lt;/p&gt;

&lt;h4&gt;
  
  
  Budget and Pot Aggregates
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/01/image.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4t5bWbBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2024/01/image.png%3Fw%3D1024" alt="" width="800" height="527"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Separate Budget and Pot aggregates&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this model, budget and pot are separate aggregates because most of the budget level state changes are constrained by invariants in the budget itself; they have no bearing on the rest of the aggregate. So separating budget into its own, albeit very small, aggregate might make sense.&lt;/p&gt;

&lt;p&gt;Pot and expenses seem more cohesive together; there are invariants (like the deallocation one) where a pot needs to know how many expenses it has recorded, so it might make sense to put them together into another aggregate.&lt;/p&gt;

&lt;p&gt;However, pot (de)allocation and expense recording also require the budget to be active (the global invariant), but since the aggregates are now separated, the most Pot can do is reference the BudgetId (the id of the aggregate root) and hope that the budget is active when the state changes are performed. There is no easy way to guarantee that the referenced budget is active and, what’s worse, the invariant enforcement starts to move out of the objects and into some kind of service class (see example below). The more we break up the aggregate, the worse this leakage gets.&lt;br&gt;
&lt;/p&gt;
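
&lt;p&gt;A rough sketch of that leakage (hypothetical repository and member names): with &lt;code&gt;Budget&lt;/code&gt; and &lt;code&gt;Pot&lt;/code&gt; as separate aggregates, an application service has to stitch the checks together itself before it can touch the pot:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Threading.Tasks;

// Split-model shapes assumed for this sketch.
public record Budget(Guid Id, DateTime StartDate, DateTime EndDate, bool IsClosed);

public class Pot
{
    public Guid Id { get; init; }
    public Guid BudgetId { get; init; }
    public void RecordExpense(decimal amount, DateTime date) { /* local checks only */ }
}

public interface IBudgetRepository { Task&amp;lt;Budget&amp;gt; GetByIdAsync(Guid id); }
public interface IPotRepository
{
    Task&amp;lt;Pot&amp;gt; GetByIdAsync(Guid id);
    Task SaveAsync(Pot pot);
}

// The "budget must be active" invariant can no longer live inside one aggregate
// root; the service loads and checks both, and the two aggregates can fail
// independently of each other.
public class ExpenseRecordingService
{
    private readonly IBudgetRepository budgets;
    private readonly IPotRepository pots;

    public ExpenseRecordingService(IBudgetRepository budgets, IPotRepository pots)
    {
        this.budgets = budgets;
        this.pots = pots;
    }

    public async Task RecordExpenseAsync(Guid budgetId, Guid potId, decimal amount, DateTime date)
    {
        var budget = await budgets.GetByIdAsync(budgetId);
        if (budget is null || budget.IsClosed)
            throw new InvalidOperationException("Budget is not active");

        var pot = await pots.GetByIdAsync(potId);
        if (pot is null || pot.BudgetId != budgetId)
            throw new InvalidOperationException("Pot does not belong to this budget");

        pot.RecordExpense(amount, date); // the invariant checks now live out here
        await pots.SaveAsync(pot);
    }
}
&lt;/code&gt;&lt;/pre&gt;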


&lt;p&gt;For comparison’s sake, consider the service class for the currently implemented budget aggregate model, for adding a new expense. Notice the reduced branching complexity, because all the major invariants are enforced inside the Budget object:&lt;br&gt;
&lt;/p&gt;
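
&lt;p&gt;In the same hedged spirit, a sketch of that service (assuming the single budget aggregate sketched earlier, with &lt;code&gt;AddExpense&lt;/code&gt; on the root; names are illustrative) collapses to almost no branching:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Threading.Tasks;

public interface IBudgetRepository
{
    Task&amp;lt;Budget&amp;gt; GetByIdAsync(Guid id);
    Task SaveAsync(Budget budget);
}

// With Budget as the single aggregate root, the service just loads the
// aggregate and delegates; all the invariant checks live inside Budget.
public class ExpenseRecordingService
{
    private readonly IBudgetRepository budgets;

    public ExpenseRecordingService(IBudgetRepository budgets) =&amp;gt; this.budgets = budgets;

    public async Task RecordExpenseAsync(Guid budgetId, Guid potId, decimal amount, DateTime date)
    {
        var budget = await budgets.GetByIdAsync(budgetId);

        // The root rejects the change if the budget is closed, the date is out of
        // range, the amount is negative or the pot is not allocated in this budget.
        budget.AddExpense(potId, amount, date);

        await budgets.SaveAsync(budget);
    }
}
&lt;/code&gt;&lt;/pre&gt;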


&lt;p&gt;Performance wise it also doesn’t win much, since in order to use the system effectively the whole budget (with pots and expenses) has to be retrieved anyway. Also, if the number of expenses is sufficiently high, even with a smaller aggregate, retrieving expenses will still be slow. Sure, we could split the retrieval across all 3 objects and do on-demand loading with paging etc., but this model still leaves something to be desired. The interaction complexity between the aggregates goes up, with eventual consistency being introduced in a place where it doesn’t naturally fit &lt;em&gt;(see the event storm)&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Budget and Expense Aggregates
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/01/image-1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fl0NLitY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2024/01/image-1.png%3Fw%3D1024" alt="" width="800" height="537"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Separate Budget and Expense Aggregates&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This model is similar but gets even more complex: in order to make sure an expense is recorded against the right active budget and pot, the expense aggregate needs to reference both &lt;code&gt;PotId&lt;/code&gt; and &lt;code&gt;BudgetId&lt;/code&gt;. In order to satisfy the deallocation invariant, &lt;code&gt;Pot&lt;/code&gt; needs to keep a collection of &lt;code&gt;ExpenseId&lt;/code&gt;s. The same problem of &lt;strong&gt;leaking invariant enforcement&lt;/strong&gt; still exists in this model, just a lot worse, and since the transactional consistency boundaries are split up, both these aggregates can fail independently, which could introduce data consistency issues where there were none.&lt;/p&gt;
&lt;h4&gt;
  
  
  Budget, Pot and Expense Aggregates
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/01/image-2.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MDjobj78--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2024/01/image-2.png%3Fw%3D1024" alt="" width="800" height="624"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Each object is its own aggregate&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this model, each object is its own aggregate referencing the others’ ids, with transactional consistency limited to one object at a time and eventual consistency across aggregates. Invariant enforcement leaks fully out into the service class, because the domain objects don’t have much behaviour to speak of and not enough state to do the enforcement, meaning a lot of communication and coordination overhead between aggregates. This makes the code a lot more difficult to reason about; it’s almost akin to the &lt;a href="https://www.michaelnygard.com/blog/2017/12/the-entity-service-antipattern/"&gt;entity service anti-pattern&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Whilst it’s entirely possible to adopt any one of these models to solve the problem of budgeting and expensing, and none of them are necessarily wrong, each of these alternatives makes invariant enforcement more difficult and complex, and that defeats the point of an aggregate. If these objects don’t have any behaviour, nor are there any invariants to be enforced, then they could simply be treated as anaemic entities in a CRUD style application using the &lt;a href="https://martinfowler.com/eaaCatalog/transactionScript.html"&gt;TRANSACTION SCRIPT pattern&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️⚠️⚠️An &lt;a href="https://martinfowler.com/bliki/AnemicDomainModel.html"&gt;anaemic model&lt;/a&gt; should almost never be the starting point; effort should instead be put into identifying behaviour and invariants and capturing those in an expressive domain model. Only when it’s abundantly clear that no behaviour or invariants exist, and it’s just data shovelling, should an anaemic model be adopted.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Improvements Made to the Budget Aggregate
&lt;/h4&gt;

&lt;p&gt;Having gone through the alternatives and seen their weaknesses, I decided not to change the current model, which fits the problem at hand better. However, this whole exercise made me challenge all my assumptions about the system, the domain and the practice of DDD, so I did come away with improvements that I made to the budget aggregate model.&lt;/p&gt;
&lt;h5&gt;
  
  
  Improvement 1: Make the domain model follow the ubiquitous language
&lt;/h5&gt;

&lt;p&gt;As mentioned before, this meant renaming &lt;code&gt;Period&lt;/code&gt; to &lt;code&gt;Budget&lt;/code&gt; and making other changes to align with the ubiquitous language discovered in the event storm. Thinking about it now, I should have stuck with &lt;code&gt;Budget&lt;/code&gt; all along; I am not really sure why I chose &lt;code&gt;Period&lt;/code&gt; back in the day to refer to a budget, it’s very generic and vague.&lt;/p&gt;

&lt;p&gt;One other improvement I can see, based on DDD recommendations, is to use &lt;code&gt;MonthlyBudget&lt;/code&gt; instead of &lt;code&gt;Budget&lt;/code&gt;: it makes the lifecycle of this object explicit, and when opening a new monthly budget the correctness can be enforced by controlling how the dates get assigned. One for later!&lt;/p&gt;
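&lt;p&gt;As a minimal sketch of what I mean (illustrative only, this is not in the codebase yet), the dates would be derived from the month rather than supplied by the caller:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public sealed class MonthlyBudget
{
    public Guid Id { get; }
    public DateOnly StartDate { get; }
    public DateOnly EndDate { get; }

    private MonthlyBudget(Guid id, DateOnly start, DateOnly end)
    {
        Id = id;
        StartDate = start;
        EndDate = end;
    }

    // The only way to open a budget: the period is derived from the month,
    // so an invalid budget period simply cannot be constructed.
    public static MonthlyBudget Open(int year, int month)
    {
        var start = new DateOnly(year, month, 1);
        var end = start.AddMonths(1).AddDays(-1);
        return new MonthlyBudget(Guid.NewGuid(), start, end);
    }
}
&lt;/code&gt;&lt;/pre&gt;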
&lt;h5&gt;
  
  
  Improvement 2: Get rid of fake invariants
&lt;/h5&gt;

&lt;p&gt;First off, the purpose of the “remaining allocation” invariant was not clear. Looking at the code, the only place it was used was in the frontend applications, for display purposes. There is no other case that I discovered in the event storm that highlighted the need for such an invariant. What’s more, this computed value was stored in the database so that it wouldn’t have to be computed again and again.&lt;/p&gt;

&lt;p&gt;Calculating this value is not complex or performance intensive but, more importantly, I shouldn’t be creating aggregated values for display purposes in the domain model. The domain model should be presentation agnostic; this value can be calculated at the edges of the system (REST API or frontend), which allows me to get rid of this fake invariant from the aggregate. So that’s another change I made: I got rid of the remaining allocation “invariant”, updated the frontend to calculate the value from the raw allocation and spend values, removed the corresponding properties from the aggregate entirely and, ultimately, dropped the column from the database table. Good win!&lt;/p&gt;

&lt;p&gt;The current model also computed the total saving after closing the budget and stored (or posted) these savings back to the database. Once again, this responsibility was better handled in the Summarisation bounded context, because the total savings for closed budgets were only used to display the closed budget summary on the frontend. So that’s another bit of accidental complexity I removed from the aggregate, and another database column dropped.&lt;/p&gt;
&lt;h5&gt;
  
  
  Improvement 3: Make Bounded Contexts Clear in Modular Monolith
&lt;/h5&gt;

&lt;p&gt;The next thing that hit me looking at the event storm was that there were 3 clear bounded contexts, yet in the monolithic code it was difficult to find them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/01/image-3.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y3RHojD7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2024/01/image-3.png%3Fw%3D503" alt="" width="503" height="613"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Solution structure (before)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bounded context boundaries often indicate module boundaries, and in a monolithic system it’s important to surface these modules to make the boundaries clearer. This allows for future evolution into a more distributed design where each of these modules could become an independent deployment unit, should the need arise. Taking a first crack at making the monolith more modular, I created the following structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/01/image-4.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lim-YmFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2024/01/image-4.png%3Fw%3D343" alt="" width="343" height="568"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;High level modules i.e. bounded contexts (after)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ok, API and Frontend are not strictly speaking modules, but they do represent the entry and exit points of the system. The Budgeting and User Administration modules map to the bounded contexts in the event storm. The one “missing” is the Summarisation module; for now I decided to keep it co-located within the Budgeting module, and I shall pull it up to the top soon.&lt;/p&gt;

&lt;p&gt;Another benefit of this modularisation is that when I am working on one module, I don’t have the other modules in my way. It’s literally harder to navigate to a whole other project to get to another module than it is to navigate to another namespace in the same project: more clicks! 😉. I can therefore focus on the one module I am working on and keep all of its context in my head.&lt;/p&gt;

&lt;p&gt;I also made these modules almost plug and play by introducing DI hooks: that way, if I don’t register a module in my main API, it won’t be used. Each DI hook takes responsibility for registering all the components required for that module to work.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
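&lt;p&gt;As a rough sketch of what such a DI hook looks like (the type names here are illustrative, not the exact ones from my solution), each module exposes one extension method that registers everything it needs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using Microsoft.Extensions.DependencyInjection;

public static class BudgetingModule
{
    // The Budgeting module's DI hook: the only thing the host API needs to know about.
    public static IServiceCollection AddBudgetingModule(
        this IServiceCollection services, string connectionString)
    {
        // Everything the module needs is registered here, so the host
        // stays ignorant of the module's internals.
        services.AddScoped&amp;lt;IBudgetRepository&amp;gt;(_ =&amp;gt; new SqlBudgetRepository(connectionString));
        services.AddScoped&amp;lt;BudgetingService&amp;gt;();
        return services;
    }
}
&lt;/code&gt;&lt;/pre&gt;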


&lt;p&gt;Then in order to plug this module into my main API:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
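&lt;p&gt;Again a sketch rather than the literal code, the host API simply opts in to the modules it wants at startup:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;var builder = WebApplication.CreateBuilder(args);

// Each module brings its own registrations; leave one out and it simply isn't wired up.
builder.Services.AddBudgetingModule(
    builder.Configuration.GetConnectionString("Budgeting"));
builder.Services.AddUserAdministrationModule();

var app = builder.Build();
// ...endpoint mapping etc. omitted...
app.Run();
&lt;/code&gt;&lt;/pre&gt;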


&lt;p&gt;This makes the overall modular structure and direction of dependencies clearer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2024/01/traxpense.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FE_EZUJq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2024/01/traxpense.png%3Fw%3D624" alt="" width="624" height="1023"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether it will be as easy to pull “microservices” out of this kind of structure &lt;em&gt;in general&lt;/em&gt; as I think, or whether the DI-hook style of design is robust, remains to be seen; this is still something I am settling into and I am still trying out different structures.&lt;/p&gt;

&lt;h5&gt;
  
  
  Improvement 4: Implementing Domain Events
&lt;/h5&gt;

&lt;p&gt;The event storm also highlighted key domain events in the Budgeting and Expense Tracking context that I had never captured in code before. When I first thought about adding these to the code, I almost went the whole hog and implemented the &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html"&gt;TRANSACTIONAL OUTBOX&lt;/a&gt; pattern to store emitted domain events durably, so that I could publish them outside of the bounded context whenever I wanted. However, realising that at the moment there is no consumer for these events other than the Budget Summarisation context (and there likely never will be), and that the Budget Summarisation context needs some more refinement as to its current and future capabilities, I decided not to implement the TRANSACTIONAL OUTBOX pattern and to avoid that architectural complexity and cost.&lt;/p&gt;

&lt;p&gt;But what I did decide to do was model these events in the domain code, so I can at least capture them when the state changes and verify them in tests. For now these events will not be durable (they only stay in memory for as long as the aggregate does), but that helps me amortise the cost of implementing events over a period of time.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
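&lt;p&gt;To give a flavour of what that looks like (the event names come from the event storm, the exact shapes are illustrative), the events are just small immutable types in the domain model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Marker interface for all domain events in the Budgeting and Expense Tracking context.
public interface IDomainEvent
{
    DateTime OccurredAtUtc { get; }
}

public sealed record PotAllocated(Guid BudgetId, Guid PotId, decimal Amount,
    DateTime OccurredAtUtc) : IDomainEvent;

public sealed record ExpenseAdded(Guid BudgetId, Guid PotId, Guid ExpenseId,
    decimal Amount, DateTime OccurredAtUtc) : IDomainEvent;

public sealed record BudgetClosed(Guid BudgetId, DateTime OccurredAtUtc) : IDomainEvent;
&lt;/code&gt;&lt;/pre&gt;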


&lt;p&gt;In order to record these events in the aggregate, I made the aggregate root responsible for keeping track of all events in memory:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
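&lt;p&gt;Something along these lines (a sketch, not the literal implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public abstract class AggregateRoot
{
    private readonly List&amp;lt;IDomainEvent&amp;gt; _recordedEvents = new();

    // Called by the aggregate whenever it changes state.
    protected void Record(IDomainEvent domainEvent) =&amp;gt; _recordedEvents.Add(domainEvent);

    // Hands the recorded events over (to tests, or to a publisher/outbox later)
    // and clears the in-memory list.
    public IReadOnlyCollection&amp;lt;IDomainEvent&amp;gt; ReleaseEvents()
    {
        var events = _recordedEvents.ToArray();
        _recordedEvents.Clear();
        return events;
    }
}

// Inside the aggregate, a state change then records the corresponding event, e.g.:
//     Record(new PotAllocated(Id, potId, amount, DateTime.UtcNow));
&lt;/code&gt;&lt;/pre&gt;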


&lt;p&gt;The advantage of this approach is that the recorded events are co-located with the aggregate, meaning they are agnostic to the publishing strategy: I can either publish them right away after the state change (less reliable, because the database update and the event publishing are not atomic), or I can implement the TRANSACTIONAL OUTBOX pattern to store the events in the database durably and atomically, and publish them later. This approach is also a lot cleaner if, like me, you believe domain events should be treated as a first-class part of the domain model.&lt;/p&gt;
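&lt;p&gt;For example, an application service could hand the recorded events to a publisher straight after persisting the aggregate; swapping this for an outbox later would only change what happens to the released events. Another hedged sketch, with illustrative abstractions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public sealed class AddExpenseHandler
{
    private readonly IBudgetRepository _budgets;
    private readonly IEventPublisher _publisher;   // illustrative abstraction

    public AddExpenseHandler(IBudgetRepository budgets, IEventPublisher publisher)
    {
        _budgets = budgets;
        _publisher = publisher;
    }

    public async Task Handle(Guid budgetId, Guid potId, decimal amount, string description)
    {
        var budget = await _budgets.Get(budgetId);
        budget.AddExpense(potId, amount, description);

        await _budgets.Save(budget);                   // state change persisted first
        foreach (var evt in budget.ReleaseEvents())    // then events published;
            await _publisher.Publish(evt);             // note: not atomic with the save
    }
}
&lt;/code&gt;&lt;/pre&gt;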

&lt;p&gt;I have documented some lessons learned architecting event-driven systems &lt;a href="https://amanagrawal.blog/2019/08/27/building-an-event-driven-architecture-lessons-learned/"&gt;here&lt;/a&gt; and &lt;a href="https://amanagrawal.blog/2021/11/29/some-more-lessons-learned-in-event-driven-architecture/"&gt;here&lt;/a&gt;; they might be worth a look because I am not going to go into more detail in this post.&lt;/p&gt;

&lt;p&gt;This is not the end of the improvements; DDD is a process of continuous learning, so I am sure that in a few months I will revisit the domain model again as my understanding of DDD and the system design evolves, and see more areas for improvement. For example, the Budget Summarisation context is something I have not put a lot of effort into, because my reporting needs are not too complex at the moment, but it could be the next thing I look at in order to realise the overall goal. I also have to revisit all my primitive types and see where the model can benefit from &lt;a href="https://martinfowler.com/bliki/ValueObject.html"&gt;Value Objects&lt;/a&gt;, though there is nothing wrong with using primitive types where they make sense.&lt;/p&gt;
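&lt;p&gt;As a small example of the kind of thing I mean by a Value Object (hypothetical, not something in the codebase today), an amount could be wrapped in a type that guards its own validity instead of being passed around as a bare decimal:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// A Value Object: defined entirely by its values, immutable, and self-validating.
public sealed record Money
{
    public decimal Amount { get; }
    public string Currency { get; }

    public Money(decimal amount, string currency)
    {
        if (amount &amp;lt; 0)
            throw new ArgumentOutOfRangeException(nameof(amount), "Amount cannot be negative.");
        if (string.IsNullOrWhiteSpace(currency))
            throw new ArgumentException("Currency is required.", nameof(currency));

        Amount = amount;
        Currency = currency.ToUpperInvariant();
    }

    public Money Add(Money other)
    {
        if (other.Currency != Currency)
            throw new InvalidOperationException("Cannot add amounts in different currencies.");
        return new Money(Amount + other.Amount, Currency);
    }
}
&lt;/code&gt;&lt;/pre&gt;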

&lt;h3&gt;
  
  
  Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Taking a database-centric view of the system is likely to result in unwieldy code that doesn’t capture the domain behaviour and doesn’t speak the language of the domain. What’s worse, you as an engineer aren’t going to be able to speak the language of the domain either. Having to constantly translate between the tech and business contexts leaves room for misinterpretation and assumption, and this is where most complexity in software comes from. Learning the business domain and engaging with the domain experts regularly to sharpen your domain language is crucial; you will view code a lot differently.&lt;/li&gt;
&lt;li&gt;Approaching DDD from a purely tactical point of view, i.e. jumping straight into entities, value objects, aggregates or repositories (like I did years ago), is likely to put you at odds with your business’ value streams, and the models you create are likely to be either over-engineered or under-engineered. They might not use the ubiquitous language, and you are going to miss out on a whole bunch of business concepts that give your model meaning. At best this creates fake and anaemic domain models with very little behaviour and invariant enforcement, primitive obsession and a general lack of expressiveness. Nothing wrong with it if that’s how you want to build software, but I believe we can do better than that.&lt;/li&gt;
&lt;li&gt;If you are going to “do DDD”, I very much recommend starting from the strategic part of DDD: exploring the business flows collaboratively with domain experts using big picture event storming, identifying/clarifying potential sub-domains/bounded contexts (be prepared to get these wrong in the first few attempts), drilling deeper with Process and Design Level Event Storming to identify potential commands, data, systems and aggregates, and creating context maps to identify relationships between bounded contexts. From these you will be better able to distil a model in code that is aligned with the business flows and the ubiquitous language. You don’t have to use all the tools DDD has to offer all the time, and you don’t have to have used all of them upfront before you code. Often you are going to have to apply strategic DDD and tactical DDD iteratively in order to implement, learn and improve &lt;em&gt;(or red, green, refactor if you will)&lt;/em&gt;. Practices like TDD can also work really well to incrementally improve the model. &lt;/li&gt;
&lt;li&gt;Aggregate design should try to strike a balance between the invariants that need to be enforced, the transactional consistency requirements and the performance of the aggregate. Splitting an aggregate too much might make invariant enforcement more complex and the domain objects potentially anaemic; conversely, very large aggregates could present performance bottlenecks yet add no value if there is no behaviour in them and no invariants to enforce.&lt;/li&gt;
&lt;li&gt;There is no single perfect model, there are just different models for different purposes. A good enough model will almost never be arrived at in one go or linearly; the process of exploration and discovery is divergent, and that’s where insights come from. Don’t stop at the first model you create: explore alternatives, even if only to discard them and stick with the original implementation (like I did here). At least you’ll have made the decision deliberately and you will have generated ideas for the future.&lt;/li&gt;
&lt;li&gt;Doing an event storming session and identifying domain events doesn’t always mean implementing all of them in code. Sometimes we use events to help us reason about the business process, but the real-life artifact might just be a database table (i.e. a read model) that you read frequently. You might want to capture the important domain events in code upfront but, instead of publishing them right away, just record them transiently in memory (like I did here). The need to trigger multiple business processes downstream, to reduce temporal coupling, and/or to generate analytical insights from critical domain events are usually the triggers I look for before publishing events outside the bounded context. Until then they are internal events.&lt;/li&gt;
&lt;li&gt;Embrace eventual consistency; it’s the natural order of things, but drop it where it doesn’t fit. If your stakeholders’ mental model and the problem space aren’t naturally asynchronous, even the smallest amount of eventual consistency can be seen as a problem. If the problem space is naturally asynchronous, or can tolerate eventual consistency without negative impact to the business process, embrace it with both hands.&lt;/li&gt;
&lt;li&gt;Using DDD need not result in microservices; it’s more important to identify boundaries and create a model that aligns with the business process and solves the business problem. You can then start with a modular monolith and “microfy” it when there are strong enough drivers for it.&lt;/li&gt;
&lt;li&gt;You cannot do DDD once and be done with it any more than you can write code once and be done with it; it’s a continuous learning and improvement process. The more you think about your domain and system, the more questions you will have, and resolving those questions with domain experts will often result in new insights that can improve the system design just that much more. The most important improvements come not from one or two large-scale design changes but from small and continuous ones &lt;em&gt;(this is how biological evolution works: all the complexity and the highly optimised/adapted species we see weren’t the result of one or two large-scale changes but of millions of micro-increments over a long time)&lt;/em&gt;. I was wholly confident during this exercise that I would happen upon some genius insight that would improve the system design by leaps and bounds, but look at the scale of the changes: renames, removing some code and data, restructuring the solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope you found that somewhat useful, if tiring. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" alt="🙂" width="72" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ddd-crew/ddd-starter-modelling-process"&gt;DDD Crew – Starter Modeling Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nicktune.medium.com/"&gt;Nick Tune on DDD and Architecture (Medium)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://verraes.net/#blog"&gt;Matthias Verraes on Models and Modeling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dddheuristics.com/design-heuristics/"&gt;DDD Heuristics by Virtual DDD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>design</category>
      <category>programming</category>
      <category>ddd</category>
    </item>
    <item>
      <title>An Exercise in Domain Modeling Guided By Strategic Domain Driven Design – Part 1</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Fri, 05 Jan 2024 07:00:00 +0000</pubDate>
      <link>https://forem.com/explorer14/an-exercise-in-domain-modeling-guided-by-strategic-domain-driven-design-part-1-2po2</link>
      <guid>https://forem.com/explorer14/an-exercise-in-domain-modeling-guided-by-strategic-domain-driven-design-part-1-2po2</guid>
<description>&lt;p&gt;I have written before about my pet project, a personal expense tracking application. This application can trace its lineage to VB 6.0 and an MS Access database, so that should tell you how long I have been working on it on the side. Spoiler: around 18 years!&lt;/p&gt;

&lt;p&gt;When I first started on this project back in 2004-2005, I took a very database-centric view of the design: starting with the entities I needed, mapping them to database tables in an entity relationship diagram, thinking about database interaction first, etc. This is just how most people, including myself, built software back in the dark days &lt;em&gt;(some people still build like that today, change is hard for some I guess)&lt;/em&gt; 🤷‍♂️.&lt;/p&gt;

&lt;p&gt;Nonetheless, it is a functioning application that my wife and I use on a regular basis, and it also serves as a test/exercise bench for me to try out various ideas, practices and patterns. In around 2016-17, I redesigned the application using the DDD tactical patterns; you can read about that journey &lt;a href="https://amanagrawal.blog/?s=building+domain+driven+architecture+in+.NET+-+Part"&gt;here&lt;/a&gt;. Because it had been 5-6 years since I created the current domain model, it was time to review and improve it based on my growth in learning about DDD in the intervening years. It’s a good way to test whether you have actually learned anything over time or are just clocking more years.&lt;/p&gt;

&lt;p&gt;⚠️ &lt;strong&gt;DISCLAIMER:&lt;/strong&gt; the application is not particularly complex, and neither is the domain logic. There are plenty of free and paid tools that do a fantastic job of tracking expenses and offer advanced capabilities, so I am not competing with them here. The point of this post is to showcase one way the DDD patterns, strategic and tactical, can be applied and how domain modeling can be approached, not how to build personal finance software. Personally and professionally I find DDD very useful at various levels, not only in building knowledge of the domain but also in reflecting that knowledge in code using the language of the domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Revisiting the Problem and Solution Space
&lt;/h3&gt;

&lt;p&gt;In order to evolve the model, I needed to revisit the problem space, and this is where the application of &lt;strong&gt;strategic DDD&lt;/strong&gt; comes in. Simply put, the broader problem context can be defined as wanting to maintain good personal financial health, making sure I am able to meet the expenses that are critical, and minimising waste.&lt;/p&gt;

&lt;p&gt;My broader solution context is keeping track of expenses over time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-15.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eO3kYowG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-15.png%3Fw%3D860" alt="" width="800" height="433"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;High level problem and solution spaces&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In strategic DDD parlance this would be akin to &lt;a href="https://github.com/ddd-crew/ddd-starter-modelling-process?tab=readme-ov-file#understand"&gt;understanding&lt;/a&gt; the business model and vision, but something like a business model canvas would have been overkill for my operation, so a broad understanding of the problem and solution contexts was sufficient to get started.&lt;/p&gt;

&lt;h3&gt;
  
  
  Big Picture Event Storming
&lt;/h3&gt;

&lt;p&gt;Given this problem and the broader solution context, the next step was to break down the domain of &lt;strong&gt;personal financial health maintenance&lt;/strong&gt; into sub-domains, i.e. areas of discrete responsibilities and capabilities that I can more easily reason about and design software around (or not). For this I used the &lt;a href="https://virtualddd.com/learning-ddd/ddd-crew-eventstorming-glossary-cheat-sheet#big-picture-eventstorming"&gt;big picture event storming&lt;/a&gt; technique:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-18.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uxWsOQAq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-18.png%3Fw%3D1024" alt="" width="800" height="317"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Big picture event storm for the whole domain&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this exercise I mapped out the whole flow from start to finish using all the domain events and the roles involved. Of course, in this 2-person operation of mine there really are only 2 roles: Expense Administrator (me) and Expense Manager (my wife, and sometimes me). There are no additional requesters because this is not a general-purpose system, but I still wanted to visualise that part for completeness’ sake.&lt;/p&gt;

&lt;p&gt;The big picture event storm primarily shows domain events across the entire business, but I also find it helpful to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highlight the various user roles that are responsible for generating these events (small yellow stickies). If the same user role is responsible for triggering a group of related events in close proximity, it generally indicates that those events should be within the same sub-domain&lt;/li&gt;
&lt;li&gt;Identify pivotal events &lt;em&gt;(ones with a green vertical bar in the above storm)&lt;/em&gt;, i.e. the events that trigger other processes downstream, potentially owned and operated by different user roles and possibly in different sub-domains or bounded contexts. These are identifiable by a change in the language used (i.e. “expense manager login credentials sent” –&amp;gt; “budget opened” –&amp;gt; “budget summarised”). And,&lt;/li&gt;
&lt;li&gt;Indicate passage of time where the events in a sequence are not strictly required to happen right away but can have a large time gap between them (think: workflows), because that can help identify aggregate boundaries that will be crucial from a data consistency point of view. Things that don’t have to happen together don’t have to be immediately consistent; those can be prime candidates to be broken up further.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Taking into account the above indicators, I created the following groupings of event stickies, each representing a potential sub-domain:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-19.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Oh4-INOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-19.png%3Fw%3D1024" alt="" width="800" height="338"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Big-picture event storm with first draft of sub-domains identified&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User administration sub-domain&lt;/strong&gt; is responsible for registering users into, and deactivating them from, the system. It also sends newly registered users their login credentials. Without this crucial first step, users cannot use the system. People operating in this sub-domain are called &lt;strong&gt;Expense Administrators&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budgeting and Expense Tracking sub-domain&lt;/strong&gt; is responsible for opening new budgets, allocating pots and tracking expenses against them. Eventually the budget will be closed when the time is right. People operating in this sub-domain are called &lt;strong&gt;Expense Managers&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget Summarisation sub-domain&lt;/strong&gt; is responsible for generating summarisation reports from active and closed budgets. This serves as a useful trend tracker for spend vs saving per budget. These summaries are generated by the system responding to various events from the Budgeting and Expense Tracking sub-domain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking at this grouping of sub-domains and the processes they are responsible for, each sub-domain mapped pretty much one-to-one with a &lt;strong&gt;&lt;a href="https://martinfowler.com/bliki/BoundedContext.html"&gt;bounded context&lt;/a&gt;&lt;/strong&gt;, i.e. an area of the sub-domain with a consistent language and meaning of terms, which communicates with other bounded contexts using messages (synchronous or asynchronous) without revealing its internal complexity and details (in other words, it is encapsulated).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; in other, more complex environments this may not always be true, i.e. a single sub-domain might comprise multiple bounded contexts depending on the complexity of the business processes and value streams. I am writing another article on a real-life example of such a situation, so watch this space!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is no single best or correct split for bounded contexts and/or sub-domains; multiple models might work at different levels. The key thing to bear in mind is that models do not reflect reality 100%: they are only an &lt;a href="https://sciencing.com/limitations-models-science-8652502.html"&gt;approximation of reality&lt;/a&gt; and as such make some simplifying assumptions and impose some constraints to help us design a working software system.&lt;/p&gt;

&lt;p&gt;This doesn’t mean we play fast and loose with models; it means it might not make ROI sense to model reality 100% in a software system, otherwise the complexity would be too great. A model that covers maybe 90-95% of cases can be good enough, with the remainder handled offline where feasible. The “correctness”, or rather the suitability, of a model is directly proportional to its ability to solve the business problem at hand and the problems that are likely to arise in the future. The first couple of cuts are likely to be wrong or sub-optimal, so starting somewhere and iterating over time is a good idea.&lt;/p&gt;

&lt;h3&gt;
  
  
  Process Level Event Storming
&lt;/h3&gt;

&lt;p&gt;Since we are talking about designing software, the big picture event storm is a good starting point to map the overall complexity of all key flows and identify groups of capabilities but it doesn’t give me the details necessary to converge onto a potential software design.&lt;/p&gt;

&lt;p&gt;Therefore I need to dig deeper and discover the internal details of each of these bounded contexts: what actions are taken, with what information, using what systems, what the reactions to various events are, etc. For this I used the &lt;a href="https://virtualddd.com/learning-ddd/ddd-crew-eventstorming-glossary-cheat-sheet#process-modelling-eventstorming"&gt;Process Modelling&lt;/a&gt; variant of event storming, which introduces more coloured stickies &lt;em&gt;(blue for commands, green for read models or data, pink for external systems, lilac for business policies, yellow for aggregates)&lt;/em&gt;. Putting those puzzle pieces in, I got this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-20.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ON4Pv0Hs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-20.png%3Fw%3D1024" alt="" width="800" height="212"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Process modelling event storm (don’t try and read the text, I will explain separately)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I will narrate the story line quickly for each bounded context:&lt;/p&gt;

&lt;h4&gt;
  
  
  User Administration
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-21.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K_KiDBeS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-21.png%3Fw%3D1024" alt="" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A requester who wants access to the system asks the expense administrator (EA) for access. The EA uses the requester’s first name, last name, their choice of user name and a secure password to register the requester as an expense manager via the user administration system. When the expense manager is registered successfully, the login credentials are given to the new expense manager. The expense manager is now ready to use the budgeting and expense tracking system.&lt;/p&gt;

&lt;p&gt;If the requester wants to stop using the system, they ask to be deactivated and the EA deactivates them via the user administration system. This is the terminal state for this expense manager.&lt;/p&gt;

&lt;p&gt;In both use cases, an action failure triggers a retry until the operation succeeds.&lt;/p&gt;

&lt;h4&gt;
  
  
  Budgeting and Expense Tracking
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-22.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LQWqlg0F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-22.png%3Fw%3D1024" alt="" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The expense manager logs into the budgeting system and takes the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system is empty, or has some past budgets but no active budget. In this case, the expense manager first opens a new budget with a certain start and end date. This triggers the &lt;strong&gt;&lt;em&gt;budget opened&lt;/em&gt;&lt;/strong&gt; event. They might then decide to allocate expense pots right away or decide to do it later. Allocating a pot triggers a &lt;strong&gt;&lt;em&gt;pot allocated&lt;/em&gt;&lt;/strong&gt; event.

&lt;ul&gt;
&lt;li&gt;If they make a mistake whilst allocating a pot and the pot &lt;strong&gt;has no expenses&lt;/strong&gt;, they can deallocate the pot and reallocate the correct one. Deallocating a pot triggers a &lt;strong&gt;&lt;em&gt;pot deallocated&lt;/em&gt;&lt;/strong&gt; event&lt;/li&gt;
&lt;li&gt;If they make a mistake whilst allocating a pot and the pot &lt;strong&gt;has expenses&lt;/strong&gt;, the system will not allow that pot to be deallocated, because it’s not clear what should then happen with the expenses. Expenses are immutable: they happened in real life, so we can’t just remove them &lt;em&gt;(see the sketch after this list for how such a rule can sit inside an aggregate)&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The system already has an active budget and allocated pots. In this case, they might decide to add expenses using information like the date of the expense, the amount spent, a description and the pot. For each expense added, an &lt;strong&gt;&lt;em&gt;expense added&lt;/em&gt;&lt;/strong&gt; event is triggered

&lt;ul&gt;
&lt;li&gt;If they make a mistake whilst adding an expense (e.g. adding it to the wrong pot or with the wrong details), they can reverse that expense and re-add it correctly. Once again, expenses are to be treated as immutable because we should have a full log of expense activity for reconciliation or diagnosis later, so editing or simply removing an expense is not allowed. Reversing an expense triggers an &lt;strong&gt;&lt;em&gt;expense reversed&lt;/em&gt;&lt;/strong&gt; event&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The system has an active budget that has reached its end date. At this point the expense manager can close the budget, which triggers the &lt;strong&gt;&lt;em&gt;budget closed&lt;/em&gt;&lt;/strong&gt; event&lt;/li&gt;
&lt;/ul&gt;
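&lt;p&gt;As a rough sketch of how the pot deallocation rule above could be enforced inside a single Budget aggregate (illustrative code only; the actual model is the subject of the next post):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public sealed class Budget
{
    private readonly List&amp;lt;Pot&amp;gt; _pots = new();
    private readonly List&amp;lt;Expense&amp;gt; _expenses = new();

    public void DeallocatePot(Guid potId)
    {
        // Invariant: a pot that already has expenses recorded against it cannot be
        // deallocated, because expenses are immutable facts that happened in real life.
        if (_expenses.Exists(e =&amp;gt; e.PotId == potId))
            throw new InvalidOperationException("A pot with expenses cannot be deallocated.");

        _pots.RemoveAll(p =&amp;gt; p.Id == potId);
    }
}
&lt;/code&gt;&lt;/pre&gt;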

&lt;p&gt;I cheated here a little bit: for this part of the event storm I used &lt;a href="https://virtualddd.com/learning-ddd/ddd-crew-eventstorming-glossary-cheat-sheet#software-design-eventstorming"&gt;Design Level Event Storming&lt;/a&gt; by bringing in aggregates instead of generic systems, just because it’s the bounded context I am focussing on for remodeling. In newer revisions of this technique the word “aggregate” has been replaced with “business constraint”, but for my case I will stick with the former.&lt;/p&gt;

&lt;h4&gt;
  
  
  Budget Summarisation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-23.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iq3Lg531--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-23.png%3Fw%3D1024" alt="" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to keep track of spend vs saving trends over time, this bounded context listens to some key events and creates a projection for these data points &lt;em&gt;(a small sketch follows the list)&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pot allocated/deallocated (to keep track of total allocation per budget)&lt;/li&gt;
&lt;li&gt;Expense added/reversed (to keep track of total spend per budget)&lt;/li&gt;
&lt;li&gt;Budget closed (to trigger the calculation of total savings per budget)&lt;/li&gt;
&lt;/ul&gt;
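&lt;p&gt;Purely to illustrate the idea (this is not code that exists in the project, and the event and summary types are assumed), such a projection can be as simple as a handler that keeps running totals per budget:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public sealed class BudgetSummaryProjection
{
    private readonly Dictionary&amp;lt;Guid, BudgetSummary&amp;gt; _summaries = new();

    public void When(PotAllocated e) =&amp;gt; Summary(e.BudgetId).TotalAllocated += e.Amount;
    public void When(ExpenseAdded e) =&amp;gt; Summary(e.BudgetId).TotalSpent += e.Amount;

    public void When(BudgetClosed e)
    {
        // Illustrative definition of "saving": whatever was allocated but not spent.
        var s = Summary(e.BudgetId);
        s.TotalSaved = s.TotalAllocated - s.TotalSpent;
    }

    private BudgetSummary Summary(Guid budgetId) =&amp;gt;
        _summaries.TryGetValue(budgetId, out var s) ? s : (_summaries[budgetId] = new BudgetSummary());
}

public sealed class BudgetSummary
{
    public decimal TotalAllocated { get; set; }
    public decimal TotalSpent { get; set; }
    public decimal TotalSaved { get; set; }
}
&lt;/code&gt;&lt;/pre&gt;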

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/12/image-24.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M1kjis2p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/12/image-24.png%3Fw%3D698" alt="" width="698" height="157"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Events can be used to calculate aggregated values&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These calculations can be done in the following ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously as a budget undergoes relevant changes (asynchronously) &lt;/li&gt;
&lt;li&gt;At regular intervals by summarisation context pulling information from the budgeting context (batch), or &lt;/li&gt;
&lt;li&gt;Just in time when an expense manager asks for it. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It depends on how computationally expensive these calculations will be and whether or not the expense manager can deal with eventual consistency, i.e. the summary being updated a little while after the budget undergoes a relevant change. It also depends on how many other consumers there are for this information. The more computationally expensive these calculations are, or the more consumers of this information there are, the more sense it might make to build this in an event-driven way. Otherwise, the simplicity of synchronous, on-demand summary generation might suffice. This sub-domain is not the focus of the re-modeling exercise for now, so I won’t spend any more words on it.&lt;/p&gt;

&lt;p&gt;Just because you do an event storming session and discover all the domain events doesn’t always mean that all of those events need to be implemented in code, or that asynchrony should be introduced where it doesn’t bring specific value. The event-driven &lt;strong&gt;mindset&lt;/strong&gt; is more important here, so that we know that if the situation changes, we can switch to an event-driven model as well. I find it valuable not to constrain the system design too much towards a specific approach, because that takes optionality away, which reduces engineering agility. A good design decision is also reversible, or can be implemented later without significant rework. If I am not sure an event is valuable enough to be implemented in code, but I &lt;strong&gt;feel&lt;/strong&gt; it should be implemented because some day we will need it, I generally won’t add it; I would instead make sure that adding that event later is not going to be a gargantuan effort. Push comes to shove, the most I’d do is create the events in my domain model code but only record them in memory and not publish them outside of my bounded context until a use case becomes clearer. For this exercise I opted for this approach, but it’s possible that in a work setting this approach might be seen as overkill.&lt;/p&gt;

&lt;p&gt;In the next post, I will zoom in on the core budgeting and expense tracking bounded context and talk about the current domain model implementation, the other alternatives I considered as part of the review exercise, and the final design improvement decisions I made, and wrap up with some takeaways.&lt;/p&gt;

&lt;p&gt;Stay tuned…&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ddd-crew/ddd-starter-modelling-process"&gt;DDD Crew – Starter Modeling Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nicktune.medium.com/"&gt;Nick Tune on DDD and Architecture (Medium)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://verraes.net/#blog"&gt;Matthias Verraes on Models and Modeling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dddheuristics.com/design-heuristics/"&gt;DDD Heuristics by Virtual DDD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>programming</category>
      <category>design</category>
      <category>ddd</category>
    </item>
    <item>
      <title>Monolith vs Microservices</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Sun, 01 Oct 2023 20:49:09 +0000</pubDate>
      <link>https://forem.com/coolblue/monolith-vs-microservices-14f6</link>
      <guid>https://forem.com/coolblue/monolith-vs-microservices-14f6</guid>
<description>&lt;p&gt;One of my colleagues shared this &lt;a href="https://renegadeotter.com/2023/09/10/death-by-a-thousand-microservices.html" rel="noopener noreferrer"&gt;article&lt;/a&gt; with me a few days ago and, having read through it (and many others like it before), I felt I needed to provide a &lt;em&gt;hopefully&lt;/em&gt; more balanced perspective on this age-old debate based on my own experiences and learnings. So in this post, that’s what I am going to attempt to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Successful Startup != Microservices
&lt;/h3&gt;

&lt;p&gt;The author shares this link to a &lt;a href="https://kenkantzer.com/learnings-from-5-years-of-tech-startup-code-audits/"&gt;security audit&lt;/a&gt; of startup codebases and emphasises point number 2 of that article: that all successful startups kept code simple and steered clear of microservices until they knew better.&lt;/p&gt;

&lt;p&gt;I can see why: microservices are an optimisation pattern an org may need to apply when it scales beyond a certain size. When you are a fledgling startup with limited funding and an uncertain future, microservices are the last thing you should be worried about (and preferably not at all). All the time and money at this stage should be spent on generating value and treating your employees well (the latter is not optional…ever).&lt;/p&gt;

&lt;p&gt;Here’s an &lt;a href="https://www.youtube.com/watch?v=t7iVCIYQbgk&amp;amp;pp=ygUSbW9uem8gbWljcm9zZXJ2aWNl"&gt;example&lt;/a&gt; of a modern digital bank that went all in with microservices from the get-go. They boast about their 1500+ microservices, which a while ago invited some &lt;a href="https://twitter.com/Grady_Booch/status/1190894532977520640"&gt;flak on Twitter&lt;/a&gt;. From my limited point of view on their context this looks like insanity, and though they touch on all the challenges that come with this kind of architecture, I can’t help but think that somewhere deep down they go, “Wish we hadn’t done this so soon!” But I am willing to give them the benefit of the doubt that they did their due diligence whilst evolving into a complex architecture and building a platform to support it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/10/image-1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j9s4FXxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/10/image-1.png%3Fw%3D586" alt="" width="586" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Microservices Make Security Audit Harder
&lt;/h3&gt;

&lt;p&gt;If I have 1500+ services, potentially written in different languages and spread across hundreds if not thousands of repositories, my job as a security auditor just got exponentially harder! &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cbiE9eth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f629.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cbiE9eth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f629.png" alt="😩" width="72" height="72"&gt;&lt;/a&gt; The author even mentions that in point 7 of his list: &lt;strong&gt;Monorepos are easier to audit&lt;/strong&gt;. Monzo’s 1500+ Go services are in a mono-repo, so that’s one down I guess.&lt;/p&gt;

&lt;p&gt;The security attack surface area also gets that much wider: can you ensure all 1500+ of your microservices leverage a security-hardened platform and industry best practices in a standard way? Do you even know what those are? What about the dependencies (direct and transitive) each of those services takes on external code?&lt;/p&gt;

&lt;p&gt;I think these are probably the most significant drivers for a security professional to gripe about microservices, but the more you distribute, the more standardisation on the platform front you need. You don’t want to be reinventing the wheel, especially when it comes to security, so the more sensible defaults you can bake into the platform, the better and the easier it might be to audit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do we &lt;em&gt;really&lt;/em&gt; need microservices?
&lt;/h3&gt;

&lt;p&gt;I agree with the author that in some cases there can be &lt;em&gt;“a dogma attached to not starting out with microservices on day one – no matter the problem”&lt;/em&gt;. Just because someone else (usually a multi-billion dollar organisation with a global footprint and tens of thousands of engineers, think FAANG) is doing microservices doesn’t mean my 5-person startup also needs microservices.&lt;/p&gt;

&lt;p&gt;But I have to add a bit of nuance here: an org doesn’t have to reach FAANG scale to realise it needs to rearchitect. If my org is growing in terms of revenue, size and technology investment, then regularly asking the following kinds of questions is part of engineering due diligence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the current monolithic architecture with a shared database still the right thing to do?&lt;/li&gt;
&lt;li&gt;Are we facing challenges in some areas where our current architecture is impeding our value delivery? If so, what might be some ways to alleviate that pain?&lt;/li&gt;
&lt;li&gt;How much longer can this system keep growing at the same pace as the org and still be maintainable and agile? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agile organisations and agile architectures are the ones that can evolve with time and need. The complexity of the architecture should be commensurate with the organisation’s growth rate and ambitions. No more, no less.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do web cos grow into microservices?
&lt;/h3&gt;

&lt;p&gt;None of the web cos evolved to microservices overnight; it was a long, arduous journey over decades (far longer than most employees’ tenure in an organisation, btw). Here’s &lt;a href="https://www.infoq.com/presentations/shoup-ebay-architectural-principles/"&gt;E-bay’s&lt;/a&gt; journey to microservices, here’s &lt;a href="https://www.infoq.com/presentations/microservices-netflix-industry/"&gt;Netflix’s&lt;/a&gt; and here’s &lt;a href="https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html"&gt;Amazon’s&lt;/a&gt;. In all cases you will notice that even though today they are microservice behemoths, they started the thinking and the groundwork many years prior, when they were much smaller than they are today. Amazon, for example, started their thinking back in 1998, a full 25 years ago, which ultimately resulted in the manifesto linked above.&lt;/p&gt;

&lt;p&gt;This is a testament to the forward thinking and agility that helped them survive and succeed: if they had waited until they got to today’s scale (assuming they ever managed to reach it in the first place) to start decomposing their architecture for growth and evolution, they probably wouldn’t have made it.&lt;/p&gt;

&lt;p&gt;So just fiendishly touting “there is nothing wrong with monolith” or “don’t do microservices” without justifying the arguments or clarifying the nuances is no different than someone wanting to have 1500+ microservices because someone else is doing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Look at where you are and where you want to be
&lt;/h3&gt;

&lt;p&gt;It’s also true that many organisations are still monolithic-ish (from a technical pov), for example &lt;a href="https://hanselminutes.com/847/engineering-stack-overflow-with-roberta-arcoverde"&gt;StackOverflow&lt;/a&gt; and &lt;a href="https://blog.quastor.org/p/shopify-ensures-consistent-reads"&gt;Shopify&lt;/a&gt;, and there are probably more. But it’s not like StackOverflow will never entertain the possibility: they have multiple teams that are responsible for various parts of SO, so if they need to scale and increase the fault tolerance of a specific set of teams, they can always factor services out.&lt;/p&gt;

&lt;p&gt;The article also gives the examples of Instagram and Threads, but what it omits is that &lt;a href="https://newsletter.pragmaticengineer.com/p/building-the-threads-app"&gt;Threads&lt;/a&gt; is built on top of Meta’s massive platform, which is a collection of different and largely reusable services. Can you imagine the complexity of building something like that from the ground up?&lt;/p&gt;

&lt;p&gt;I can be pro-monolith and pro-large-shared-database as an organisation, as long as I regularly and critically review my architecture to sense signs of trouble and am mature enough to evolve it into a better state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problems with Distribution
&lt;/h3&gt;

&lt;p&gt;Here’s where I probably agree somewhat with the author but I also think these are not problems unique to microservices:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Say goodbye to DRY&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Somewhat yes, but mostly no! It depends on what is being duplicated and whether it can &lt;em&gt;really&lt;/em&gt; be considered duplication. If it’s knowledge of a domain concept that’s being duplicated, then that’s bad and usually an indication of incorrect boundaries. If it’s a data &lt;strong&gt;contract&lt;/strong&gt; on the provider and consumer ends, that’s not really duplication.&lt;/p&gt;

&lt;p&gt;This is also not a service-architecture-only problem: given a sufficiently large monolithic codebase (and depending on how well it’s modularised), I can bet you can duplicate knowledge in a monolith as well, because in a rush to deliver, that’s just how engineers behave. Granted, it might be easier to spot and remedy if all the code is in one place than if it’s spread across multiple codebases, but then that’s what you want to do even in a service architecture, i.e. combine logically related codebases to reduce knowledge duplication. Nothing about a many-service architecture stops you from combining services when you need to.&lt;/p&gt;

&lt;p&gt;As a matter of fact, in my teams we’re simplifying our many-service architecture into a smaller set of carefully combined services. &lt;strong&gt;Note: services are not going anywhere, they are just getting a little less…micro. We are still working to decompose our shared monolithic e-commerce database by defining ownership boundaries around business capabilities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When combining services is not really an option, creating packaged libraries for common functionality and pushing them up to a central package registry for easier reuse is the next best thing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Developer ergonomics will crater&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yep! For new joiners in a team, even with all the support, guidance and onboarding, knowing the whole landscape can be quite daunting. And yes, over time you tend to build a solid mental model and you can find the exact line of code in the exact service that lies on the critical path with a 2-minute GitHub search, but it can still be a long time before that happens.&lt;/p&gt;

&lt;p&gt;Not to mention the time wasted just trying to get a service that doesn’t get changed often up and running on an engineer’s machine, because people forget things they don’t look at and the environment also changes.&lt;/p&gt;

&lt;p&gt;But once again, having a monolith doesn’t make it magically easier, especially if the monolith is sufficiently large. I would still need to make sure all the configuration for all the modules is set up to bring the system up locally, regardless of whether or not I need to touch that part of the system. With separate services, you only pay that cost for the module that you need to work on. Of course, a lot depends on how the monolith is designed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Integration tests – LOL&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yeah, kind of! But I would challenge this by saying that meaningful and fast integration testing in any sizeable organisation (think 40 different domains and 500+ engineers) left the building long ago. Integration testing, though useful, shouldn’t be the only way we test our code, monolith or microservices, because unless you are building your own payment gateway or geocoding platform, even your monolith will have external dependencies. You can forget about being able to do reliable and fast integration testing against those.&lt;/p&gt;

&lt;p&gt;I would hate to see your testing code if the only tests you ever have are opaque integration tests with complicated dependency setup. How would one even reason about those tests? And if I can’t reason about them, I would probably disable or remove them, or they’d get flaky over time, in which case I’d have even less confidence to deploy changes. Having said that, the more dependencies you have (e.g. with microservices) the harder integration testing becomes, but the same is true of “the more integration tests you have, the harder it can be to maintain those tests”.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“observability” is the new “debugging in production”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Observability as a practice is not restricted to microservices or monoliths. It’s just a sensible thing to want to do to get visibility into what the system is doing and how it is performing &lt;strong&gt;over time&lt;/strong&gt;. It is essential for debugging production systems &lt;em&gt;(mono or micro)&lt;/em&gt;. You can’t step-thru debug code in production &lt;em&gt;(though I have done it in the past with the &lt;a href="https://learn.microsoft.com/en-us/visualstudio/debugger/remote-debugging?view=vs-2022"&gt;Visual Studio Remote Debugging&lt;/a&gt; feature, and back then it wasn’t a nice experience)&lt;/em&gt;. Even if you could debug that way, the problem may not always be replicable in production because the environment is not 100% predictable, and that’s why I rely on logs and metrics to observe the system’s performance over time and create a direction for my debugging, or to understand its rhythm.&lt;/p&gt;

&lt;p&gt;No integration test can give you the profile over time that good monitoring does, because integration tests are a snapshot in time. Production is where the software is really tested, so yes I do want good observability in order to understand my system and troubleshoot it effectively.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What about just “services”?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read on… &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" alt="🙂" width="72" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Services are about org design and business capabilities
&lt;/h3&gt;

&lt;p&gt;Just because an org might not be planet scale doesn’t mean they can’t benefit from decomposing large systems into smaller ones to gain autonomy and resilience.&lt;/p&gt;

&lt;p&gt;What an organisation should invest in is identifying how value flows through it and which people are empowered (and have the capability) to make decisions, and drawing contextual boundaries around those groups. Creating a stable platform that minimises reinventing the wheel is also crucial as the org grows; otherwise the amount of rework/grunt work across multiple services alone will be a drag, and they will end up writing about how microservices failed them.&lt;/p&gt;

&lt;p&gt;The ideas in the &lt;a href="https://teamtopologies.com/key-concepts"&gt;Team Topologies&lt;/a&gt; book talk about this kind of org design, one that allows a better implementation of Conway’s Law. Domain Driven Design talks about &lt;a href="https://www.martinfowler.com/bliki/BoundedContext.html"&gt;bounded contexts&lt;/a&gt; that create these &lt;em&gt;relatively&lt;/em&gt; autonomous zones within an organisation, coupled loosely from a functional and technical perspective.&lt;/p&gt;

&lt;p&gt;Focusing on the flow of value and organisation design should result in sensibly sized services that are driven by domain boundaries instead of technical wet dreaming. Micro, nano, pico or…mega…is irrelevant, because any change in service granularity will/should be triggered by changes to the business value flow it delivers, so a service should be as big as it needs to be. Splitting services for its own sake, or combining services for its own sake (because you drank too much of the “nothing wrong with monolith” kool-aid), is ill-informed cargo-culting. That’s a sure-fire way to the madness the author is talking about.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does one determine value flow and create better boundaries?
&lt;/h3&gt;

&lt;p&gt;This needs its own post (or ten) so I will leave a cop-out list of other buzzwords to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Value-stream_mapping"&gt;Value stream mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Big Picture Event Storming&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/articles/ddd-contextmapping/"&gt;Context Mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Domain modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;N.B.&lt;/strong&gt; The initial boundaries you draw will probably be wrong, so be prepared to revisit and refactor them. You don’t want to stick with ineffective boundaries for too long.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Closing…
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Many-service architecture (I am not calling it microservices anymore) is definitely a scaling and optimisation pattern that shouldn’t be applied haphazardly or lightly just because you think it puts you in the cool-kid category. It does add complexity because of the many moving parts, increases the failure modes to consider and might even negatively impact the system’s performance.&lt;/li&gt;
&lt;li&gt;Pay attention to business capabilities and ownership boundaries (i.e. bounded contexts) by identifying flow of value in the org&lt;/li&gt;
&lt;li&gt;Create services in correspondence to the bounded contexts and be prepared to redraw the boundaries and rearchitect both ways, that is:

&lt;ul&gt;
&lt;li&gt;If you do this due diligence then you can even design a modular monolith to start with and split when actually needed, and&lt;/li&gt;
&lt;li&gt;Armed with those insights you can even combine multiple services into fewer to align better with the contextual boundaries.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Sometimes a team reorganisation reallocates capabilities across portfolios; if you have a scruffy monolith, splitting out services to hand over will be harder than if you already had services.&lt;/li&gt;
&lt;li&gt;You cannot have a loosely coupled services architecture if you are still sharing the monolithic database. If you are carving out services from the monolith, also take your data with you. Shared databases start out innocently enough when the org is small and simple but they are like bear cubs: eventually they get bigger, scarier, toothier and then they are no fun. Make breaking up the monolithic database a part of your engineering strategy.&lt;/li&gt;
&lt;li&gt;The organisation needs to have, or be willing to build, a certain level of engineering maturity and leadership to execute a successful many-service architecture evolution on top of a stable platform.&lt;/li&gt;
&lt;li&gt;A thoughtlessly designed monolith is just as bad as thoughtlessly designed microservices.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microservices</category>
      <category>domaindrivendesign</category>
      <category>boundedcontexts</category>
      <category>conwayslaw</category>
    </item>
    <item>
      <title>Systems Thinking and Technical Debt</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Tue, 04 Apr 2023 07:00:00 +0000</pubDate>
      <link>https://forem.com/coolblue/systems-thinking-and-technical-debt-4d0o</link>
      <guid>https://forem.com/coolblue/systems-thinking-and-technical-debt-4d0o</guid>
      <description>&lt;p&gt;I repeatedly see both business stakeholders and software engineers continue to struggle to see eye-to-eye on matters of technical debt despite the fact that both are impacted by it. I attribute this to the fact that both camps speak different languages and over the last 15 some odd years I haven’t found a silver bullet that can get a 100% alignment. Engineers are driven by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code complexity, maintainability and understandability&lt;/li&gt;
&lt;li&gt;Making architecture more fault tolerant, resilient and quickly recoverable from outages&lt;/li&gt;
&lt;li&gt;Keeping up with technological changes/staying on the cutting edge&lt;/li&gt;
&lt;li&gt;Innate desire to improve software systems and not letting them rot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Business folks are driven by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Investment vs return on that investment&lt;/li&gt;
&lt;li&gt;Financial savings/profit&lt;/li&gt;
&lt;li&gt;Time to market&lt;/li&gt;
&lt;li&gt;Legal liability/other risk&lt;/li&gt;
&lt;li&gt;Short term thinking and focus on features as opposed to long term outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s like two people arguing with each other where each speaks a language the other doesn’t understand! Never going to work! This InfoQ &lt;a href="https://www.infoq.com/articles/communicating-engineering-work-business/"&gt;article&lt;/a&gt; tries to look under the hood of this communication gap between the two parties in more detail and makes some good recommendations; worth checking out.&lt;/p&gt;

&lt;p&gt;The other problem that I think hinders alignment is the lack of a holistic understanding of how technical debt affects, or is affected by, business drivers, and of some way of visualising it. You can often sense this lack of understanding when a manager says, &lt;em&gt;“I don’t really see the business value in addressing this technical debt, right now we have critical functional work to do, can we do this tech thingy later?”&lt;/em&gt;. In this post I will try to use a simplified &lt;a href="https://en.wikipedia.org/wiki/Systems_thinking#:~:text=Systems%20thinking%20is%20a%20way,complex%20contexts%2C%20enabling%20systems%20change."&gt;Systems Thinking&lt;/a&gt; modelling language to put technical debt in the larger organisational context, with the hope that it will make some sense to everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Systems Thinking to Put Technical Debt in Context
&lt;/h2&gt;

&lt;p&gt;I am going to take a crack at it by drawing a systems model using digital post-its connected by arrows (what else?). The post-its represent variables that can increase or decrease; green arrows mean that a change in one variable results in a corresponding increase in another variable, and later on red arrows will mean that a change in one variable triggers a decrease in another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠ DISCLAIMER&lt;/strong&gt;: these models are abstractions of real life systems, so they are not meant to be 100% accurate but a useful approximation to help make sense of the complexities involved and connect them to the other parts of the organisational system.&lt;/p&gt;

&lt;p&gt;For these models, I am going to use the following variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of business problems to solve/solved&lt;/li&gt;
&lt;li&gt;Amount of business value created (somewhat abstract but let’s say it’s the measure of usefulness of the solutions that help improve the business outcomes)&lt;/li&gt;
&lt;li&gt;Business success (EBITDA/revenue, new investment and expansions, new customer journeys, number of customers signed up, number of repeat customers, NPS what have you)&lt;/li&gt;
&lt;li&gt;Business pressures (slow down in business success metrics creates pressure to do more)&lt;/li&gt;
&lt;li&gt;Market forces (pandemic, war, supply chain issues, competitor action, economic turbulence etc)&lt;/li&gt;
&lt;li&gt;Internal dynamics (org politics, reorganisation and restructuring, cost cutting, lawsuits, etc). Along with market forces, this generally tends to push down an organisation’s success.&lt;/li&gt;
&lt;li&gt;Engineering velocity (roughly speaking, number of value add ideas productionised per cycle)&lt;/li&gt;
&lt;li&gt;Engineering compromises (the number of shortcuts we take whilst productionising ideas)&lt;/li&gt;
&lt;li&gt;Technical debt (well, I guess I don’t need to explain this, or do I? 🙂)&lt;/li&gt;
&lt;li&gt;Engineer motivation and trust (mostly abstract but I guess WTFs per minute can be a good metric 😉. In seriousness though, this erodes over time and can often be sensed when people abruptly leave, stop caring or become a very frustrated and challenging member of the team.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/04/image-3.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--15TT3r3p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-3.png%3Fw%3D389" alt="" width="389" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the first diagram I am going to assume a perfect world where an organisation keeps going from strength to strength forever, and the engineering velocity keeps growing in tandem as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/04/image-2.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2FYhIhXk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-2.png%3Fw%3D1024" alt="" width="880" height="602"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In a perfect world, business success and engineering velocity will continue to increase infinitely&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Business opportunities generate business problems to be solved; the more problems we solve, the more business value we generate and the more the business succeeds. This means the pressure to succeed will increase in the form of new revenue streams, new opportunities and new value streams, which places more demand on engineering velocity, which responds by solving these challenges and generating more business value in turn. The cycle just continues, resulting in an infinitely successful business and infinitely high engineering velocity with no technical debt whatsoever; it’s essentially a runaway &lt;em&gt;positive feedback loop&lt;/em&gt; in systems thinking terminology. Of course, this is living in Harry Potter land, no relation to reality whatsoever!! So let’s descend to reality, shall we?&lt;/p&gt;

&lt;p&gt;In her book &lt;a href="https://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557/ref=sr_1_2?crid=2G0VRLGPP9CI5&amp;amp;keywords=systems+thinking&amp;amp;qid=1680205424&amp;amp;sprefix=systems+thinking%2Caps%2C219&amp;amp;sr=8-2"&gt;Thinking in Systems&lt;/a&gt;, Donella Meadows observes that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;no physical system can grow forever in a finite environment.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;Meadows, Donella H.. Thinking in Systems&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is because an uncontrollably growing system will eventually tend towards instability and crash (the 2008 financial crisis is a glaring example of this runaway positive feedback loop; or, ever tried to bend a thin metal strip back and forth repeatedly until it snaps? Obvious, right?). In the light of this constraint, we can see that our model is missing other variables that serve to constrain the system so it doesn’t become a victim of its runaway success (or failure). So what would the picture look like with all these variables plugged in?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/04/image-1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G88R1SXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-1.png%3Fw%3D1024" alt="" width="880" height="740"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In a more realistic world, we need other variables that constrain the system&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Suddenly the complexity explodes!&lt;/p&gt;

&lt;p&gt;The more business problems we solve, the more value we add and the more the business succeeds, which increases the pressure for sustained success, because…let’s not get complacent, yes? Internal dynamics such as reorganisation, politics etc. and market forces such as pandemics, competitor actions, societal upheavals like wars, high inflation etc. push &lt;em&gt;against&lt;/em&gt; the business success; this creates even more business pressure to succeed and pressure to increase engineering velocity to reduce time to market and gain competitive advantage.&lt;/p&gt;

&lt;p&gt;Up to a certain point the velocity will grow quite organically, but now that we know no system can grow forever, eventually the high demand on engineering velocity will result in more and more engineering compromises and shortcuts being made. This will in turn increase the accumulated technical debt, which initially will increase velocity, but after enough of these iterations it will start to wear down engineer motivation and trust in the system and the team, as they struggle with past engineering compromises and, in the race to deliver faster, end up adding new compromises and debt on top of the existing ones. This also increases the maintenance cost of the software and eventually it starts to slow down the engineering velocity. That means a reduction in the number of business problems solved; more of the org’s investment in engineering goes towards just struggling with the technical debt rather than adding new value. This in turn results in that much less value being created overall, which will eventually start to reduce business success.&lt;/p&gt;

&lt;p&gt;If left unchecked (in some cases this does happen), this cycle can also run away in the opposite direction, a vicious cycle where an org’s engineering capabilities are actively hindering its success as opposed to enhancing it. This erosion of value creation doesn’t happen overnight, it can take a long time (often years) to build up, but in the end it’s like the org is paying the engineering teams to actively sabotage itself, and that’s horrifying! But since no system can grow or shrink forever, interventions will eventually be made to salvage the situation, which inevitably leads to &lt;strong&gt;Big Bang Rewrites&lt;/strong&gt; of all the “legacy” systems. This creates its own problems (not represented in the diagram), for e.g. the time and money cost of the rewrite will further erode the business value proposition of the system because value won’t be created until the first version of the “new” system goes live, leading to less business success, increasing costs and increasing management pressure to deliver successfully &lt;em&gt;this&lt;/em&gt; time around. But since we want to go fast, we’d take shortcuts and compromises, which starts the vicious cycle all over again, just in the “new” system.&lt;/p&gt;

&lt;p&gt;Can this be considered a smart business strategy?&lt;/p&gt;

&lt;p&gt;In systems of comparable complexity, &lt;strong&gt;refactoring&lt;/strong&gt; a system gradually towards health and improved design is generally a lower risk and rapid return investment than &lt;strong&gt;rewriting&lt;/strong&gt; it from scratch (though in some cases, the opposite is also true). This is because much of the original investment (and knowledge) can still be valid and preserved and you are not rushed to finish the work for the fear of blocking the creation of new value. Old and new (if refactored carefully), can happily co-exist with every iteration not only creating new value now but also reducing the technical debt that we’ve accumulated along the way. The old can then be decommissioned eventually.&lt;/p&gt;

&lt;p&gt;But I digress a bit, so what’s the solution to minimise this runaway vicious cycle then? We don’t want to wind down our business just because there are constraining forces at work, yes? How do we create a harmonious balance between the short term advantage of taking on debt and the long term stability and resilience of the system? In finance, the bank’s enforcement agents or government penalties will solve that problem for you real quick, but unfortunately in most enterprise software engineering we don’t have that level of “encouragement”.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;How about &lt;strong&gt;Engineering Discipline&lt;/strong&gt;? Sounds obvious but how will it fit in the model? Let’s see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2023/04/image.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--elUDUsn0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image.png%3Fw%3D1024" alt="" width="880" height="717"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Introducing a little bit of discipline (top right corner-ish), can bring back some stability over time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When the velocity starts to drop and more engineering compromises are made to increase it (see the irony here?), engineering discipline can act as a compensating driver that makes sure we reduce the previous compromises before we add new value each cycle. This gradually results in increased engineer motivation and trust, as they don’t need to struggle with bad decisions as much, and this in turn increases engineering velocity and value creation over time. The increase in engineering velocity may not be by leaps and bounds, and not right away, but at least it’s likely to not fall too low in the face of ever increasing business pressures, and the system won’t spiral into madness.&lt;/p&gt;
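
&lt;p&gt;To make these dynamics a little more tangible, here is a toy simulation sketch. It is not part of the original post-it model, just an illustration in C# with completely made-up coefficients, showing how compromises feed technical debt, how debt drags velocity down, and how a non-zero “discipline” factor that pays debt down each cycle keeps the loop from running away:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Toy simulation of the feedback loop described above.
// All coefficients are invented; the point is the shape of the curves, not the numbers.
using System;
using System.Linq;

class TechDebtLoop
{
    static void Main()
    {
        double velocity = 10.0;   // value-add ideas productionised per cycle
        double debt = 0.0;        // accumulated technical debt
        double discipline = 0.3;  // fraction of capacity spent paying debt down (0 = none)

        foreach (int cycle in Enumerable.Range(1, 20))
        {
            double pressure = 1.0 + 0.05 * cycle;                 // business pressure keeps growing
            double compromises = pressure * (1.0 - discipline);   // shortcuts taken this cycle
            debt = debt + compromises - discipline * debt;        // discipline pays some debt down
            velocity = Math.Max(1.0, 10.0 * pressure - 2.0 * debt); // debt drags velocity down

            Console.WriteLine($"cycle {cycle,2}: debt={debt:F1} velocity={velocity:F1}");
        }
    }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Setting &lt;code&gt;discipline&lt;/code&gt; to 0 makes the debt grow without bound and the velocity crash to its floor; even a modest non-zero value lets the system settle into a slower but stable equilibrium.&lt;/p&gt;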

&lt;p&gt;Like I said before, the model is not perfect, no model is, but I think (and hope) it helps put the complexities involved in perspective for both engineers and business stakeholders and clarifies the effect long term accumulation of technical debt can have on business outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering Discipline&lt;/strong&gt; is critical to bringing the system back into equilibrium and this is why it’s important for engineering teams to take control and ownership of this discipline and be proactively on the lookout for variable changes that tend to push the system towards instability. We don’t need permission to do the right thing; we have the engineering expertise and experience to know when and how the right thing should be done because we also understand the long term implications of neglect. Though we do need to communicate these implications to the business in a language they understand, to the extent it’s feasible and possible.&lt;/p&gt;

&lt;p&gt;If you have tried similar tools to communicate the value of addressing technical debt or if you think this model could be made more convincing or more “correct”, please drop a comment! Cheers!&lt;/p&gt;

</description>
      <category>systemsthinking</category>
      <category>communication</category>
      <category>technicaldebt</category>
      <category>models</category>
    </item>
    <item>
      <title>On Software Architecture Decisions, Evolution and Engineering - 4</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 14 Sep 2022 16:40:43 +0000</pubDate>
      <link>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-4-2l3c</link>
      <guid>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-4-2l3c</guid>
      <description>&lt;h2&gt;
  
  
  Our Architectural Decision Making “Process”
&lt;/h2&gt;

&lt;p&gt;What follows below is essentially a written version of a work in progress practice that we’ve been following for a while already in my domain, not just for new engineering work but also for architectural refactoring work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engage with relevant stakeholders to understand the scope and context of the problem space as soon as you can&lt;/li&gt;
&lt;li&gt;Try to come up with up to 3 draft solutions each with pros and cons so we can make informed decisions without rushing with the first idea&lt;/li&gt;
&lt;li&gt;Outline potential risks/failure modes inherent in these options.&lt;/li&gt;
&lt;li&gt;Engage with the business stakeholders and users, to assess the impact of these failures on the business/users and identify the top risks to mitigate. Don’t assume anything! These discussions take the form of “what-if/what-when” type questions (as highlighted in the previous section). This pushes the stakeholders and the users to think deeper about the problem and give us pragmatic and honest responses.&lt;/li&gt;
&lt;li&gt;Based on these conversations about risks, fine-tune the architectural options for the desired quality attributes, for e.g. for reliable messaging we might go for the TRANSACTIONAL OUTBOX pattern (a minimal sketch follows after this list), or if the users have to be kept up to date with data changes then we might opt for the PUSH NOTIFICATION pattern using one of the server push technologies, etc.

&lt;ul&gt;
&lt;li&gt;Additionally, there might be risks that engineering teams perceive as well, for e.g. component complexity, duplication of behaviour, ownership, security, maintainability, testability etc. We also address these risks. It’s important that these mitigations don’t adversely affect the observable business behaviour of the system. For e.g. if we choose to use the SERVERLESS FUNCTION style – in an effort to reduce infrastructure maintenance overhead – to build a solution that should either be limited in concurrency or could be a long running operation, then this choice could either result in the operation failing mid-stream due to a time out or could produce incorrect results. Both affect the observable behaviour of the system from a stakeholder POV, so it is not the right style to apply. We strive for a balance between engineering and business stakeholder expectations because both are important.&lt;/li&gt;
&lt;li&gt;When uncertain, or when we want a second opinion, we reach out to other teams in the organisation to have them give feedback on our design. I’ve created an Architecture Working Group in my organisation whose purpose is this very cross-collaboration and pollination of ideas, to share learnings with each other. We’ve had good feedback from the teams that have participated in these sessions so far, and their understanding is better for it.&lt;/li&gt;
&lt;li&gt;Once we have mitigation plans for the most important risks, we pick the solution that minimises most of them. If two solution options tie, we pick the one that has the lowest implementation complexity and/or financial cost.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Document all the discussed risks and mitigations in ADRs (Architecture Decision Records) and start the engineering iteration&lt;/li&gt;
&lt;li&gt;Repeat for each major product increment or design refactoring work&lt;/li&gt;
&lt;/ul&gt;
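
&lt;p&gt;For illustration, here is a minimal sketch of the TRANSACTIONAL OUTBOX idea mentioned above, assuming a SQL Server database accessed via Microsoft.Data.SqlClient; the table, column and class names are made up and not taken from our actual systems. The essence is that the state change and the outgoing message are committed in the same database transaction, and a separate relay publishes the message afterwards:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal sketch of the TRANSACTIONAL OUTBOX pattern (illustrative names only).
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public sealed class PurchaseOrderService
{
    private readonly string _connectionString;

    public PurchaseOrderService(string connectionString) => _connectionString = connectionString;

    public async Task PlaceOrderAsync(Guid orderId, string supplierCode)
    {
        using var connection = new SqlConnection(_connectionString);
        await connection.OpenAsync();
        using var tx = connection.BeginTransaction();

        // 1. Persist the business state change.
        using var insertOrder = new SqlCommand(
            "INSERT INTO purchase_orders (id, supplier_code) VALUES (@id, @supplier)",
            connection, tx);
        insertOrder.Parameters.AddWithValue("@id", orderId);
        insertOrder.Parameters.AddWithValue("@supplier", supplierCode);
        await insertOrder.ExecuteNonQueryAsync();

        // 2. Record the event in the outbox table in the SAME transaction, so the
        //    message cannot be lost or published without the state change committing.
        var payload = JsonSerializer.Serialize(new { orderId, supplierCode });
        using var insertOutbox = new SqlCommand(
            "INSERT INTO outbox (id, type, payload, occurred_at_utc) VALUES (@id, @type, @payload, @at)",
            connection, tx);
        insertOutbox.Parameters.AddWithValue("@id", Guid.NewGuid());
        insertOutbox.Parameters.AddWithValue("@type", "PurchaseOrderPlaced");
        insertOutbox.Parameters.AddWithValue("@payload", payload);
        insertOutbox.Parameters.AddWithValue("@at", DateTime.UtcNow);
        await insertOutbox.ExecuteNonQueryAsync();

        tx.Commit();
        // 3. A separate background relay polls the outbox table, publishes pending
        //    rows to the message broker, and marks them as sent (not shown here).
    }
}&lt;/code&gt;&lt;/pre&gt;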

&lt;p&gt;The goal is never to think up and address all edge cases that could exist &lt;em&gt;upfront&lt;/em&gt; – that’s just not possible – but it is to think of and address the most pressing ones from both engineering and business points of view. One useful heuristic for uncovering the failure modes, is &lt;strong&gt;to look at the lines connecting the boxes in an architecture diagram&lt;/strong&gt; as opposed to focussing just on the boxes and asking yourself &lt;em&gt;what-when/what-if&lt;/em&gt; questions like, “what happens when this connection fails?” or “what happens when this message gets delivered multiple times and/or out of order?” or “what happens when 2 out of these 3 operations fail?” or “what happens when the box on the other side is not available?” or “what kinds of security risks are we being exposed to by exposing an API to third party?” etc.&lt;/p&gt;

&lt;p&gt;This technique has helped us build more pragmatic designs where we have been able to reduce accidental complexity whilst designing for the essential complexity and critical risks. We’ve also improved our communication with our stakeholders a lot as a result which has led to a lot of good business and technical learning for both business stakeholders and engineers. With all this learning and a mindset for continuous improvement we also hope to keep improving the technical and strategic quality of the products we build. In the end this is what agility is all about!&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Understand the business context, partner with stakeholders early and often and make architectural decisions proportionate to the most critical risks involved.&lt;/li&gt;
&lt;li&gt;Engage in modeling exercises like Event Storming often to build up an understanding of the business process and map it to the software process.&lt;/li&gt;
&lt;li&gt;Assume nothing! Push back against overly confident predictions and assumptions no matter who they come from.&lt;/li&gt;
&lt;li&gt;Product Managers’ job is the what and the why, engineers’ job is the how and both have to collaborate on the when.&lt;/li&gt;
&lt;li&gt;Build for known knowns, plan for known unknowns and adapt when unknown unknowns hit&lt;/li&gt;
&lt;li&gt;Document your decisions, trade-offs and risks diligently and regularly review them to find improvement opportunities.&lt;/li&gt;
&lt;li&gt;Rinse and repeat. You are in this for the long haul!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>decisions</category>
      <category>heuristics</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>On Software Architecture Decisions, Evolution and Engineering - 3</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 14 Sep 2022 16:40:30 +0000</pubDate>
      <link>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-3-1l4o</link>
      <guid>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-3-1l4o</guid>
      <description>&lt;h2&gt;
  
  
  How much architectural work should be done upfront?
&lt;/h2&gt;

&lt;p&gt;Almost every engineer – consciously or sub-consciously – wants to make the best architectural decisions from the get-go (hello!👋). Business will want systems that are 100% consistent, 100% reliable, 100% bug free and cheap as chips (of course!). We want our designs to stand the test of time so that we only ever extend them cleanly by adding new components as opposed to changing the existing components. We are afraid to paint ourselves into a corner with our decisions (in architecture and in life), yet the reality is that &lt;strong&gt;we won’t know today what we will know tomorrow&lt;/strong&gt; so we have to make the best decision for today and keep evolving it as the context evolves. This last part is where many teams fall by the wayside (mine included): all the good intentions labelled “we should fix this!”, albeit with unarticulated consequences, get buried in the bottomless pit that is the JIRA backlog, competing with feature work, because we’ve made peace with the status quo in the absence of irrefutable, cold, hard data that will convince the business to prioritise something.&lt;/p&gt;

&lt;p&gt;This unsettles me, so I will come back to this in a bit!&lt;/p&gt;

&lt;p&gt;But, how do I make the best decision for today? The goal of architecture is to provide a safe, flexible and reliable space to solve real business problems over time. Every problem/solution domain will have certain risks associated with it. If my decisions are reducing those risks in a meaningful way whilst solving the problem, then those are the best decisions for today.&lt;/p&gt;

&lt;p&gt;The book &lt;a href="https://www.amazon.com/Just-Enough-Software-Architecture-Risk-Driven/dp/0984618104/ref=sr_1_1?crid=2FL4L2E9JSJN0&amp;amp;dchild=1&amp;amp;keywords=just+enough+software+architecture&amp;amp;qid=1619986034&amp;amp;sprefix=Just+Enough+Sof%2Caps%2C233&amp;amp;sr=8-1"&gt;Just Enough Software Architecture- A Risk Driven Approach&lt;/a&gt;, re-iterates the well known albeit somewhat flawed risk formula:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Risk = probability of failure x impact of the failure&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In complex distributed systems architecture the probability of failure is never going to be zero and in modern managed cloud systems, reliable and available though they are, we often don’t have much control over them, so the only variable that we can somewhat influence is the impact of that failure.&lt;/p&gt;

&lt;p&gt;In other words, by incorporating faster recovery patterns into the architecture, the impact of a failure can be reduced, thus lowering the overall risk. &lt;strong&gt;For e.g.&lt;/strong&gt; by setting up a dead letter queue and establishing a requeuing policy, the impact of failed messages can be reduced because we can recover from such a failure without losing data. With retries and circuit breakers, we can achieve resilience and fault tolerance between services. Some recovery measures are also built into the business processes, &lt;strong&gt;for e.g.&lt;/strong&gt; if the supplier delivers more stock than was requested in a purchase order, then the warehouse might still accept the stock but make a record of a conflicting delivery, so the impact of incorrect delivery is lowered. Or if we accidentally charge a customer’s credit card twice, we might issue a refund after the fact or a compensatory gift voucher for an equal amount. Needless to say these are edge case scenarios, not the norm!&lt;/p&gt;
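
&lt;p&gt;As an illustration of the “reduce the impact of failure” idea, here is a minimal retry-with-backoff sketch in plain C#. It uses no resilience library; in practice you would more likely reach for something like Polly and combine retries with a circuit breaker and a dead letter queue:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// A minimal retry-with-backoff sketch (plain C#, no resilience library).
using System;
using System.Threading.Tasks;

public delegate Task AsyncOperation();

public static class Retry
{
    public static async Task ExecuteAsync(AsyncOperation operation, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return; // success, nothing more to do
            }
            catch (Exception ex) when (attempt != maxAttempts)
            {
                // Back off exponentially before trying again: 1s, 2s, 4s, ...
                var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt - 1));
                Console.WriteLine($"Attempt {attempt} failed ({ex.Message}), retrying in {delay}...");
                await Task.Delay(delay);
            }
            // When the last attempt also fails, the exception propagates to the caller,
            // which can then park the message on a dead letter queue for later requeuing.
        }
    }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A caller would wrap the risky operation, something like &lt;code&gt;await Retry.ExecuteAsync(() => PublishAsync(message));&lt;/code&gt;, where &lt;code&gt;PublishAsync&lt;/code&gt; stands in for whatever call might fail transiently.&lt;/p&gt;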

&lt;p&gt;Through conversations with the business stakeholders, various risks/failure scenarios can be identified, brainstormed and prioritised. Then its a matter of mitigating the top most risks from both engineering and business pov. Purely engineering risks (for e.g. lack of experience in a specific technology, lack of tooling support, codebase with poor maintainability and evolvability, difficulty of integration with legacy systems, hard to evolve architecture etc), will still have to be mitigated by the engineering team because they may indirectly end up impacting the business process.&lt;/p&gt;

&lt;p&gt;To have these conversations in a meaningful way, the JESA book recommends that the risks be described as a testable failure scenario that the business stakeholders can also relate to, for e.g.:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two or more users editing the same purchase orders near about the same time, could result in one of their changes being lost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason for this is that identifying the &lt;em&gt;failure&lt;/em&gt; then becomes easier, i.e. lost data due to a concurrent update, which in turn makes reasoning about the &lt;em&gt;probability&lt;/em&gt; and the &lt;em&gt;impact&lt;/em&gt; of that failure more concrete. The stakeholders might say, &lt;em&gt;“Oh the users are only allowed to edit purchase orders they created, not any purchase orders so the probability of the failure is low”&lt;/em&gt; and this will also pull down the impact score, or they might say, &lt;em&gt;“Its ok, if they end up stepping on each others’ toes, then they are supposed to resolve it amongst themselves”&lt;/em&gt; and whilst this increases the probability of failure, it reduces the impact of it, thus keeping the overall risk lower.&lt;/p&gt;

&lt;p&gt;But if they say, &lt;em&gt;“Yeah, we should really try and not have that problem happen or at least make the users aware that the purchase order has been changed by someone else so they can decide accordingly. Otherwise we might be communicating incorrect data to suppliers which will affect stock levels in the warehouse negatively&lt;/em&gt;” then we know that both the probability and the impact of this failure are high, thus increasing the overall risk score and this makes it a high priority risk item that needs mitigation in the architecture.&lt;/p&gt;
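
&lt;p&gt;If the stakeholders decide this risk is worth mitigating, one common option is optimistic concurrency: every purchase order carries a version number and an update only succeeds if the version the user originally read is still current. A minimal sketch follows; the table, column and class names are made up, and Microsoft.Data.SqlClient is assumed purely for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal optimistic-concurrency sketch for the "two users edit the same purchase order"
// failure scenario. The UPDATE only succeeds if the row still has the version the user read.
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public sealed class PurchaseOrderRepository
{
    private readonly string _connectionString;

    public PurchaseOrderRepository(string connectionString) => _connectionString = connectionString;

    public async Task UpdateQuantityAsync(Guid orderId, int newQuantity, int expectedVersion)
    {
        using var connection = new SqlConnection(_connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand(
            "UPDATE purchase_orders " +
            "SET quantity = @quantity, version = version + 1 " +
            "WHERE id = @id AND version = @expectedVersion",
            connection);
        command.Parameters.AddWithValue("@quantity", newQuantity);
        command.Parameters.AddWithValue("@id", orderId);
        command.Parameters.AddWithValue("@expectedVersion", expectedVersion);

        int rowsAffected = await command.ExecuteNonQueryAsync();
        if (rowsAffected == 0)
        {
            // Someone else changed the order since this user loaded it. Surface that to
            // the user ("this purchase order has been changed by someone else") instead
            // of silently overwriting their changes.
            throw new InvalidOperationException("Purchase order was modified concurrently.");
        }
    }
}&lt;/code&gt;&lt;/pre&gt;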

&lt;p&gt;Sometimes it won’t be that straightforward because the engineering problem is too far away or too low level from stakeholders’ mental model of what it takes to build a scalable and useful product. The conversation will inevitably hit a translation barrier and you will either get a rejection outright because they don’t understand the value and we often fear what we don’t understand, OR, you will get a disinterested response like, “do whatever you need to do to make it future proof!” without fully appreciating the gravity of that statement.&lt;/p&gt;

&lt;p&gt;This is where I prefer to not bother them with the technical details beyond what’s directly relatable to them but instead focus on what my proposal enables in general terms. &lt;strong&gt;For e.g.&lt;/strong&gt; if I need to get some buy in to refactor one big service into smaller services, I might present the arguments around one part of the business process not going down when this other service goes down or being able to support new use cases more rapidly and improve the overall reliability of the process. Other times, the cost of translating a technical change to business enablement scenario might just be too great which means I’d then lob that change under the umbrella of “engineering maintenance and updates” and that’s usually all the stakeholders will have the patience for anyway!&lt;/p&gt;

&lt;p&gt;Point being, this risk based approach to architecture grounds the decision making in reality and pragmatism rather than in the engineers’ own perception of some hypothetical risk or a dogma driven assessment of it, which could result in over-engineering and not addressing the important risks enough. The only exception here is that if you (engineers and stakeholders) know of typical risks in the domain you are operating in, then you could mitigate those risks upfront. For e.g. in my domain, we’re aware that data demands of the stakeholders could induce tight coupling with specific databases that we don’t own, so we have to pay attention to that engineering risk in our architecture and enforce ownership boundaries a bit more strongly.&lt;/p&gt;

&lt;p&gt;It’s obvious but worth mentioning anyway: no matter how good a decision you think you are making today, it will always have trade-offs and will never be perfect today for everything that will be thrown at it over time. In all the above examples I’ve highlighted, I am gaining something at the cost of something else, for e.g. by adding queues, I am gaining fault tolerance and availability and accepting complexity and a different programming model as trade-offs. By opting for the serverless paradigm I am gaining better scalability, lower maintenance overhead and a smaller application footprint, but accepting the trade-off of the server being a black-box, harder to debug platform level issues, having to design the applications to fit within the resource constrained environments they represent, and having to think about concurrency a lot more than in other more conventional compute environments.&lt;/p&gt;

&lt;p&gt;This is why having a discipline around evolving designs with changing contexts or when you know better is not only critical for a healthy product but also critical for team morale. Teams should feel rewarded for wanting to improve things, not ignored or jerked around because “we have higher priority things to do, we can always do this later!”. Agility is first and foremost about feedback loops and continuous improvement: a sprint is an experiment whose outcomes bring learning and knowledge that help us get better in the next round; this also happens to be the core of &lt;strong&gt;engineering&lt;/strong&gt;! It’s important for Product Management and Engineering to work together as partners in successes and failures, as opposed to one side working against the best placed intentions of the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  How should architecture evolve?
&lt;/h2&gt;

&lt;p&gt;Whilst not a given at any level, it’s often easier for organisations that have a level of leadership support around engineering that helps lay the foundation for good practices and build on them, all the while helping engineers get management buy-in. Some of these practices are codified as organisation wide Engineering Principles and Strategies which teams adopt, improve upon and drive forward. This makes it easier to use these principles as a guidepost for making architectural (refactoring) decisions during or after iterations. As a Principal Engineer, I am always looking to hoist good practices into our org wide engineering principles and create guidelines around some of these.&lt;/p&gt;

&lt;p&gt;At team level I have also helped establish an architectural vision and strategy to help set goals for architectural evolution in a piecemeal fashion. As a follow on, I also help teams do regular architectural reviews and identify areas of improvement going forward. This helps Product Owners understand a proposal enough to put it on the roadmap accordingly. In the end we all want to do the right thing; we only need to be more accommodating of each other’s perspectives. This can be hard at times, and I falter as well, but understanding that we will always have more to do than we have time for, so we’ve got to prioritise, helps me refocus and persevere on critical items.&lt;/p&gt;

&lt;p&gt;Another heuristic I practice and encourage others to do as well, is to always make the best engineering decisions we can for today in accordance with our engineering principles but also document ways to evolve the design in the future based on the signals we get from the context at the time. In other words build for &lt;em&gt;known knowns&lt;/em&gt; but plan for &lt;em&gt;known unknowns&lt;/em&gt; so we can avoid unnecessary and avoidable re-work. Both these methods ultimately aim for continuous improvement – or &lt;em&gt;Kaizen&lt;/em&gt; (as mentioned in the Toyota Production System philosophy)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Command Query Responsibility Unification&lt;/strong&gt;&lt;br&gt;
For one of the product increments we delivered, we made a decision to split the reads and writes between 2 services in a CQRS style architecture to optimise for data reuse. However, over time we had problems where the read side would often "get stuck" on a stale version of the data (story for another post). This led to constant user complaints, degradation of trust and dissatisfaction. &lt;/p&gt;

&lt;p&gt;We &lt;em&gt;had&lt;/em&gt; documented this possibility from the very beginning, so we decided to forego the CQRS style in favour of reads and writes both going to the source of truth service which increased the likelihood of recovery from out of sync data, given both operations are happening against the same database. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes dealing with real failures is the only way to identify which system should evolve, how it should evolve, by how much and what the trade-offs are, but these will be cases where you don’t know what you don’t know (i.e. the unknown unknowns). It can also help put things in perspective for the business and make it want to put corrective actions in place because the consequences become more tangible. It’s like breaking your bones to have them heal stronger and more robust, but I would prefer not to have to break bones in the first place if I can avoid it. It also makes for a very painful long term strategy!😉&lt;/p&gt;

&lt;p&gt;I have had discussions with Product Managers in the past where that was their evolution strategy. I wonder if that’s their philosophy for things like car maintenance or looking after their health. I can’t imagine that they would only change their lifestyle and eating habits when their doctor says, “change or die!”. Would I call this smart? Would you?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Blocked Procs&lt;/strong&gt;&lt;br&gt;
About half a decade ago we (my predecessors, with good intentions I am sure) created several stored procedures in the org wide shared database to be able to do something at regular intervals, and for a long time it kinda worked, until about last May when we suffered a 28 hour long outage of our purchase ordering process. Much of the cause lay in these stored procedures: they updated a large number of rows, which caused long running transactions and excessive row locking, which ultimately caused the process to start timing out without completing. It also started affecting other critical business processes in the organisation. It was very difficult for the team to debug and diagnose the issues with any certainty (some of it has to do with the way we are set up as an org)!&lt;/p&gt;

&lt;p&gt;This helped crystallise the problem enough for the business to want to invest some time into improving this part of the architecture, leveraging an asynchronous event driven paradigm as opposed to a scheduled batch type process that can be harder to recover from. There are other trade-offs in this approach, but with the logic being in application code the odds of faster recovery go up. It also makes the design more explicit and the context boundaries clearer, and domain concepts represented in programming language code, as opposed to buried in a stored procedure, make it easier to evolve the system without affecting others.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;According to Brian Foote and Joseph Yoder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you think good architecture is expensive, try bad architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The way this usually plays out is that management (and sometimes engineers) keep putting off essential refactoring and architectural improvements because they see them as additional cost overhead that doesn’t benefit users, and all the while new stuff is being forced around the convoluted design, leading to an even more difficult to evolve product. Eventually it reaches a point where even a small change causes something else to break entirely, or worse, even changing the colour of a button turns into a multi month project. At this point an external consultant is called in who charges the org a big fat wad of cash and comes to the exact same conclusion the engineering team had been voicing all along, and then the management comes up with a revolutionary idea to fix it all: &lt;strong&gt;REWRITE&lt;/strong&gt; (i.e. before they decide to outsource it all to a low income country in order to save costs, and the whole cycle repeats). True story, I kid you not!&lt;/p&gt;

&lt;p&gt;There are other examples of similar or worse debacles: the &lt;a href="https://www.theatlantic.com/technology/archive/2015/07/the-secret-startup-saved-healthcare-gov-the-worst-website-in-america/397784/"&gt;Healthcare.gov hellscape&lt;/a&gt;, the &lt;a href="https://www.henricodolfing.com/2019/06/project-failure-case-study-knight-capital.html"&gt;Knight Capital nightmare&lt;/a&gt;, &lt;a href="https://www.bbc.com/news/business-34324772"&gt;Volkswagen’s vandalism&lt;/a&gt;, the &lt;a href="https://en.wikipedia.org/wiki/British_Post_Office_scandal"&gt;UK Post Office’s poop-up&lt;/a&gt;? The &lt;a href="https://www.google.com/search?q=software+failure+examples&amp;amp;ei=87sfY7DUJOyM9u8Pn4KswAw&amp;amp;oq=software+failure&amp;amp;gs_lcp=Cgdnd3Mtd2l6EAMYADIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIHCAAQsAMQQzIHCAAQsAMQQ0oECEEYAEoECEYYAFAAWABg_w1oAXABeACAAQCIAQCSAQCYAQDIAQrAAQE&amp;amp;sclient=gws-wiz"&gt;internet&lt;/a&gt; is chock-full of them, if only one cares to learn from them. These issues are not even unique to software, but it happens to be the easier one to mess up because it’s soft and we’re an immature and unregulated industry! But…that is a matter for another post!&lt;/p&gt;

&lt;p&gt;My preference (as is the preference of most professional and responsible software engineers I have spoken to) is to intervene &lt;em&gt;before&lt;/em&gt; we reach these extremes, before we break our bones painfully, before our car conks out in the middle of nowhere and before our heart explodes in our chest. My appeal to all the engineers is to keep pushing and keep challenging assumptions, and not get “influenced” by management speak. You would also need to sharpen that ability to zoom out from the purely technical to the intersection between business and technical to be able to effectively negotiate a positive outcome. This is also hard but you must keep in mind that &lt;strong&gt;&lt;em&gt;as long as the incentives and goals align&lt;/em&gt;&lt;/strong&gt; both sides have negotiating power. The key is to make sure that both sides gain something from the negotiation for it to have any future longevity. I falter here as well at times but I keep reminding myself of why we are doing this.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>decisions</category>
      <category>heuristics</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>On Software Architecture Decisions, Evolution and Engineering - 2</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 14 Sep 2022 16:40:01 +0000</pubDate>
      <link>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-2-2h0g</link>
      <guid>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-2-2h0g</guid>
      <description>&lt;h2&gt;
  
  
  What is an architectural decision?
&lt;/h2&gt;

&lt;p&gt;A decision you and your team make about the “important stuff” or the “architecturally significant” elements of the system you are building. According to Grady Booch, “…significant is determined by the cost of change!” But what is this “cost of change”? I would argue any effort you spend in re-working a solution constitutes “cost of change” for e.g. major refactoring to address technical debt or having to rearchitect the system significantly simply to support new business requirements. Lean manufacturing folks will tell you that rework is waste!&lt;/p&gt;

&lt;p&gt;Sometimes it can be financial costs in addition to re-work cost.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cache In, Cache Out&lt;/strong&gt;&lt;br&gt;
Several years ago, we chose to use a Redis cache cluster on AWS for one of our mission critical services that only runs for a couple of hours a day and doesn't really have low latency requirements. The reasons for that decision are lost to history, but the cost of that decision ended up being monetary because this cache cluster was underutilised and over-paid for. &lt;/p&gt;

&lt;p&gt;We replaced it with a simple DynamoDB table with on-demand pricing which ended up being a more cost effective alternative with no performance penalty (we're talking a few thousand records, so performance is not even a concern here). Doing so of course also came at a cost of rework because we now had to skip expired records explicitly, something which Redis had been taking care of automatically. This being on the critical path also increased the risk.&lt;/p&gt;
&lt;/blockquote&gt;
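
&lt;p&gt;To make that last trade-off concrete, here is a rough sketch of the shape of the explicit expiry check we now own ourselves. The record type and store interface are made up for illustration; in reality the lookup would be a DynamoDB read:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Rough sketch of an explicit expiry check. Redis used to expire keys for us;
// with a plain table we have to treat stale rows as cache misses ourselves.
using System;

public sealed record CachedRate(string Key, decimal Value, DateTime ExpiresAtUtc);

public interface ICachedRateStore
{
    // e.g. a DynamoDB GetItem under the hood; returns null when the key is absent
    CachedRate? Find(string key);
}

public sealed class RateCache
{
    private readonly ICachedRateStore _store;

    public RateCache(ICachedRateStore store) => _store = store;

    public bool TryGet(string key, out decimal value)
    {
        value = 0m;
        var record = _store.Find(key);
        if (record is null)
        {
            return false; // never cached
        }
        if (DateTime.UtcNow >= record.ExpiresAtUtc)
        {
            return false; // present in the table but expired: treat as a cache miss
        }
        value = record.Value;
        return true;
    }
}&lt;/code&gt;&lt;/pre&gt;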

&lt;p&gt;Another thing that can make a decision architecturally significant is its potential effect on the desired quality attributes of the system (i.e. the so called -ilities). If I choose synchronous communication over asynchronous then I might be affecting recoverability, availability or performance of the system due to point-in-time coupling with another service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbniendu7lvkdns1a1i83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbniendu7lvkdns1a1i83.png" alt="Synchronous Communication"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By introducing a queue, however, and making things asynchronous, I improve those characteristics but introduce complexity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yq5ly3joe011uu3ps1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yq5ly3joe011uu3ps1v.png" alt="Asynchronous Communication"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or if I choose to not make a workflow transactional (i.e. all or nothing), then I could be affecting the reliability and recoverability attributes of the system due to the lack of consistency between state changes and side-effects. The independent failure paths there could make it harder to recover from a failure, for e.g. what happens when creating the purchase order fails but the supplier notification succeeds, or vice-versa?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqewxpwrfhooxurzwc5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqewxpwrfhooxurzwc5e.png" alt="Independent Failure Paths"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or if I tightly couple the various components/modules of my application architecture or add more unrelated responsibilities to them, I could be affecting long term maintainability (and all its sub-attributes defined in ISO-25010). If I am rolling my own cryptography libraries then I might be compromising the security of the system!&lt;/p&gt;

&lt;p&gt;Domain model is another architecturally significant decision, perhaps the most crucial of all because it is the thing that expresses the solution to the business problem and lies at the core of your system that everything else depends on! An unsuitable model will therefore affect everything else! Consider the following generic Hexagonal Architecture diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fba5yacxywtxtl5ztplu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fba5yacxywtxtl5ztplu3.png" alt="Hexagonal - Ports and Adapters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The domain model is what drives all the subsequent activities in the system, so if that’s not fit for purpose, those design inadequacies will radiate outward to the rest of the system! It’s like building on top of a shaky foundation!&lt;/p&gt;
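
&lt;p&gt;As a minimal illustration of that picture (the names are invented, not from a real system): the domain model sits in the middle and owns the port, and the adapter at the edge implements it, never the other way around:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal ports-and-adapters sketch: the domain core knows nothing about persistence,
// messaging or HTTP; it only knows about the port (an interface it owns).
using System;
using System.Threading.Tasks;

// --- Domain core ------------------------------------------------------------
public sealed class PurchaseOrder
{
    public Guid Id { get; } = Guid.NewGuid();
    public int Quantity { get; private set; }

    public PurchaseOrder(int quantity)
    {
        if (quantity > 0)
        {
            Quantity = quantity;
        }
        else
        {
            throw new ArgumentOutOfRangeException(nameof(quantity), "Quantity must be positive.");
        }
    }
}

// --- Port: owned by the domain, implemented by the outside world -------------
public interface IPurchaseOrderRepository
{
    Task SaveAsync(PurchaseOrder order);
}

// --- Adapter: lives at the edge, depends on the domain, never the other way --
public sealed class SqlPurchaseOrderRepository : IPurchaseOrderRepository
{
    public Task SaveAsync(PurchaseOrder order)
    {
        // Translate the domain object into SQL / ORM calls here.
        return Task.CompletedTask; // stubbed for the sketch
    }
}&lt;/code&gt;&lt;/pre&gt;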

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Case of the Stuffy Domain Model&lt;/strong&gt;&lt;br&gt;
We once spent 4 weeks doing a domain model rewrite for one of our systems, because we hadn't spent enough time modelling the solution domain appropriately and isolating it from the persistence concerns. Our MVP mindset influenced us into making too many assumptions about the data access patterns, and they ended up leaking all the way to the frontend application. Our System 1 response to changing requirements had been to just stuff everything into the same monolithic "domain model" we had and "fix it later", rather than to take a good look at it and ask, "is it still fit for purpose or should we remodel?". Whilst we managed to complete the rewrite without any outages, we gambled with a huge risk due to the large change surface area of the rewrite.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Time spent understanding and modelling the domain well enough is time well spent. We are not chasing 100% perfection and the model is going to have to evolve, but we must not be sloppy with the design either. There are lots of great domain driven techniques to do this: &lt;a href="https://www.eventstorming.com/" rel="noopener noreferrer"&gt;Event Storming&lt;/a&gt;, &lt;a href="https://learnwardleymapping.com/#learn-more" rel="noopener noreferrer"&gt;Wardley Mapping&lt;/a&gt;, &lt;a href="https://www.domainlanguage.com/ddd/whirlpool/" rel="noopener noreferrer"&gt;DDD Whirlpool&lt;/a&gt;, Context Mapping. It’s worth spending time on some of these.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  All models are wrong, some are useful!
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;George E.P. Box (Journal of the American Statistical Association, 1976)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Service ownership boundaries, in the same vein, are also a hugely significant element of your architecture and one that is more about people than technology. Creating wrong boundaries or worse, not having any boundaries is a cost you will be paying for years to come as systems and dependencies evolve around shoddy boundaries and harden them.&lt;/p&gt;

&lt;p&gt;As a Principal Engineer, I facilitate these practices as much as I can and strive to make boundaries and contexts first class citizens in any engineering discussions that I engage in with teams. It’s a part of our overall engineering strategy, a long term struggle but a worthy one.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>decisions</category>
      <category>heuristics</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>On Software Architecture Decisions, Evolution and Engineering - 1</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 14 Sep 2022 06:00:00 +0000</pubDate>
      <link>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-1-k5d</link>
      <guid>https://forem.com/coolblue/on-software-architecture-decisions-evolution-and-engineering-1-k5d</guid>
      <description>&lt;p&gt;⚠ If you are looking for a quick read, this post is anything but. Make sure you’ve grabbed a ☕ or two before you decide to read this! This topic is very important to me, I have strong opinions about this and I’ve been meaning to put my thoughts down for a long time! Real life case studies and examples appear in quoted text with &lt;em&gt;cheesy&lt;/em&gt; titles.&lt;/p&gt;

&lt;p&gt;In my 15 some odd years of practising software engineering, I have come across two extremes around software architecture and perhaps unsurprisingly so given I cut my teeth in the waterfall era: &lt;em&gt;Big Design Up Front&lt;/em&gt; (no code until the design/architecture is nailed down to the last detail) and &lt;em&gt;No Design Up Front&lt;/em&gt; (we follow agile so we are just going to get started hacking away at the code until something resembling an architecture emerges &lt;em&gt;if at all&lt;/em&gt;, and if it works that’s a bonus!). &lt;em&gt;Move fast and break things&lt;/em&gt; sort of mindset.&lt;/p&gt;

&lt;p&gt;Neither is conducive to the long term health and the value of the system nor to the morale of the teams building them, and MUST be avoided! In the former, you never actually ship anything meaningful and in the latter, you get pwned by your “architecture”! There are some environments where the latter works well enough, start-ups, but here I am talking about my home turf – enterprise software or software systems built at large organisations with often bureaucratic management practices and mis-aligned incentives between engineering and business teams.&lt;/p&gt;

&lt;p&gt;My current philosophy is that deliberate and thoughtful architectural decisions that solve real business problems &lt;strong&gt;and&lt;/strong&gt; mitigate enough of the real risks are the RIGHT and MIDDLE PATH way to go. In this post I want to explore the following questions and demonstrate, using examples, how we make risk driven architectural decisions in my domain at work and how I, as a Principal Engineer, facilitate this practice amongst my teams, in the hope that it will resonate with others as well. There are also some helpings of grumpy old man-isms sprinkled here and there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is an architectural decision ?&lt;/li&gt;
&lt;li&gt;How much architectural work should be done upfront ?&lt;/li&gt;
&lt;li&gt;How should architecture evolve?&lt;/li&gt;
&lt;li&gt;Our Architectural Decision Making “Process”&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>softwareengineering</category>
      <category>decisions</category>
      <category>heuristics</category>
    </item>
    <item>
      <title>Using C# Source Generators to Generate Data Transfer Objects (DTOs)</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Thu, 29 Jul 2021 05:24:39 +0000</pubDate>
      <link>https://forem.com/coolblue/using-c-source-generators-to-generate-data-transfer-objects-dtos-5gbl</link>
      <guid>https://forem.com/coolblue/using-c-source-generators-to-generate-data-transfer-objects-dtos-5gbl</guid>
<description>&lt;p&gt;For many enterprise applications, there would normally be a split between domain entities that live inside the application core and DTOs that are exposed to the outside world, e.g. as outgoing data structures from a web service. Often these structures are symmetric to domain entities, i.e. they contain all of the same properties that domain entities contain but &lt;em&gt;most likely&lt;/em&gt; none of the logic. There are two problems with this pattern, however:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This tends to be a very manual and time-consuming task that doesn't add much value&lt;/li&gt;
&lt;li&gt;The code you have to write to map from the domain entities to these DTOs tends to be repetitive and error-prone. Mistakes can easily occur, leading to broken contracts with the consumers of the DTO. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using C# Source Generators to generate DTOs could &lt;em&gt;potentially&lt;/em&gt; save a lot of developer time, so in this post I am going to attempt to write just such a generator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DISCLAIMERS&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This may not be the most performant or the most sensible way to write source generators, so if you know of a better way, please by all means comment away!&lt;/li&gt;
&lt;li&gt;The code shown here will likely be hard to follow at times, especially in the second half of part 2; it's OK if you don't follow all of it. You are essentially writing code within code, so it &lt;strong&gt;is&lt;/strong&gt; a bit messy by design. I will put the fully refactored code on GitHub with a readme that describes some of the Roslyn APIs, so that should help a little.&lt;/li&gt;
&lt;li&gt;The whole thing is a bit of an experiment; I've not used this code in any production application yet. I intend to at some point, to get some real feedback on its viability, but nothing as yet. For now I am really curious to see what's possible and how far I am willing and/or comfortable to push it.&lt;/li&gt;
&lt;li&gt;The implementation is a bit opinionated so it will not cater for all edge cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ASSUMPTIONS&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No inheritance relationship between domain entity classes. All entity classes are therefore assumed to be at the same level of hierarchy.&lt;/li&gt;
&lt;li&gt;Either the whole entity class will be mapped to a DTO or not at all; excluding individual properties from being mapped, whilst possible, is not in the current scope.&lt;/li&gt;
&lt;li&gt;DTOs are assumed to be outgoing, say from a web service or a REST API. Persistence-related models are out of scope for now because they might require additional data access related code to be added to the DTOs, e.g. ORM-specific attributes, access modifiers etc., which is too much responsibility for a generic code generator and would make things too complicated. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What is source generation? &lt;/h2&gt;

&lt;p&gt;At its most basic, source generation is exactly what it sounds like: auto-generated code. There are two main types: &lt;em&gt;compile time&lt;/em&gt; and &lt;em&gt;post-compilation or IL emitting&lt;/em&gt;. The former ties into the existing build toolchain; the latter is a bit more challenging because you need to know IL, and emitting correct IL is hard to get right. I could imagine both being relatively difficult to test using conventional testing techniques; usually a lot of trial and error is involved and there is often limited support for debugging. Source generation is not a new concept, it's been around for a long time in the form of tools like T4, PostSharp, Fody etc., but I have never used these tools in the past...well...except for maybe T4 several years ago...once.&lt;/p&gt;

&lt;p&gt;C# Source Generators allow emitting C# code during the compilation process and including the emitted code in the rest of the build process such that it builds along with the rest of your code. The compiler passes control to the &lt;code&gt;ISourceGenerator&lt;/code&gt; implementation to add code to the syntax tree and the emitted code is then included in the rest of the compilation process as normal. This &lt;a rel="noreferrer noopener" href="https://devblogs.microsoft.com/dotnet/introducing-c-source-generators/"&gt;blog post&lt;/a&gt; goes into a lot of the "whats" and the "whys" so I am not going to.&lt;/p&gt;

&lt;h2&gt;Compiler 101&lt;/h2&gt;

&lt;p&gt;The C# compiler creates two models from the code you write: the &lt;em&gt;syntactic&lt;/em&gt; &lt;em&gt;model&lt;/em&gt;, i.e. how the code is structured in terms of tokens (e.g. starts with an access modifier, then return type, then identifier, then parenthesis etc.), basically your .cs file; and the &lt;em&gt;semantic&lt;/em&gt; &lt;em&gt;model&lt;/em&gt;, i.e. what the code means (e.g. what is a property? what are the type arguments of a generic type?). During the compilation process, the parser parses the code into a syntax tree and generates the semantic model for this tree; these two models are then passed to your source generator to scan, and to emit code based on criteria you define, e.g. generate DTOs for all entity classes decorated with a certain attribute.&lt;/p&gt;

&lt;h2&gt;Generating DTOs from Basic Domain Entities (only primitive types)&lt;/h2&gt;

&lt;p&gt;First things first, I will add a console app (call it &lt;code&gt;ConsoleApp9&lt;/code&gt;) to a new solution where I can define my domain entities, and later will reference the source generator to do some code generation.&lt;/p&gt;

&lt;p&gt;For this blog post I will create a basic &lt;code&gt;Employee&lt;/code&gt; domain entity that looks like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
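&lt;p&gt;&lt;em&gt;A minimal sketch of what such an entity might look like; only the &lt;code&gt;Id&lt;/code&gt; property is referenced later in this post, the others are purely illustrative:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch: a plain domain entity with only primitive-typed properties
public class Employee
{
    public Guid Id { get; set; }
    public string FullName { get; set; }      // illustrative
    public DateTime DateOfBirth { get; set; } // illustrative
}
&lt;/code&gt;&lt;/pre&gt;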


&lt;p&gt;The corresponding DTOs might look something like this - basically just open property bags:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
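&lt;p&gt;&lt;em&gt;Sketched out, such a DTO simply mirrors the entity above property for property:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch: an open property bag with the same shape as the entity and no behaviour
public class EmployeeDto
{
    public Guid Id { get; set; }
    public string FullName { get; set; }
    public DateTime DateOfBirth { get; set; }
}
&lt;/code&gt;&lt;/pre&gt;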


&lt;h2&gt;How do I get from the domain to DTO ?&lt;/h2&gt;

&lt;p&gt;The next thing I've got to do is create a new .NET Standard 2.0 class library project in the solution I created earlier. This library project can then be shipped as a NuGet package later. &lt;strong&gt;NOTE&lt;/strong&gt;: I do need all those &lt;em&gt;Microsoft.CodeAnalysis.*&lt;/em&gt; packages to create a source generator. The C# language version has to be &lt;code&gt;latest&lt;/code&gt; and, in order to see what files the compiler outputs, I'll set the &lt;code&gt;EmitCompilerGeneratedFiles&lt;/code&gt; property to &lt;code&gt;true&lt;/code&gt; and specify a folder for the generated files to go into (last two lines in the following csproj snippet).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
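&lt;p&gt;&lt;em&gt;Roughly, such a csproj might look like the sketch below; the package versions and project layout are illustrative, and the last two properties are the ones mentioned above:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;Project Sdk="Microsoft.NET.Sdk"&amp;gt;
  &amp;lt;ItemGroup&amp;gt;
    &amp;lt;PackageReference Include="Microsoft.CodeAnalysis.CSharp" Version="3.9.0" PrivateAssets="all" /&amp;gt;
    &amp;lt;PackageReference Include="Microsoft.CodeAnalysis.Analyzers" Version="3.3.2" PrivateAssets="all" /&amp;gt;
  &amp;lt;/ItemGroup&amp;gt;
  &amp;lt;PropertyGroup&amp;gt;
    &amp;lt;TargetFramework&amp;gt;netstandard2.0&amp;lt;/TargetFramework&amp;gt;
    &amp;lt;LangVersion&amp;gt;latest&amp;lt;/LangVersion&amp;gt;
    &amp;lt;!-- dump the generated files to disk so they can be inspected --&amp;gt;
    &amp;lt;EmitCompilerGeneratedFiles&amp;gt;true&amp;lt;/EmitCompilerGeneratedFiles&amp;gt;
    &amp;lt;CompilerGeneratedFilesOutputPath&amp;gt;Generated&amp;lt;/CompilerGeneratedFilesOutputPath&amp;gt;
  &amp;lt;/PropertyGroup&amp;gt;
&amp;lt;/Project&amp;gt;
&lt;/code&gt;&lt;/pre&gt;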


&lt;p&gt;Now I need to create an implementation of the &lt;code&gt;ISourceGenerator&lt;/code&gt; interface in this project and decorate it with the &lt;code&gt;Generator&lt;/code&gt; attribute for the compilation process to pick it up as a source generator.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
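&lt;p&gt;&lt;em&gt;A bare-bones skeleton of that implementation (the class name is a placeholder of mine):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using Microsoft.CodeAnalysis;

[Generator]
public class DtoGenerator : ISourceGenerator
{
    public void Initialize(GeneratorInitializationContext context)
    {
        // The syntax receiver gets registered here (shown a little further down)
    }

    public void Execute(GeneratorExecutionContext context)
    {
        // This is where the DTO source is built and added to the compilation,
        // e.g. via context.AddSource("EmployeeDto.g.cs", sourceText)
    }
}
&lt;/code&gt;&lt;/pre&gt;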


&lt;p&gt;To help the source generator find the types that need converting to DTOs, I will decorate my domain types with a custom attribute, which I will create in the source generator project:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
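&lt;p&gt;&lt;em&gt;A sketch of what such a marker attribute might look like:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;

// Marker attribute: any class or struct carrying this gets a DTO generated for it
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Struct)]
public sealed class GenerateMappedDtoAttribute : Attribute
{
}
&lt;/code&gt;&lt;/pre&gt;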


&lt;p&gt;and then tack it onto the domain entities that I want DTOs for:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
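&lt;p&gt;&lt;em&gt;In sketch form, that's as simple as:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[GenerateMappedDto]
public class Employee
{
    // ...properties as before
}
&lt;/code&gt;&lt;/pre&gt;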


&lt;p&gt;In order to tell my source generator which classes to generate code for, I also need to implement &lt;code&gt;ISyntaxContextReceiver&lt;/code&gt; and register it with the source generator (in the &lt;code&gt;Initialize()&lt;/code&gt; method). This receiver is kind of a hook that the compiler calls into as it traverses the syntax tree node by node:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
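&lt;p&gt;&lt;em&gt;A sketch of what that receiver might look like (the class name is a placeholder of mine):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Collects every class/struct declaration carrying the GenerateMappedDto attribute
public class MappedDtoSyntaxReceiver : ISyntaxContextReceiver
{
    public List&amp;lt;TypeDeclarationSyntax&amp;gt; CandidateTypes { get; } = new List&amp;lt;TypeDeclarationSyntax&amp;gt;();

    public void OnVisitSyntaxNode(GeneratorSyntaxContext context)
    {
        if (context.Node is TypeDeclarationSyntax typeDeclaration &amp;amp;&amp;amp;
            typeDeclaration.AttributeLists
                .SelectMany(list =&amp;gt; list.Attributes)
                .Any(attribute =&amp;gt; attribute.Name.ToString().Contains("GenerateMappedDto")))
        {
            CandidateTypes.Add(typeDeclaration);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;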


&lt;p&gt;During each visit, I will check to see if the &lt;code&gt;TypeDeclarationSyntax&lt;/code&gt; node (e.g. a class or a struct) is decorated with the &lt;code&gt;GenerateMappedDto&lt;/code&gt; attribute and if it is, I will simply add it to a list. That done, it's time to register this receiver with the source generator:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
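&lt;p&gt;&lt;em&gt;In sketch form, the registration inside &lt;code&gt;Initialize()&lt;/code&gt; looks something like:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void Initialize(GeneratorInitializationContext context)
{
    // Wire the receiver up so the compiler feeds it every syntax node it visits
    context.RegisterForSyntaxNotifications(() =&amp;gt; new MappedDtoSyntaxReceiver());
}
&lt;/code&gt;&lt;/pre&gt;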


&lt;p&gt;&lt;code&gt;TypeDeclarationSyntax&lt;/code&gt; is the base type of both &lt;code&gt;ClassDeclarationSyntax&lt;/code&gt; and &lt;code&gt;StructDeclarationSyntax&lt;/code&gt;, so it covers both container types and allows generating DTOs for both. Once the entire syntax tree has been visited, control is transferred back to the source generator and the compiler passes it both the syntactic model and the semantic model, using which I can write out the &lt;em&gt;actual&lt;/em&gt; code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There is quite a bit going on here so I will unpack it. I am using the semantic model for the most part because that gives me a richer set of information about the code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I'd also like, for simplicity reasons, to put the DTOs in a "Dtos" namespace under the main domain namespace. So if the entity is in &lt;code&gt;MyApp.Domain&lt;/code&gt; then the DTOs will be in &lt;code&gt;MyApp.Domain.Dtos&lt;/code&gt; (&lt;em&gt;Line 17&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;I would like the DTOs to follow the naming convention "{Entity class name}Dto". E.g. the &lt;code&gt;Employee&lt;/code&gt; entity class will have a DTO class named &lt;code&gt;EmployeeDto&lt;/code&gt;. (&lt;em&gt;Line 20&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;I am importing some generic framework libraries. Nothing fancy about that (&lt;em&gt;Lines 23-25&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Then comes the meat of the class body (&lt;em&gt;Lines 32-41&lt;/em&gt;) (more on this in a sec!)&lt;/li&gt;
&lt;li&gt;Close class and namespace definitions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So what's the meat here?&lt;/p&gt;

&lt;p&gt;Essentially, each property in the DTO will be mapped to the corresponding property in the domain entity, with the same name and type. So I am looping over all the &lt;code&gt;PropertyDeclarationSyntax&lt;/code&gt; nodes in the domain entity syntax model and emitting corresponding properties. In the &lt;code&gt;BuildDtoProperty()&lt;/code&gt; method, I am once again using the semantic model to get more information about the property type, which is available via the &lt;code&gt;property.Type&lt;/code&gt; &lt;em&gt;property&lt;/em&gt;. Then I've just added a simple extension method to get the condensed name of the type, e.g. &lt;code&gt;Guid&lt;/code&gt; instead of &lt;code&gt;System.Guid&lt;/code&gt; (just personal readability preference).&lt;/p&gt;

&lt;p&gt;It literally emits: &lt;code&gt;public Guid Id {get;set;}&lt;/code&gt;&lt;/p&gt;
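&lt;p&gt;&lt;em&gt;The idea behind that property emission, in very condensed sketch form (the method body and names are my own approximation, not the original gist):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;static string BuildDtoProperty(PropertyDeclarationSyntax property, SemanticModel semanticModel)
{
    // Resolve the declared type through the semantic model to get richer type information
    var typeSymbol = semanticModel.GetTypeInfo(property.Type).Type;
    var condensedTypeName = typeSymbol.Name; // "Guid" rather than "System.Guid"
    return $"public {condensedTypeName} {property.Identifier.Text} {{ get; set; }}";
}
&lt;/code&gt;&lt;/pre&gt;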

&lt;blockquote&gt;&lt;p&gt;BTW, all these various syntax/symbol classes are a part of the Roslyn C# syntax API that you can browse &lt;a rel="noreferrer noopener" href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.csharp.syntax?view=roslyn-dotnet-3.9.0"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;That...is basically it for a very minimal (&lt;em&gt;read: limitedly useful&lt;/em&gt;) DTO generator!&lt;/p&gt;

&lt;h2&gt;How do I actually use this generator in my application?&lt;/h2&gt;

&lt;p&gt;Remember our &lt;code&gt;ConsoleApp9&lt;/code&gt; that we added earlier? I will now add to it a &lt;strong&gt;project reference &lt;/strong&gt;to the source generator project:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
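&lt;p&gt;&lt;em&gt;Approximately, the relevant bits of the &lt;code&gt;ConsoleApp9.csproj&lt;/code&gt; look like this (project and file paths are illustrative):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;ItemGroup&amp;gt;
  &amp;lt;!-- temporary: link the attribute source file until it ships inside the NuGet package --&amp;gt;
  &amp;lt;Compile Include="..\DtoSourceGenerator\GenerateMappedDtoAttribute.cs" Link="GenerateMappedDtoAttribute.cs" /&amp;gt;

  &amp;lt;ProjectReference Include="..\DtoSourceGenerator\DtoSourceGenerator.csproj"
                    OutputItemType="Analyzer"
                    ReferenceOutputAssembly="false" /&amp;gt;
&amp;lt;/ItemGroup&amp;gt;
&lt;/code&gt;&lt;/pre&gt;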


&lt;p&gt;Two things of note here: &lt;/p&gt;

&lt;p&gt;1) I have added the &lt;code&gt;GenerateMappedDtoAttribute&lt;/code&gt; class as a linked file temporarily; eventually this would be part of the NuGet package so the linked file reference can be removed from the &lt;code&gt;ConsoleApp9.csproj&lt;/code&gt;, and &lt;/p&gt;

&lt;p&gt;2) the project reference doesn't reference the output assembly and also sets the item type to &lt;code&gt;Analyzer&lt;/code&gt;. The former will make sure that any transitive dependencies of the source generator project don't get added as references of the console app project itself (&lt;em&gt;this attribute is not needed when adding a &lt;code&gt;PackageReference&lt;/code&gt;&lt;/em&gt;) and the latter will make it appear as an analyzer under Dependencies, which is also where I can see the generated DTO.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2021%2F07%2Fimage.png%3Fw%3D468" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now I am all set to generate my very first DTO automagically! I'll just hit Ctrl+Shift+B in Visual Studio (or do &lt;code&gt;dotnet build&lt;/code&gt; from the CLI) to build the solution! If you are doing this for the first time you might notice that nothing seems to have changed in &lt;code&gt;ConsoleApp9&lt;/code&gt;! 🤔 No DTOs in sight, nothing, and if you are unlucky there might be some build errors to boot as well. What's that about? &lt;/p&gt;

&lt;p&gt;Well, this is where we might want to pay heed to the &lt;a rel="noreferrer noopener" href="https://devblogs.microsoft.com/dotnet/introducing-c-source-generators/#why-do-i-not-get-intellisense-for-generated-code-why-does-visual-studio-say-theres-an-error-even-though-it-builds"&gt;recommendation&lt;/a&gt; of source generator creators:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You will need to restart Visual Studio after building the source generator to make errors go away and IntelliSense appear for generated source code. After you do that, things will work. Currently, Visual Studio integration is very early on. This current behavior will change in the future so that you don’t need to restart Visual Studio.&lt;/p&gt;
&lt;cite&gt;- Microsoft&lt;/cite&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don't think VS Code suffers from this issue, and I haven't tried this on Rider; unfortunately, my trial expired before I embarked on source generators, so that mystery will stay a mystery for now!&lt;/p&gt;

&lt;p&gt;Once I do the proverbial "turn it off and on again", I see the DTO light up in the consuming project and IntelliSense should also pick it up. Now bear in mind, these generated DTOs don't get checked into source control because...well...they get generated at build time!😁 But they &lt;strong&gt;will&lt;/strong&gt; get packaged into my application binaries during deployment, so I can rest easy!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2021%2F07%2Fimage-1.png%3Fw%3D494" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How do I Nuget-ify my source generators?&lt;/h2&gt;

&lt;p&gt;For this I will modify the csproj of the source generator to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a package version (come on, we're not animals!)&lt;/li&gt;
&lt;li&gt;Instruct the &lt;code&gt;dotnet pack&lt;/code&gt; command to put the analyser in a pre-designated folder in the generated NuGet package (&lt;em&gt;I ended up spending several frustrating minutes trying to figure out why the analyser was not showing up in my consuming project, and this turned out to be the missing piece! Now you know as well!&lt;/em&gt;)&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
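&lt;p&gt;&lt;em&gt;A sketch of those two changes in the generator's csproj; &lt;code&gt;analyzers/dotnet/cs&lt;/code&gt; is the folder the tooling expects analyzers and source generators to live in:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;PropertyGroup&amp;gt;
  &amp;lt;Version&amp;gt;1.0.0&amp;lt;/Version&amp;gt;
  &amp;lt;!-- the generator dll shouldn't go into lib/, only into the analyzer folder --&amp;gt;
  &amp;lt;IncludeBuildOutput&amp;gt;false&amp;lt;/IncludeBuildOutput&amp;gt;
&amp;lt;/PropertyGroup&amp;gt;

&amp;lt;ItemGroup&amp;gt;
  &amp;lt;None Include="$(OutputPath)\$(AssemblyName).dll"
        Pack="true"
        PackagePath="analyzers/dotnet/cs"
        Visible="false" /&amp;gt;
&amp;lt;/ItemGroup&amp;gt;
&lt;/code&gt;&lt;/pre&gt;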


&lt;p&gt;Once this package is pushed to a NuGet feed, I can replace &lt;code&gt;ProjectReference&lt;/code&gt; with &lt;code&gt;PackageReference&lt;/code&gt; in my &lt;code&gt;ConsoleApp9&lt;/code&gt; project, remove the &lt;code&gt;ReferenceOutputAssembly&lt;/code&gt; attribute from it and I'm off to the races! Now whenever I add a new property to my domain entity, all I have to do is build the project and the DTO will be automatically updated. That's a lot of developer time saved, potentially!&lt;/p&gt;

&lt;h2&gt;What's missing?&lt;/h2&gt;

&lt;p&gt;As cool as this was, the generator is very basic in that it doesn't support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Properties with complex types, e.g. the Employee class having an &lt;code&gt;Address&lt;/code&gt; type property which in turn could be composed of primitive types.&lt;/li&gt;
&lt;li&gt;Properties with generic types with one or more primitive type arguments, e.g. the &lt;code&gt;Employee&lt;/code&gt; class having an &lt;code&gt;IReadOnlyCollection&amp;lt;string&amp;gt;&lt;/code&gt; property called Dogs (&lt;em&gt;maybe for some godforsaken reason we want to track the names of their dogs! I know you pooped on the carpet, Winston!&lt;/em&gt; 🤷‍♂️)&lt;/li&gt;
&lt;li&gt;Properties with generic types with one or more complex type arguments, e.g. the &lt;code&gt;Employee&lt;/code&gt; class having an &lt;code&gt;IReadOnlyCollection&amp;lt;CompanyAsset&amp;gt;&lt;/code&gt; property called AssetsAllocated, or some completely made up property of type &lt;code&gt;Dictionary&amp;lt;int,Spaceship&amp;gt;&lt;/code&gt;, and&lt;/li&gt;
&lt;li&gt;Generating mapping extension methods to convert entities to DTOs. This could potentially be a bigger win in terms of time saving!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These I will cover in the final part of this blog post!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Header image &lt;a href="https://www.henkla.se/deep-dive-into-c-9/" rel="noreferrer noopener"&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Checkout part 2 &lt;a href="https://dev.to/coolblue/using-c-source-generators-to-generate-data-transfer-objects-dtos-part-2-4536"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>sourcegenerators</category>
      <category>automation</category>
      <category>developerproductivity</category>
    </item>
    <item>
      <title>Using C# Source Generators to Generate Data Transfer Objects (DTOs) – Part 2</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 28 Jul 2021 10:00:00 +0000</pubDate>
      <link>https://forem.com/coolblue/using-c-source-generators-to-generate-data-transfer-objects-dtos-part-2-4536</link>
      <guid>https://forem.com/coolblue/using-c-source-generators-to-generate-data-transfer-objects-dtos-part-2-4536</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/coolblue/using-c-source-generators-to-generate-data-transfer-objects-dtos-5gbl"&gt;part 1&lt;/a&gt;, I created a very basic DTO generator that could only work with primitive types. In this final and very looong part, I will try and extend it to be more useful by supporting generic types, complex types and generating mapping methods.&lt;/p&gt;

&lt;p&gt;First though I am going to tackle the mapping extension methods because that can enhance the usability of the current generator quite a bit with minimal work (ye’ old 80/20 rule). What I am after is something that looks like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
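&lt;p&gt;&lt;em&gt;Something along these lines; a hand-written sketch of the shape I want the generated code to have (property names match the illustrative entity from part 1):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public static class EmployeeMappingExtensions
{
    public static EmployeeDto ToDto(this Employee employee)
    {
        return new EmployeeDto
        {
            Id = employee.Id,
            FullName = employee.FullName,
            DateOfBirth = employee.DateOfBirth
        };
    }
}
&lt;/code&gt;&lt;/pre&gt;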


&lt;p&gt;This is probably not an uncommon mapping method; I have written tons of mappers like this and from experience I can say unequivocally that they never get much smarter than this, with the exception of null input handling. There shouldn’t be any smarts in the DTOs or the mappers anyway; that’s an anti-pattern and a design smell, because DTOs are only meant as data vessels that get serialised over the network. Nothing more!&lt;/p&gt;

&lt;p&gt;To keep things clean, I will remove the code that I had already written for the basic generator and simply add code to the end:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
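&lt;p&gt;&lt;em&gt;A condensed sketch of that emission code; &lt;code&gt;dtoNamespace&lt;/code&gt;, &lt;code&gt;dtoTypeName&lt;/code&gt;, &lt;code&gt;entityTypeName&lt;/code&gt; and &lt;code&gt;typeDeclaration&lt;/code&gt; are assumed to have been worked out earlier in the generator:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// A "-"-stripped Guid keeps the extension class names from clashing with each other
var uniqueSuffix = Guid.NewGuid().ToString().Replace("-", string.Empty);
var source = new StringBuilder();
source.AppendLine($"namespace {dtoNamespace}");
source.AppendLine("{");
source.AppendLine($"    public static class MappingExtensions{uniqueSuffix}");
source.AppendLine("    {");
source.AppendLine($"        public static {dtoTypeName} ToDto(this {entityTypeName} entity)");
source.AppendLine("        {");
source.AppendLine($"            return new {dtoTypeName}");
source.AppendLine("            {");

// Convention over configuration: DTO properties share name and type with the entity's
foreach (var property in typeDeclaration.Members.OfType&amp;lt;PropertyDeclarationSyntax&amp;gt;())
{
    source.AppendLine($"                {property.Identifier.Text} = entity.{property.Identifier.Text},");
}

source.AppendLine("            };");
source.AppendLine("        }");
source.AppendLine("    }");
source.AppendLine("}");
&lt;/code&gt;&lt;/pre&gt;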


&lt;p&gt;Much of the code should be pretty self-explanatory; I am simply generating a static class with an extension method in it to convert from the domain entity to the DTO, but let’s unpack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I am all for contextual names for classes and functions etc., but in this case if I just give the class the name &lt;code&gt;EntityExtensions&lt;/code&gt; or something along those lines, then the names will clash with the other extension classes that I will create for other complex types later on. It’s possible to put all extension methods in one class but for now, I’d rather keep them per DTO. The impact on compilation should be minimal so there is little incentive to bung them all in one class. Therefore, I am just going to append a “-”-stripped Guid to the class name so they are all unique.&lt;/li&gt;
&lt;li&gt;Next I will define the signature of the extension method which accepts an instance of the domain entity type and returns an instance of DTO type. The &lt;code&gt;TypeDeclarationSyntax&lt;/code&gt; instance will give me the name of the domain entity type I am creating the extension method on.&lt;/li&gt;
&lt;li&gt;Then I am going to loop over all the property members of the current domain entity type and add assignment statements that copy values from the domain entity properties into the corresponding DTO properties. Once again, this is driven by convention as opposed to configuration, i.e. the properties on the DTO are assumed to have the same name and type as the corresponding properties in the corresponding domain entities. This will ensure type safety and keep the generation code simple.&lt;/li&gt;
&lt;li&gt;Finally, I close out the method, class and namespace. Note that I am adding the extension class and methods to the same namespace as that of the DTO for simplicity reasons.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once this is done, I will build the solution and inspect my consuming app &lt;code&gt;ConsoleApp9&lt;/code&gt; for any generated code and sure enough, I see it (if the build succeeded):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-2.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LMooUeQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-2.png%3Fw%3D543" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that I didn’t have to restart Visual Studio for these changes to reflect. Turns out if you create a new source generator, i.e. for the first time, and do a build, VS picks it up. It’s only any subsequent changes you might make to the types or the generated code that it needs to be restarted for.&lt;/p&gt;

&lt;p&gt;The generated code also looks correct; a successful build is a good indicator that the code is syntactically correct, because the original build would have failed if I had made a typo whilst generating code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-5.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ybKypvIv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-5.png%3Fw%3D936" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can easily show this, I will fudge up a semi-colon in the &lt;code&gt;return&lt;/code&gt; statement and &lt;strong&gt;re-build&lt;/strong&gt; the solution (normal build will not throw up errors):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-3.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yl7aMeu_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-3.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But when I go to the generated entity, the semi-colon is still there!!🤔 Of course, I need to restart VS to see that, don’t I? 💡🤦‍♂️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-4.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NkEubQ4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-4.png%3Fw%3D955" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now I can start using this mapper from my consumer app because the &lt;code&gt;ToDto&lt;/code&gt; extension method just magically appears (&lt;em&gt;that’s not to say I don’t need to import the &lt;code&gt;ConsoleApp9.Domain.Dtos&lt;/code&gt; namespace where all this generated code lives, I absolutely do, but I will let ReSharper and/or IntelliSense help me do that&lt;/em&gt;!):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-6.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yUmQzMoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-6.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just to make sure it works as well as it looks, I will simply JSON-ify the DTO (&lt;em&gt;the ultimate destiny for almost all DTOs anyway&lt;/em&gt;) and dump it on the console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-7.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V_osUeK7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-7.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks like it!&lt;/p&gt;

&lt;p&gt;So far so good! I’ve got the basic DTO and mapper working but I am not out of the woods yet. Say the domain now asks me to record an employee’s address. For this I will create a value type &lt;code&gt;Address&lt;/code&gt; and add a nullable property of that type to the &lt;code&gt;Employee&lt;/code&gt; domain entity (it’s not required to have a home address right from the start; an employee can always add their home address once they have a permanent place to stay):&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
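&lt;p&gt;&lt;em&gt;Sketched out, that change looks roughly like this (the &lt;code&gt;Address&lt;/code&gt; members are illustrative):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[GenerateMappedDto]
public struct Address
{
    public string Street { get; set; }
    public string City { get; set; }
    public string PostCode { get; set; }
}

[GenerateMappedDto]
public class Employee
{
    public Guid Id { get; set; }
    // ...other properties as before

    // nullable: an employee doesn't need a home address from day one
    public Address? HomeAddress { get; set; }
}
&lt;/code&gt;&lt;/pre&gt;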


&lt;p&gt;I will just do a quick re-build at this stage to see what the generator outputs (if anything):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-11.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WxcBK99I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-11.png%3Fw%3D452" alt=""&gt;&lt;/a&gt; The new types have been added! so, yay! &lt;/p&gt;

&lt;p&gt;If I open the EmployeeDto class, at first blush everything seems fine! But there are two problems, both highlighted in orange:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-12.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JY7eOgkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-12.png%3Fw%3D877" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The DTO mis-identified the type of the &lt;code&gt;HomeAddress&lt;/code&gt; property as &lt;code&gt;Address?&lt;/code&gt; as opposed to &lt;code&gt;AddressDto?&lt;/code&gt;, and&lt;/li&gt;
&lt;li&gt;The mapping function is directly assigning the entity property to the DTO property, which will not work since the type is a complex type and will need to be further converted to a DTO. Due to the mis-identification of the property type in problem 1, the build also didn’t fail, because the mapper is assigning a property of an assignable type, i.e. &lt;code&gt;Address?&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To fix these I essentially need to:&lt;/p&gt;

&lt;p&gt;a. Detect if the type of the domain entity property being evaluated is a complex type or not.&lt;/p&gt;

&lt;p&gt;b. If it’s a complex type, then: 1) use &lt;code&gt;AddressDto?&lt;/code&gt; as the property type (for nullable types) instead of &lt;code&gt;Address?&lt;/code&gt;, and 2) instead of directly assigning (as I had been doing thus far), invoke the corresponding ToDto() method on the domain entity property. This will convert &lt;code&gt;Address&lt;/code&gt; to &lt;code&gt;AddressDto&lt;/code&gt;, for example. Otherwise, do what I am doing currently because the property is not a complex type.&lt;/p&gt;

&lt;p&gt;For determining if the type is a complex type or not, I will be using the semantic model exposed by &lt;code&gt;GeneratorExecutionContext&lt;/code&gt; because the semantic model is the one that contains information on what things &lt;em&gt;mean&lt;/em&gt; for e.g. if something is a reference type and a class etc. which is what I need to find out. I will modify the BuildDtoProperty() method and add two convenience extension methods as shown in the gist below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
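&lt;p&gt;&lt;em&gt;The gist of those two helpers, approximately; the class and parameter names are mine, and the extra unwrapping needed for nullable value types is glossed over here:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public static class PropertySymbolExtensions
{
    // Reference type, declared as a class, and living in the domain's own namespace
    public static bool IsOfTypeClass(this IPropertySymbol property, string domainNamespace)
    {
        return property.Type.IsReferenceType
            &amp;amp;&amp;amp; property.Type.TypeKind == TypeKind.Class
            &amp;amp;&amp;amp; property.Type.ContainingNamespace.ToDisplayString() == domainNamespace;
    }

    // Same idea, but for structs (value types)
    public static bool IsOfTypeStruct(this IPropertySymbol property, string domainNamespace)
    {
        return property.Type.IsValueType
            &amp;amp;&amp;amp; property.Type.TypeKind == TypeKind.Struct
            &amp;amp;&amp;amp; property.Type.ContainingNamespace.ToDisplayString() == domainNamespace;
    }
}
&lt;/code&gt;&lt;/pre&gt;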


&lt;p&gt;As it turns out, this semantic information about properties lives inside the semantic model as &lt;code&gt;ISymbol&lt;/code&gt; instances, and for properties more specifically in &lt;code&gt;IPropertySymbol&lt;/code&gt; instances, which expose type information. The &lt;code&gt;IsOfTypeClass&lt;/code&gt; method checks that the property type is a reference type, that its kind is &lt;code&gt;class&lt;/code&gt; and that the property type is within the same namespace as the original namespace. This last one is important, i.e. both DTO types should be in the same namespace; this means no external types are allowed, because it will be hard to be certain whether that type is controlled by the client application or not, hence it might be difficult to decorate with custom attributes and appropriately convert to a DTO. For example, if I create a &lt;code&gt;String&lt;/code&gt; property in my domain entity, without this check the generator will create a property of type &lt;code&gt;StringDto&lt;/code&gt;, which makes no sense since &lt;code&gt;String&lt;/code&gt; is a .NET CLR type, not a custom domain type, and is therefore not controlled by the consuming application.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;IsOfTypeStruct&lt;/code&gt; method is mostly the same except that it checks whether the type is a &lt;code&gt;struct&lt;/code&gt;. If either of these is true, then I want to suffix the original type name in the property with the word “Dto” to reference the DTO class. Whilst at it, I will also take care of nullable types! It would appear that &lt;strong&gt;IPropertySymbol.Type.Name&lt;/strong&gt; excludes the “?” from nullable types, whilst &lt;strong&gt;IPropertySymbol.Type.ToDisplayString()&lt;/strong&gt; includes it. The former is useful for complex types because I need to suffix “Dto” for the DTO property, whilst the latter will work for primitive types because the type name can go into the DTO verbatim. Using the display string for complex types could result in the type name looking like &lt;code&gt;Address?Dto?&lt;/code&gt;, which is syntactically wrong and will fail to compile.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qj0VRxef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qj0VRxef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a0.png" alt="⚠"&gt;&lt;/a&gt; Lot of this code is trial and error. Exploring the Roslyn syntax/semantics API can help in understanding which types contain what information but good ol’ trial and error is less painful than trying to debug the source generator. Its doable by calling &lt;code&gt;Debugger.Attach&lt;/code&gt; in the &lt;code&gt;Initialize&lt;/code&gt; method but I’ve found that it tends to create a vicious debug cycle where VS prompts the UAC dialog everytime something causes the debugger to run for e.g. any time you change anything in the code. Dismissing that dialog half a dozen times everytime you alter a single letter in code is a NIGHTMARE so I wouldn’t recommend that approach!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, I will change the mapper generation to include the ToDto invocation against any complex type properties. This is straightforward since it builds on the work already done above. For this I will modify the member loop in the main Execute method to do the same complex type vs primitive type check, and for complex types I will append the null-conditional operator and the “ToDto()” suffix at the end (to make it null safe):&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
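&lt;p&gt;&lt;em&gt;So the regenerated mapper for the entity above should now come out looking roughly like this:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;return new EmployeeDto
{
    Id = entity.Id,
    // complex, nullable property: converted null-safely via its own generated ToDto()
    HomeAddress = entity.HomeAddress?.ToDto()
};
&lt;/code&gt;&lt;/pre&gt;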


&lt;p&gt;Build the solution to generate the updated code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-15.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9JqIrc_x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-15.png%3Fw%3D1009" alt=""&gt;&lt;/a&gt;That’s more like it!&lt;/p&gt;

&lt;p&gt;And now run the consumer app to make sure that its all working:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-14.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yuLGYAsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-14.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;And the serialised version of the DTO agrees!&lt;/p&gt;

&lt;p&gt;If the address was never set, the serialised value will simply be &lt;code&gt;null&lt;/code&gt; but the app won’t crash due to a null-ref exception like it would have done if I hadn’t made the dto conversion null safe for nullable types.&lt;/p&gt;

&lt;p&gt;Finally the domain is asking me to change the Employee definition to keep track of all the assets an employee has been issued by the company, e.g. business phones, laptops etc.&lt;/p&gt;

&lt;p&gt;To accommodate this request I will make 2 changes to the domain model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create 2 new value types called &lt;code&gt;CompanyAsset&lt;/code&gt; and &lt;code&gt;AssetCode&lt;/code&gt;, in the domain and decorate them with the &lt;code&gt;GenerateMappedDto&lt;/code&gt; attribute. An asset MUST have a code associated with it. This is just to see how code gen will work with nested complex types, domain modeling is outside the scope of this blog series.&lt;/li&gt;
&lt;li&gt;Add an &lt;code&gt;IReadOnlyCollection&amp;lt;CompanyAsset&amp;gt;&lt;/code&gt; property called &lt;code&gt;AssetsAllocated&lt;/code&gt; and expose a method on the &lt;code&gt;Employee&lt;/code&gt; class to add assets to the collection when they are allocated to our employee. So now the entity class looks like this (see the sketch just after this list):
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
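&lt;p&gt;&lt;em&gt;A sketch of the updated domain model; members beyond the ones mentioned above are illustrative:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[GenerateMappedDto]
public struct AssetCode
{
    public string Value { get; set; } // illustrative
}

[GenerateMappedDto]
public struct CompanyAsset
{
    public string Name { get; set; }    // illustrative
    public AssetCode Code { get; set; } // an asset MUST have a code
}

[GenerateMappedDto]
public class Employee
{
    public Guid Id { get; set; }
    public Address? HomeAddress { get; set; }
    // ...other properties as before

    private readonly List&amp;lt;CompanyAsset&amp;gt; assets = new List&amp;lt;CompanyAsset&amp;gt;();
    public IReadOnlyCollection&amp;lt;CompanyAsset&amp;gt; AssetsAllocated =&amp;gt; assets;

    public void AllocateAsset(CompanyAsset asset) =&amp;gt; assets.Add(asset);
}
&lt;/code&gt;&lt;/pre&gt;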

&lt;p&gt;I’ll be able to build on the work done so far for much of the remaining challenge, but generic types still need to be handled properly, more specifically generic collection types as in this case. If I were to build the code in its current form, the generic collection property(ies) would have the same problem of mis-identified types. So to address this, what I want to do is: a) add a DTO property with type &lt;code&gt;IReadOnlyCollection&amp;lt;CompanyAssetDto&amp;gt;&lt;/code&gt;, and b) invoke the &lt;code&gt;ToDto()&lt;/code&gt; method on each item of this collection in the mapper extension, and so on down.&lt;/p&gt;

&lt;p&gt;The challenge now is to detect if the property type is a generic type and suffix all &lt;strong&gt;complex type arguments&lt;/strong&gt; with “Dto” so, &lt;code&gt;IReadOnlyCollection&amp;lt;CompanyAsset&amp;gt;&lt;/code&gt; will become &lt;code&gt;IReadOnlyCollection&amp;lt;CompanyAssetDto&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKoK3ab_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKoK3ab_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a1.png" alt="⚡"&gt;&lt;/a&gt;!!! You are now entering messy, hacky code territory!!! &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKoK3ab_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKoK3ab_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a1.png" alt="⚡"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Turns out this is a &lt;del&gt;little&lt;/del&gt; quite a bit more difficult to achieve using the semantic model alone so I will also use the syntactic model (&lt;em&gt;please read the inline comments in code to get some idea of what the hell is happening&lt;/em&gt;):&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The way I figured out which syntax types I need to use is with this nifty little tool called the &lt;em&gt;Syntax Visualizer&lt;/em&gt;. You can install this if you modify your VS installation to add the &lt;em&gt;.NET Compiler Platform SDK&lt;/em&gt; workload, via the &lt;em&gt;Visual Studio Installer&lt;/em&gt; app. The way this works is by simply clicking on the type in your code that you want to visualise, and the visualiser will automatically refresh and open up the corresponding node in the syntax tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-18.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rPcWVOJk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-18.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;What I am interested in is the &lt;code&gt;TypeArgumentList&lt;/code&gt; node of the &lt;code&gt;GenericNameSyntax&lt;/code&gt; node for this property&lt;/p&gt;

&lt;p&gt;Basically it comes down to which types in the type argument list should have the Dto suffix and which shouldn’t. All custom types, i.e. the ones defined in the &lt;strong&gt;Domain.Dtos&lt;/strong&gt; namespace, need a Dto suffix, whereas all .NET types don’t. In the &lt;code&gt;BuildTypeName()&lt;/code&gt; method, the &lt;code&gt;INamedTypeSymbol::TypeArguments&lt;/code&gt; will carry all type arguments listed on the generic type, whereas the node under consideration only refers to one type at a time, so I’ve got to do a “lookup” and then determine whether the type in the type argument list is custom or not, and then return appropriately suffixed DTO type names.&lt;/p&gt;
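&lt;p&gt;&lt;em&gt;A much-simplified approximation of that lookup, leaning only on the semantic model; &lt;code&gt;IsCustomType&lt;/code&gt; is a stand-in for the namespace check described earlier:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;static string BuildDtoTypeName(INamedTypeSymbol type, string domainNamespace)
{
    if (!type.IsGenericType)
        return IsCustomType(type, domainNamespace) ? type.Name + "Dto" : type.ToDisplayString();

    // Suffix only the custom type arguments with "Dto", leave framework types alone
    var arguments = type.TypeArguments.Select(argument =&amp;gt;
        IsCustomType(argument, domainNamespace) ? argument.Name + "Dto" : argument.ToDisplayString());

    return $"{type.Name}&amp;lt;{string.Join(", ", arguments)}&amp;gt;";
}
&lt;/code&gt;&lt;/pre&gt;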

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-16.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--juVtjDdd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-16.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;Ok! Property type name sorted, onto the mapper method…&lt;/p&gt;

&lt;p&gt;This is getting hackier (or at least uglier) by the minute because I am focussing on getting it to work first; I will eventually put a more cleaned-up version of the code up on GitHub, but for now I will highlight the chunk that fixes the conversion methods for properties with generic collection types.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Essentially, if the generic type argument is a primitive type then conversion is basically direct assignment from entity to DTO. But if any type argument is a custom type, then I will attach the “ToDto()” call to the assignment to convert from the entity type to the DTO type. I am also making an assumption about the entity and the DTO, that is, generic type arguments are only used with collection types like the ones I mentioned previously (so no &lt;code&gt;Task&amp;lt;T&amp;gt;&lt;/code&gt; in domain entities, for example). Therefore if I find generic types with complex types as arguments, then I will also generate extension methods to convert a collection of entity types to a collection of DTO types:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
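&lt;p&gt;&lt;em&gt;And the kind of collection-level extension that ends up being emitted looks roughly like this (it lives inside one of the generated static extension classes):&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public static IReadOnlyCollection&amp;lt;CompanyAssetDto&amp;gt; ToDto(this IReadOnlyCollection&amp;lt;CompanyAsset&amp;gt; source)
{
    // Each item is itself converted via its own generated ToDto()
    return source.Select(item =&amp;gt; item.ToDto()).ToList();
}
&lt;/code&gt;&lt;/pre&gt;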


&lt;p&gt;I am having to handle Dictionaries differently because they have 2 type arguments as opposed to just one and either TKey or TValue could be a custom type. I still have to fix this bit (&lt;em&gt;hence the 🤷‍♂️&lt;/em&gt;) but at this stage I am wondering if this whole thing is worth it in the first place? I mean just look at the code so far!! Horribly unreadable mess!&lt;/p&gt;

&lt;p&gt;Anyway, this results in the &lt;code&gt;EmployeeDto&lt;/code&gt; class that also has extension methods to convert collection type properties in the domain entity to their DTO counterpart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/07/image-19.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9QU6c6eR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/07/image-19.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;Finally!&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;HOLY CRAP! &lt;em&gt;That&lt;/em&gt; was a &lt;em&gt;lot&lt;/em&gt;! Am I done though? For this particular source generator, I think yes. What’s “outstanding”, i.e. niggling at the back of my mind? Well, a couple of things at least:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Putting DTOs closer to where they are used: &lt;em&gt;currently the code puts the generated DTOs within a sub-namespace within the domain, and this could be a bit of a problem because DTOs serve a different purpose than domain entities, so they should be colocated with the thing that uses them. In this case, that should be the host project, e.g. a web API etc. I’ve not yet found a way to put the generated code in a custom location, or whether it’s even possible. If it is, then a custom namespace could be passed to the attribute which the generator could use, but at the moment I am not sure.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Performance profiling of the build with and without source generation: &lt;em&gt;To be perfectly honest, in my sample scenario, I didn’t notice a whole lot of build slowdown. A couple of seconds to do a clean build doesn’t sound like a whole lot; of course this is going to be solution dependent. Given a large enough solution and a dog-slow build machine, things could change. The source generator’s Execute method itself takes &amp;lt; 20 ms on my laptop when doing builds inside Visual Studio (I’ve added a little bit of timing code that roughly measures this).&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Testability of source generators: &lt;em&gt;Because throughout this entire exercise my focus was on exploration and trying to see what’s possible, I didn’t really TDD it (sue me! It’s perfectly fine to not write tests when you are exploring/sketching because you don’t know how it will pan out!). I will tackle testing in a later post (accompanied by a fully refactored version of the code), assuming I haven’t given up on this problem by then! By the &lt;a href="https://github.com/dotnet/roslyn/blob/main/docs/features/source-generators.cookbook.md#unit-testing-of-generators"&gt;looks&lt;/a&gt; of things, this might be possible; I will have to see.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Debuggability of source generators: &lt;em&gt;One way to debug a source generator is to output another .cs file with logs written out as C# comments. The process of emitting this is no different from what I have shown here. Key thing to remember: the &lt;code&gt;hintName&lt;/code&gt; argument in &lt;code&gt;context.AddSource(...)&lt;/code&gt; should be whatever you want to name the generated .cs file, and the encoding MUST be UTF8; don’t let the optionality of that parameter fool you. F5 debugging of source generators is horrible, as I have already mentioned in a preceding section.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Some edge case domain entity structures might not be covered by the current generator or might not produce the correct output: &lt;em&gt;In order to keep the generator relatively simple and not have it do too much, I would keep special customisations out of it. So no ability to inject custom behaviour into the DTOs and/or extensions.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Ignoring properties that I don’t want mapped: &lt;em&gt;This is fairly straightforward to do and can be achieved by decorating such properties with another custom attribute, maybe [ExcludeFromMapping] or something. I might do this by the time you read this post.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Un-mapping DTOs: i.e. if you don’t want an entity to be mapped to a DTO anymore, just remove the &lt;code&gt;GenerateMappedDto&lt;/code&gt; attribute from the class and the generator will not generate code for it thereby effectively removing it. The generated code doesn’t get checked into the source control, so no harm either way.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I do see the value of source generators in affording productivity gains with regard to repetitive tasks that developers do that don’t change much from one instance to the next: for example, generating mapping code like the one I have shown in these posts, the canonical example of automatically generating implementations for interfaces (e.g. stubs etc.), and another one that I would like to try out: auto-generating tests for a public API, although this might also mean somehow auto-generating the whole test project and then generating test code &lt;em&gt;into&lt;/em&gt; that project.&lt;/p&gt;

&lt;p&gt;I find it a bit limiting that only new code can be created but existing code can’t be modified, although I can see where they are coming from on this. Allowing source generators to modify engineer-written code could be risky due to potential flakiness and stability issues.&lt;/p&gt;

&lt;p&gt;I also find the limited debugging options a real pain, as well as the fact that I have to restart VS multiple times to see the changes reflected, but I am hoping these are just teething problems. VS Code is a much better experience; however, it doesn’t have the capability of showing the generated code, so it’s a bit like flying blind.&lt;/p&gt;

&lt;p&gt;Discovering the Roslyn syntax APIs by trial and error is quite time-consuming, but tools like the Syntax Visualizer help, and once you’ve used the APIs you get some sense of what you need to use; then it’s just a matter of Ctrl + . exploration to find the right method/property to invoke.&lt;/p&gt;

&lt;p&gt;Anyway, this has been fun, the code is on &lt;a href="https://github.com/explorer14/SourceGenerators"&gt;GitHub&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Header image &lt;a href="https://developers.redhat.com/blog/2021/04/27/some-more-c-9"&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>sourcegenerators</category>
      <category>developerproductivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>Using Mikado Technique to Migrate to .NET Core</title>
      <dc:creator>Aman Agrawal</dc:creator>
      <pubDate>Wed, 20 Jan 2021 00:19:15 +0000</pubDate>
      <link>https://forem.com/coolblue/using-mikado-technique-to-migrate-to-net-core-4mhc</link>
      <guid>https://forem.com/coolblue/using-mikado-technique-to-migrate-to-net-core-4mhc</guid>
<description>&lt;p&gt;Not long ago in my team, we completed the migration of all our services to .NET Core. Of note was a mission-critical, full .NET Framework legacy Windows Service, and in this post I would like to share how we completed &lt;em&gt;that&lt;/em&gt; migration.&lt;/p&gt;

&lt;p&gt;We established 3 simple constraints to meet during this migration: changes should be &lt;strong&gt;&lt;em&gt;safe&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;i.e. not break existing behaviour&lt;/em&gt;), &lt;strong&gt;&lt;em&gt;incremental&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;keep the changeset small enough to be reverted easily if needed and not do a Big Bang Release&lt;/em&gt;) and &lt;strong&gt;&lt;em&gt;non-blocking&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;(should not block critical business features)&lt;/em&gt;, all the while making sure that the team is aware of the effort and can easily pick up the thread at any point!&lt;/p&gt;

&lt;p&gt;In order to meet these constraints we decided to employ the &lt;a href="https://www.manning.com/books/the-mikado-method"&gt;Mikado technique&lt;/a&gt; which allows you to scope work out and control risk better than just doing a BBR.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Mikado Technique?
&lt;/h3&gt;

&lt;p&gt;The key philosophy behind the Mikado technique is to identify the goal of the refactoring first, e.g. &lt;em&gt;Migrate the app to .NET Core&lt;/em&gt;. Then you make the most obvious change that will lead to that goal, e.g. &lt;em&gt;Change the target framework of the project to .NET Core&lt;/em&gt;, and see what fails to compile or if the tests fail. If there are build errors, e.g. &lt;em&gt;some third party package is incompatible with .NET Core&lt;/em&gt;, then you’ve identified things you need to do &lt;strong&gt;before&lt;/strong&gt; you can change the target framework. You then revert all the changes up until this point and start by fixing the issues reported in the build errors first, e.g. &lt;em&gt;find a compatible package version and install that&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;Repeat this process and at each step identify dependencies and arrange them in a graph structure as nodes until you reach leaf nodes i.e. nodes that don’t have further child nodes (dependencies).&lt;/p&gt;

&lt;p&gt;You then start the actual refactoring at these leaf nodes and work your way up to the ultimate goal. At each step you should run your tests and if everything still works, mark the node(s) as done and commit the changes to the source control ( &lt;strong&gt;incremental&lt;/strong&gt; ). You should also be able to deploy these changes at any time without breaking the existing behaviour ( &lt;strong&gt;safe&lt;/strong&gt; ) or revert the change if a business critical feature takes priority ( &lt;strong&gt;non-blocking&lt;/strong&gt; ).&lt;/p&gt;

&lt;p&gt;An important thing to keep in mind is that the Mikado graph also serves as a communication tool in addition to a refactoring tool; the artifact of the process is a graph that captures the changes to be made in order to reach the ultimate goal. Your team can then use this graph to pair or mob program, or even just pick up where you left off. It also gives a good indication of the progress of a refactoring effort, thus reducing the “holiday factor” (or “bus factor” if you must be dark and grim).&lt;/p&gt;

&lt;p&gt;How much detail you put in individual nodes is up to you; the important thing is that it should communicate the change clearly to your team members (and yourself). In some cases, we also added code screenshots into the diagram to help reference key pieces of code. Whatever works for you!&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrating to .NET Core
&lt;/h3&gt;

&lt;p&gt;Because in our case doing this with just one Mikado graph would have been akin to a BBR, we decided to break up the effort into 4 distinct phases, with each phase having its own Mikado graph:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consolidate external package dependencies across the solution&lt;/strong&gt; : &lt;em&gt;fairly isolated job, package dependencies can be upgraded or consolidated. The advantage of separating this into its own phase is that breaking changes can be addressed with a bit more peace of mind that nothing else outside of this has been changed so if anything does go wrong, the revert surface area will be small.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;We took a very pragmatic approach here, that of retaining what might be considered “old fashioned” libraries, as long as they were compatible with .NET Core, for e.g. Unity DI container and TopShelf. Just because there might be shinier packages available is not a good enough reason to switch. It introduces unnecessary risk and increases the scope of the work and goes against both (safe) and (incremental) constraints of such an effort.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Once the .NET Core migration is done, we might very well plan in another scoped refactoring to migrate to more modern libraries but that will be a Mikado Graph of its own. You see how this technique can help control risks by scoping?&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Simplify and unidirectionalise inter-project dependencies&lt;/strong&gt; : &lt;em&gt;we drew out the dependency diagram of all the solution’s projects, which helped us see the spaghetti mess a lot more clearly. Many of the projects not only had direct references to other projects, but also transitive references to them. Several other projects had independent references to 3rd party packages; for example, Newtonsoft.Json was referenced in multiple projects when it could easily be referenced by a shared project instead. This makes upgrading these common packages easier because they are all in one location.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrate from &lt;code&gt;System.Configuration&lt;/code&gt; to &lt;code&gt;IConfiguration&lt;/code&gt;&lt;/strong&gt; : &lt;em&gt;swapping out the configuration system was probably the trickiest part because it did require changes throughout the solution. The service follows the Ports and Adapters architectural style, so each adapter has its own configuration that it hydrates from the configuration system at bootstrap. This was also a phase that had a high failure impact, because you usually don’t catch configuration errors until it’s too late. So we extensively tested this part by writing automated verification tests and running the service locally before merging the changes.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;In the Mikado graph sample shown below, you’ll notice that it’s made up of several sub-graphs and some of the nodes have common dependencies. This is usually an indication of design flaws in the code, but solving that one dependency unlocks two other branches. We also decided to tackle these sub-graphs in an order that made sense for us, e.g. we picked out the most isolated ones first and then worked towards the ones that have a bit more impact&lt;/em&gt;. &lt;em&gt;Any additional refactoring that was necessary to make forward progress, we did it!&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade target framework to .NET Core (for executable projects) and .NET Standard (for library projects)&lt;/strong&gt;: &lt;em&gt;probably the least painful part of the whole exercise because we did 1 to 3 first, so this was just about changing the target framework moniker from &lt;code&gt;net461&lt;/code&gt; to &lt;code&gt;netcoreapp3.1&lt;/code&gt;&lt;/em&gt;. &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M8HQryyi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f603.png" alt="😃"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example graph for phase 3 looked like this :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codequirksnrants.files.wordpress.com/2021/01/mikado-sample-3.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf5F0a-_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2021/01/mikado-sample-3.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;em&gt;This is the real graph for migrating from System.Configuration to IConfiguration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Throughout each phase, we tested the changes locally, constantly pushed to the main branch and CI/CD-ed into production. Once deployed, we did several test runs in a pre-production (Acceptance) environment to make sure that no runtime exceptions or crashes sneaked into the hosted environment. Because our service only runs between 8AM and 10AM, it gives us a bit of breathing room outside of these hours to run in reduced availability mode if we need to and test stuff out. If we had higher availability requirements, then meeting the &lt;strong&gt;safe&lt;/strong&gt; and &lt;strong&gt;incremental&lt;/strong&gt; constraints of the effort would become even more crucial!&lt;/p&gt;

&lt;p&gt;The end results of this process were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Not one Big Bang Release&lt;/li&gt;
&lt;li&gt;Team was fully aware of what’s already done and what’s remaining&lt;/li&gt;
&lt;li&gt;We completed the migration within the planned timeframe&lt;/li&gt;
&lt;li&gt;Changes were &lt;strong&gt;always&lt;/strong&gt; safe and incremental&lt;/li&gt;
&lt;li&gt;We migrated the service to .NET Core without a single incident.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this service, we also applied this technique to other services that were relatively simpler to migrate and ended up with the same results. &lt;strong&gt;Bottomline: migration completed well within the timeframe without a single outage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you haven’t tried Mikado technique before, I would highly recommend it!&lt;/p&gt;

</description>
      <category>netcore</category>
      <category>architecturedesign</category>
      <category>pairprogramming</category>
      <category>refactoring</category>
    </item>
  </channel>
</rss>
