<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Willem Pino</title>
    <description>The latest articles on Forem by Willem Pino (@wpino).</description>
    <link>https://forem.com/wpino</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F675067%2Fc7f335ef-baa3-4986-9e2b-668224edf4e6.png</url>
      <title>Forem: Willem Pino</title>
      <link>https://forem.com/wpino</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wpino"/>
    <language>en</language>
    <item>
      <title>Design to Duty: The Accounting and Reporting Systems at Adyen</title>
      <dc:creator>Willem Pino</dc:creator>
      <pubDate>Mon, 23 Aug 2021 13:51:43 +0000</pubDate>
      <link>https://forem.com/adyen/design-to-duty-the-accounting-and-reporting-systems-at-adyen-3fae</link>
      <guid>https://forem.com/adyen/design-to-duty-the-accounting-and-reporting-systems-at-adyen-3fae</guid>
      <description>&lt;p&gt;In this blog, we will have a closer look at how we make decisions around our accounting system and how it evolved as a consequence. We will do the same for our reporting and analysis frameworks.&lt;/p&gt;

&lt;p&gt;This is part two of a series. So, if you have not done so already, it might be nice to start with the &lt;a href="https://www.adyen.com/blog/design-to-duty-adyen-architecture" rel="noopener noreferrer"&gt;first blog&lt;/a&gt;. In that one, we talked about what Adyen does at a high level, how we think about choosing between home-grown and open-source software, and how this shaped our edge services.&lt;/p&gt;

&lt;p&gt;The themes that were laid out in &lt;a href="https://www.adyen.com/blog/design-to-duty-adyen-architecture" rel="noopener noreferrer"&gt;part one&lt;/a&gt; will return and be referenced in this blog. If you didn’t read the first one, it might be that some context is missing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Accounting system
&lt;/h1&gt;

&lt;p&gt;Once we have processed a payment, the next step is accounting for it. This is needed because after processing transactions, we receive the money from our partners and we need to determine how much to settle to each merchant. Of course, we also need it for reporting.&lt;/p&gt;

&lt;p&gt;For every payment that enters the system, we do double-entry bookkeeping. The way we ensure that we do so correctly is quite unique to Adyen. The only way to add new records to the accounting system is by means of templates. A template in this context is a recipe that takes certain amounts and accounts as input and converts them into specific journals that can be inserted into the ledger.&lt;/p&gt;

&lt;p&gt;These templates are mathematically verified. To achieve this, we represent the amounts that serve as inputs by logical entities and prove that every combination of amounts will result in a net sum of 0. This verification is fully automated and runs on every change to the templates.&lt;/p&gt;

&lt;p&gt;All of this means we can guarantee that if at any time, we sum up all the records in the accounting system, the result will always be 0. Combine this with the aforementioned double-entry bookkeeping, and it means for every euro that ever went through Adyen, we know exactly where it came from and where it went.&lt;/p&gt;
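&lt;p&gt;As a sketch of the template idea (the account names, amounts, and template shown here are hypothetical, purely for illustration), a template takes amounts and accounts as input and emits journal lines whose debits and credits always net to zero:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class JournalLine:
    account: str
    amount: int  # minor units; positive is a debit, negative a credit

def capture_template(merchant: str, scheme: str, gross: int, fee: int):
    """Hypothetical template: converts input amounts and accounts
    into journal lines ready for insertion into the ledger."""
    return [
        JournalLine("receivable/" + scheme, gross),
        JournalLine("payable/" + merchant, -(gross - fee)),
        JournalLine("revenue/fees", -fee),
    ]

# The zero-sum invariant holds for any gross and fee:
lines = capture_template("acme", "visa", 10_000, 250)
assert sum(line.amount for line in lines) == 0
```

&lt;p&gt;The real verification is symbolic rather than example-based: the input amounts are treated as logical entities and the zero-sum property is proven for every possible combination, not merely tested for one.&lt;/p&gt;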

&lt;p&gt;We leverage this same system of accounting for our banking platform.&lt;/p&gt;

&lt;p&gt;It might be superfluous to mention, but our accounting framework is also written in-house. Here the choice was evident: it is as core to our business as can be, and nothing in the open-source landscape came close to what we wanted.&lt;/p&gt;


&lt;h2&gt;
  
  
  Underlying Database
&lt;/h2&gt;

&lt;p&gt;Having all these nice things does come with some cost. Over the lifetime of a payment transaction, about 50 rows have to be inserted into the accounting database. This means that, just for the accounting system, the number of inserts per second is an order of magnitude higher than the hundreds of transactions we process every second. Some time ago these thousands of inserts per second started to cause issues for our PostgreSQL database. &lt;a href="https://www.adyen.com/blog/updating-a-50-terabyte-postgresql-database" rel="noopener noreferrer"&gt;This blog&lt;/a&gt; has more information on maintaining such large PostgreSQL databases.&lt;/p&gt;

&lt;p&gt;We had already split a reporting database from our accounting database to minimize reads (more on this below), but at some point, even with mainly inserts and updates, the database didn’t scale anymore. This is when we decided to shard our accounting database into several accounting clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F2000%2F0%2ArTo_S_8bcqd66hRP.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F2000%2F0%2ArTo_S_8bcqd66hRP.png" alt="accounting DB"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sharding strategy
&lt;/h2&gt;

&lt;p&gt;Now we have several accounting databases, or clusters, processing side-by-side. We considered incorporating domain knowledge into the sharding strategy, but, for several reasons, eventually decided to go for a simple round-robin strategy with some parameters we can adjust.&lt;/p&gt;

&lt;p&gt;First of all, every rule we considered would bring its own problems. For example, if we put each processing merchant in a single database, we would still need to go to every shard when we need aggregate data to send to our payment method partners. The same holds the other way around: if we split by payment method, we would still need to go to every shard for aggregate data on the merchant level.&lt;/p&gt;

&lt;p&gt;Incorporating business logic would also complicate the routing: a round-robin strategy is simple and robust, and we do not have to think about balancing the shards. Finally, domain-based routing would cost us flexibility. At the moment we can simply add a new cluster whenever we need more capacity, or remove one from the routing when we see strange behavior we want to investigate.&lt;/p&gt;

&lt;p&gt;In the end, we decided the benefits domain knowledge routing would offer were not worth the loss of flexibility and increased complexity.&lt;/p&gt;
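&lt;p&gt;A minimal sketch of such a router (the cluster names are hypothetical, and the real router has adjustable parameters beyond what is shown here):&lt;/p&gt;

```python
import itertools

class ShardRouter:
    """Round-robin routing over accounting clusters. Because no
    business key is tied to a shard, clusters can be added for
    capacity or removed for investigation without rebalancing."""
    def __init__(self, clusters):
        self.clusters = list(clusters)
        self._cycle = itertools.cycle(self.clusters)

    def next_cluster(self):
        return next(self._cycle)

    def remove(self, cluster):
        self.clusters.remove(cluster)
        self._cycle = itertools.cycle(self.clusters)

router = ShardRouter(["acct-1", "acct-2", "acct-3"])
assert [router.next_cluster() for _ in range(4)] == [
    "acct-1", "acct-2", "acct-3", "acct-1"
]
router.remove("acct-2")  # e.g. taken out of routing for investigation
```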

&lt;h2&gt;
  
  
  Migration
&lt;/h2&gt;

&lt;p&gt;The migration to a sharded accounting database was quite painful, for two reasons. First of all, the accounting logic, which is embedded deeply throughout the code of any payments processor, was written on the assumption that there was just one accounting database. All of this had to be rewritten.&lt;/p&gt;

&lt;p&gt;As an example, consider a payout to a merchant. This needs to come from one account. However, because the underlying transactions were processed on different clusters, the money needs to be transferred from one cluster to another in order to end up in the same account. Doing this without compromising on the strict correctness requirement was quite difficult. In the end, we created several jobs and processes that use back-and-forth messages between the clusters to keep everything aligned.&lt;/p&gt;

&lt;p&gt;The second complicating factor was that, if we received reports on processed transactions from our partners, we needed to match them to transactions in different clusters. Instead of parsing a file and matching it directly to the transactions, we introduced a two-step framework that would first parse a file and then split it into the relevant parts for each cluster. The second step was to match the relevant transactions within the clusters.&lt;/p&gt;
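&lt;p&gt;A rough sketch of that first step, with hypothetical record fields and cluster names (the real integrations are far more custom):&lt;/p&gt;

```python
from collections import defaultdict

def split_report(records, locate_cluster):
    """Step 1: take the parsed partner report and split it into the
    relevant parts per accounting cluster. locate_cluster maps a
    transaction reference to the cluster that processed it."""
    per_cluster = defaultdict(list)
    for rec in records:
        per_cluster[locate_cluster(rec["reference"])].append(rec)
    return per_cluster  # step 2 matches each part inside its cluster

index = {"tx-1": "acct-1", "tx-2": "acct-2"}
parts = split_report(
    [{"reference": "tx-1", "amount": 100},
     {"reference": "tx-2", "amount": 50}],
    index.get,
)
assert sorted(parts) == ["acct-1", "acct-2"]
```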

&lt;p&gt;Whereas the first problem was solvable in a generic way, reconciliation needs to happen for many different, highly custom integrations, so this was a real team effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future challenges
&lt;/h2&gt;

&lt;p&gt;From a scaling perspective, the risk that this approach introduces is that any process that depends on all accounting databases being up to date or available becomes a potential liability.&lt;/p&gt;

&lt;p&gt;Historically, processes such as our back office (admin area) would interface directly with the accounting databases to display data or to make corrections. If one of the accounting clusters is not reachable this cannot be done anymore.&lt;/p&gt;

&lt;p&gt;This is not a problem when there are only a few clusters, but as the number of accounting clusters grows, so does the chance of any one of them being unavailable, planned or unplanned. This means that instead of interacting with the clusters directly, we need to go through an intermediary that mitigates this risk.&lt;/p&gt;

&lt;p&gt;Rebalancing is another open question: it is essential in traditional sharding (inside one database) but has not been implemented yet for this setup. If we add a new cluster, it starts empty while the old clusters keep growing. How do we avoid the original clusters becoming too big?&lt;/p&gt;

&lt;h1&gt;
  
  
  Connecting the edge services with the accounting system
&lt;/h1&gt;

&lt;p&gt;Talking so casually about the downtime of our accounting clusters hints at different priorities in this part of the system compared to our edge systems discussed earlier. Whereas for the latter the priority was uptime, for the accounting systems we don’t mind if they are down for a little bit as long as we can guarantee their reliability and consistency. For me, it was interesting to see how these systems are tied together.&lt;/p&gt;

&lt;p&gt;A naive solution brings these conflicting priorities into direct opposition. The PAL would forward the payment to the accounting system to save it. If something goes wrong, do you still save the payment, risking inconsistent data and compromising the consistency priority of the accounting system? Or do you stop processing, compromising the uptime priority of the frontend? We went for a failsafe in between.&lt;/p&gt;

&lt;p&gt;The Payments Acceptance Layer (or PAL, see part one) saves the payment in a local database, and a separate process picks it up when possible. This process will fall behind if we doubt whether we can guarantee consistency in the accounting, or if an accounting database is not available, but that does not impact the availability of the PAL itself. An added benefit of this setup is that if the PAL crashes, no payments are lost, because the queue is stored in a database and not in memory.&lt;/p&gt;
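&lt;p&gt;The pattern can be sketched as follows; SQLite stands in for the local PostgreSQL database here, and the table and function names are hypothetical:&lt;/p&gt;

```python
import sqlite3

# The PAL writes the payment to a durable local queue table and
# returns; a separate process drains the table toward accounting.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue "
    "(id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)"
)

def accept_payment(payload):
    # Synchronous path: durable local write, then respond to the merchant.
    conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
    conn.commit()

def drain(accounting_up):
    # Asynchronous path: runs behind when accounting is unavailable,
    # without impacting the PAL's ability to accept payments.
    if not accounting_up:
        return 0
    rows = conn.execute("SELECT id FROM queue WHERE done = 0").fetchall()
    for (row_id,) in rows:
        # here the real system would forward to an accounting cluster
        conn.execute("UPDATE queue SET done = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

accept_payment("payment-1")
assert drain(accounting_up=False) == 0  # queue survives downstream downtime
assert drain(accounting_up=True) == 1
```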

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F2000%2F0%2A9Y9apDUCsAkJiMMt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F2000%2F0%2A9Y9apDUCsAkJiMMt.png" alt="Queue processor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.adyen.com/blog/The-Adyen-way-of-engineering--we-design-for-20x" rel="noopener noreferrer"&gt;In another blog of ours&lt;/a&gt;, you can find more info on how this asynchronous queue between the different layers works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technology choices
&lt;/h2&gt;

&lt;p&gt;The large accounting database and the short-lived queue both use a PostgreSQL database. This was again a pragmatic choice. In the beginning, the requirements on the system were not that high, so we went for a one-size-fits-all solution for databases. It might be that better solutions for queueing were available at that time or since, but we felt that it was not worth the additional complexity they would add to our system.&lt;/p&gt;

&lt;p&gt;We have since been pleasantly surprised by how well PostgreSQL has performed for both use cases. At this point, we have dozens of local databases that can be instantiated dynamically at application startup and transactional databases of hundreds of terabytes running on the same technology.&lt;/p&gt;

&lt;p&gt;This shows that specialized solutions designed for problems that occur at very large scales might not be needed for smaller applications (or even quite large ones), while they often add a lot of complexity. Of course, this is always a balancing act, because a specialized solution might offer nice features, and it is comforting to know that it will definitely scale.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reporting
&lt;/h1&gt;

&lt;p&gt;Creating separate accounting clusters creates a new challenge. Namely, how do you generate reports for every merchant and payment method every day from all these data sources?&lt;/p&gt;

&lt;p&gt;We did have a reporting database where we saved the relevant data in a denormalized form, initially created to minimize reads on the accounting database. Relying on it might work for a while, but eventually it would become just another bottleneck.&lt;/p&gt;

&lt;p&gt;For this part of the system, the priority was throughput. We decided to stream the data from the clusters and to have specialized processes consume and preprocess the data for specific use cases. This way, only a limited number of processes have to read the stream, and by the time a report is generated, a lot of the work is already done.&lt;/p&gt;
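&lt;p&gt;As an illustration of such a specialized consumer (the field names and the particular aggregation are hypothetical), one process might pre-aggregate totals per merchant and day while reading the stream once:&lt;/p&gt;

```python
from collections import defaultdict

def consume(stream, daily_totals):
    """A specialized consumer reads the record stream once and
    pre-aggregates per merchant and day, so generating a report
    later is mostly a lookup rather than a scan of raw data."""
    for rec in stream:
        daily_totals[(rec["merchant"], rec["day"])] += rec["amount"]

totals = defaultdict(int)
consume(
    [{"merchant": "acme", "day": "2021-08-23", "amount": 100},
     {"merchant": "acme", "day": "2021-08-23", "amount": 50}],
    totals,
)
assert totals[("acme", "2021-08-23")] == 150
```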

&lt;p&gt;The reporting tables can be split over as many databases as needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F2000%2F0%2AtjrZULUnc1zNeTdU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F2000%2F0%2AtjrZULUnc1zNeTdU.png" alt="Reporting DB"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the time, a lot of different technologies were considered for this, especially Kafka. Our existing processes were under quite some pressure, so time was of the essence. We needed a high-throughput, low-latency streaming framework that could ensure exactly-once processing, even if processes crashed.&lt;/p&gt;

&lt;p&gt;We evaluated a lot of open-source technologies, but none offered the feature set we were looking for. For example, exactly-once delivery was not yet supported by Kafka at the time.&lt;/p&gt;

&lt;p&gt;On the other hand, we had a lot of familiarity with another technology that was close at hand and had proven very reliable. At this point, you might not be surprised to learn that this was PostgreSQL. For the same reasons, we used Java to write the application code on top of it.&lt;/p&gt;

&lt;p&gt;Even though we did have to do some rounds of optimizations and we are missing some features, such as topics, we are happy with our choice. The setup stood the test of time even though traffic increased by an order of magnitude.&lt;/p&gt;

&lt;p&gt;Even though we initially chose to write something ourselves, this is not a definitive choice. In fact, there is an ongoing discussion about whether this framework will allow us to scale another 20x or whether we should swap it out for an open-source solution such as Kafka. The 20x scaling factor is a rule of thumb we often use internally when designing a solution or determining whether we are still satisfied with it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Analysis
&lt;/h1&gt;

&lt;p&gt;For our data analysis setup, we did not build much in-house but chose to adopt industry standards. We run a Spark Hadoop cluster combined with Airflow. Deploying it on our own servers was an effort but now it is running smoothly. There is &lt;a href="https://www.adyen.com/blog/building-our-data-science-platform-with-spark-and-jupyter" rel="noopener noreferrer"&gt;a blog about the initial shift&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What is remarkable about this setup is that we still use our custom streaming system to verify that all the information is actually consumed and correct.&lt;/p&gt;

&lt;p&gt;With the Spark Hadoop cluster in place, the main struggle was to feed the results back into the synchronous systems. &lt;a href="https://www.adyen.com/blog/predicting-and-monitoring-payment-volumes-with-spark-and-elasticsearch" rel="noopener noreferrer"&gt;Here&lt;/a&gt; we describe how we did this for monitoring. After this worked for that use case, we expanded it into a generic framework that can also score the machine learning models in real-time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The focus of this blog was on payments processing but, as mentioned before, the concepts translate almost perfectly to other business contexts such as the bank. This is because all of these systems demand high availability and low latency in the edge services, and strong consistency and reliability in the accounting layer. The reporting and analysis frameworks tie all of the systems together.&lt;/p&gt;

&lt;p&gt;This similarity in architecture, together with our conservative tech stack, allows developers to easily switch between teams. They already know the general design and the technologies used, even if the business context is completely different.&lt;/p&gt;

&lt;p&gt;We hope that this blog made both our architecture and the way we arrived at it clearer. Perhaps it will also influence how you evaluate design choices for your own company in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical careers at Adyen
&lt;/h2&gt;

&lt;p&gt;We are on the lookout for talented engineers and technical people to help us build the infrastructure of global commerce!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.adyen.com/careers/vacancies/development" rel="noopener noreferrer"&gt;Check out developer vacancies&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer newsletter
&lt;/h2&gt;

&lt;p&gt;Get updated on new blog posts and other developer news.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.adyen.com/newsletter/developers" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>fintech</category>
      <category>adyen</category>
    </item>
    <item>
      <title>Design to Duty: How we make architecture decisions at Adyen</title>
      <dc:creator>Willem Pino</dc:creator>
      <pubDate>Wed, 28 Jul 2021 13:46:55 +0000</pubDate>
      <link>https://forem.com/adyen/design-to-duty-how-we-make-architecture-decisions-at-adyen-4alo</link>
      <guid>https://forem.com/adyen/design-to-duty-how-we-make-architecture-decisions-at-adyen-4alo</guid>
      <description>&lt;p&gt;At Adyen, we have a very pragmatic way of approaching problems. As a result, we use simple tools to achieve great results.&lt;/p&gt;

&lt;p&gt;The goal of this blog is to walk you through the challenges we faced in scaling our system, how we tackled those challenges, and how our system looks because of those decisions. In particular, we will pay attention to the choice between home-grown solutions versus open-source software.&lt;/p&gt;

&lt;p&gt;In the first installment of the blog, we will discuss these topics as they relate to our edge services, and in the second part, we will do the same for our accounting and reporting layers.&lt;/p&gt;

&lt;p&gt;Instead of just explaining how our architecture looks, we thought it would be interesting to explain why it looks that way. At Adyen, we are firm believers in giving developers responsibility: responsibility for the design and implementation of a system, and also for its security and maintenance. The design of our system is therefore done by engineers who are also on duty. These engineers are strong contributors in deciding how to build something, which sometimes leads to counterintuitive results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building home-grown solutions or choosing open-source software
&lt;/h2&gt;

&lt;p&gt;New joiners to Adyen are often surprised about some instances in which we built stuff ourselves. Because of this, we thought it would be interesting to discuss some of these choices while going through our architecture. For an extreme example of building it ourselves, &lt;a href="https://www.youtube.com/watch?v=M8dnvIFD8Cs"&gt;see this short video about why we built our own bank&lt;/a&gt;. Point 3 in &lt;a href="https://www.adyen.com/blog/from-0-100-billion-scaling-infrastructure-and-workflow-at-adyen"&gt;this blog, about the principles we used to scale our platform&lt;/a&gt;, talks more about which technologies we consider for adoption.&lt;/p&gt;

&lt;p&gt;When we are confronted with challenges, the proposed solution is often either to introduce a new tool or framework or to write something in-house. From Adyen’s perspective, writing something yourself will give you more flexibility but it will cost more time and probably have fewer features than an open-source alternative.&lt;/p&gt;

&lt;p&gt;The usability of an open-source option is likely higher due to better documentation and a larger community, but it might be more complicated, because of additional features that we don’t need and investments in training the people that need to work with it. The security of an open-source option will probably be better because many people vetted it, but the attack surface is also almost always larger.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our view on vendor solutions
&lt;/h3&gt;

&lt;p&gt;A lot of businesses also consider vendor solutions, and so do we. However, we want to focus on the core flows in our system, and for those we never choose a proprietary solution, because we want to have full control.&lt;/p&gt;

&lt;p&gt;We buy instead of build if the use case is peripheral, does not have to be real-time, does not have to be embedded, or simply offers good value. An example would be some of our KYC background checkers. Of course, avoiding lock-in is an important consideration here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adyen at a glance
&lt;/h2&gt;

&lt;p&gt;Adyen is in the business of processing payments. We receive a payment from a merchant; we send that payment to a third party (such as a credit card company) and we return the result. This happens hundreds of times a second, 24/7. We also keep track of all the transactions so we can pass along the money to the merchant once we receive it from the bank. Of course, we also provide reporting to our merchants.&lt;/p&gt;

&lt;p&gt;We do this for hundreds of billions of euros every year. In the last couple of years, we have introduced additional products such as &lt;a href="https://www.adyen.com/issuing"&gt;card issuing&lt;/a&gt;, &lt;a href="https://www.adyen.com/blog/adyens-banking-license"&gt;a bank&lt;/a&gt;, and &lt;a href="https://www.adyen.com/our-solution/marketplaces-and-platforms"&gt;Adyen for Platforms&lt;/a&gt;, which helps platform businesses like ride-sharing or marketplaces. We do all of this on a single platform, in a single repository (monorepo), almost exclusively written in Java.&lt;/p&gt;

&lt;p&gt;Our system is divided into several parts that function relatively independently from each other. They are split along business domains. For example, we have one part centered on payment processing and another part centered on the bank. In the data layer, they are tied together. The same design principles are applied for each of the subsystems. So while we mainly cover payment processing in this blog, the architecture is similar across the board.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h-rqI1zF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1400/0%2A7aZWamzeFUW4U5k9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h-rqI1zF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1400/0%2A7aZWamzeFUW4U5k9.jpg" alt="architecture flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A payment sent in by a merchant will arrive at our edge layer. Here we do synchronous processing, contact a payment method if needed, and return the result to the merchant. Availability and low latency are paramount here. In parallel, we send this payment to our backend systems, where we store it in our accounting system. Accuracy and reliability are the key priorities in this part of the system. Finally, the payment ends up in our data processing layer, where throughput becomes the major concern. We will go through each of these layers, discussing the choices we made that shaped them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge services
&lt;/h2&gt;

&lt;p&gt;Every API call to our systems goes through our edge services first. The payment can come either from a &lt;a href="https://www.adyen.com/pos-payments"&gt;payment terminal&lt;/a&gt;, from a &lt;a href="https://www.adyen.com/our-solution/online-payments"&gt;mobile application&lt;/a&gt;, via a direct API call, or from a &lt;a href="https://www.adyen.com/pay-by-link"&gt;payment page hosted by us&lt;/a&gt;. The Payments Acceptance Layer (PAL) is a crucial service in our edge layer. All payments pass through it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--okI5DKmi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/2000/0%2AwhFzZqns8IhV9H3q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--okI5DKmi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/2000/0%2AwhFzZqns8IhV9H3q.png" alt="architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This application will send the payment to our other internal services. These other services can be a risk engine, a service for saving or generating recurring payment tokens, or a service for computing which connection will lead to the highest authorization rate. It will also contact (through an intermediary service) the partner payment method or scheme that processes the payment.&lt;/p&gt;

&lt;p&gt;An important design feature is that all payments are abstracted at the PAL so the system can treat them as equal. There are, of course, differences between them. Some will have additional metadata (for Point of Sale transactions this might be the ID of the terminal that processed them). However, they all go through the same system and are stored in the same databases.&lt;/p&gt;

&lt;p&gt;The engineers who handled the initial design had already gained experience at a previous payments company. In that company, a payment that came into the system would keep some state in the edge layer. If a new payment arrived that modified the original payment, for example a refund, it could be processed immediately, as all the required information was already stored in the edge layer.&lt;/p&gt;

&lt;p&gt;The problem with this setup is twofold. An application could not go down for maintenance or crash without affecting the ability to process transactions. The other problem is that a new machine could not immediately process the same volume as an old one, because some transactions needed to go to a specific instance: the state in the application made each instance unique.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stateless services
&lt;/h3&gt;

&lt;p&gt;Taking a step back, it is possible to see why we did it differently at Adyen. The priority for this part of the system is to be highly available and have low latency. We should always be able to accept payments and process them as fast as possible. Instead of keeping the state in our edge layer, we decided to process modifications asynchronously, which keeps the edge layer stateless.&lt;/p&gt;

&lt;p&gt;As a result, any PAL instance can be shut down without impacting our ability to process payments, and a new PAL can immediately be as effective as the other PALs already running. This makes our horizontal scaling linear. In other words, if one PAL can process X payments per second, two PALs can process 2X payments per second. This mechanism has been basically unchanged since the start of the company, testifying to its power.&lt;/p&gt;

&lt;p&gt;The fact that the edge services are stateless means that they cannot write directly to centralized databases. This is not only beneficial for scaling the system but also a very nice security feature. All our externally exposed systems are prohibited from writing to central databases, reducing the risk of attackers compromising or stealing valuable data. By ingraining a strong sense of security into developers, we can have security by design, instead of having to retroactively patch holes in the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed databases
&lt;/h3&gt;

&lt;p&gt;More recently we faced the challenge of making our payments API idempotent. This means that if a merchant sends us the same exact payment twice, we should only process it once but return the same response in both cases.&lt;/p&gt;

&lt;p&gt;As you now know, we do not want to achieve this by restricting payments of some merchants to certain machines, as this would mean the machines are no longer linearly scalable. The information needs to be available locally, so we eventually decided on integrating &lt;a href="https://www.cockroachlabs.com/"&gt;Cockroach&lt;/a&gt;, a distributed database, with our PALs.&lt;/p&gt;
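&lt;p&gt;The idempotency pattern itself can be sketched as follows; a plain dictionary stands in for the shared distributed store, and all names are hypothetical:&lt;/p&gt;

```python
class IdempotentPaymentApi:
    """Idempotency via a shared store available to every instance
    (a distributed database in Adyen's case; a dict stands in here).
    Any PAL instance can serve a retry, because the key lookup is
    not tied to one machine, preserving linear scalability."""
    def __init__(self, store, process):
        self.store = store      # shared across all PAL instances
        self.process = process  # actually executes the payment once

    def handle(self, idempotency_key, payment):
        if idempotency_key in self.store:
            return self.store[idempotency_key]  # replay stored response
        response = self.process(payment)
        self.store[idempotency_key] = response
        return response

calls = []
api = IdempotentPaymentApi(
    {}, lambda p: calls.append(p) or {"status": "authorised"}
)
first = api.handle("key-1", {"amount": 100})
second = api.handle("key-1", {"amount": 100})  # exact retry
assert first == second
assert len(calls) == 1  # processed only once
```

&lt;p&gt;A production implementation would additionally need the check-then-store step to be atomic (for example, a conditional insert in the database) so that two concurrent retries cannot both process the payment.&lt;/p&gt;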

&lt;p&gt;We could have built something ourselves here (probably on top of PostgreSQL) but this is really not our core business and there were several open-source options that satisfied our criteria. Nevertheless, getting properly acquainted with the DB and optimizing it to the point where we were satisfied with it required a substantial effort. For another example of a decision between open-source and building ourselves, see &lt;a href="https://www.adyen.com/blog/the-adyen-way-of-engineering-oss-or-built-in-house"&gt;this blog on our graph risk engine&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Iterations
&lt;/h3&gt;

&lt;p&gt;The next big step for our edge services would be to scale them dynamically. We manage our own infrastructure in several data centers around the world and have bare metal available, but the hardware and software are still tightly coupled.&lt;/p&gt;

&lt;p&gt;The reasons we are not deploying to the cloud are part historical, part legal, and part technical. In part, because we now need dynamic scaling, we are moving towards running our system on an internal cloud. A blog on the containerization effort is forthcoming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this blog shows how we make decisions at Adyen about how to scale and which technologies to use. In all our choices we are highly pragmatic, adhering to the KISS (keep it simple, stupid) principle almost to a fault. This sometimes runs counter to established doctrines in the industry but it shows that you can build a solid company with a small set of trusted technologies.&lt;/p&gt;

&lt;p&gt;In the next blog, we will look at the architecture of our accounting and reporting systems. Unlike our edge layer, which has remained relatively static, the accounting system actually had to be redesigned several times. In that blog, we will also expand more on building home-grown solutions versus choosing open-source software.&lt;/p&gt;




&lt;h3&gt;
  
  
  Technical careers at Adyen
&lt;/h3&gt;

&lt;p&gt;We are on the lookout for talented engineers and technical people to help us build the infrastructure of global commerce!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.adyen.com/careers/vacancies/development"&gt;Check out developer vacancies&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer newsletter
&lt;/h3&gt;

&lt;p&gt;Get updated on new blog posts and other developer news.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.adyen.com/newsletter/developers"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>architecture</category>
      <category>payments</category>
    </item>
  </channel>
</rss>
