<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aivars Kalvāns</title>
    <description>The latest articles on Forem by Aivars Kalvāns (@aivarsk).</description>
    <link>https://forem.com/aivarsk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F367555%2Fb61510de-ac39-41f9-8c25-0d8de79e920e.jpg</url>
      <title>Forem: Aivars Kalvāns</title>
      <link>https://forem.com/aivarsk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aivarsk"/>
    <language>en</language>
    <item>
      <title>TigerBeetle as a file storage</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Sun, 07 Dec 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/tigerbeetle-as-a-file-storage-540p</link>
      <guid>https://forem.com/aivarsk/tigerbeetle-as-a-file-storage-540p</guid>
      <description>&lt;p&gt;&lt;em&gt;Could not keep it under the rug until April Fool’s Day&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TigerBeetle is a reliable, fast, and highly available database for financial accounting. It tracks financial transactions &lt;strong&gt;or anything else that can be expressed as double-entry bookkeeping&lt;/strong&gt;, providing three orders of magnitude more performance and &lt;strong&gt;guaranteeing durability even in the face of network, machine, and storage faults.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrzk9remn36sqkfhvelx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrzk9remn36sqkfhvelx.png" alt="Challenge accepted" width="225" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Continuing my &lt;a href="https://aivarsk.com/2025/12/06/tigerbeetle-without-olgp-database1/" rel="noopener noreferrer"&gt;if all you have is a hammer, everything looks like a nail&lt;/a&gt; journey, I wanted to store arbitrary binary blobs in TigerBeetle to protect them from storage faults. If I can do that, I can store anything.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; field of my Accounts will contain the filename (16-byte limit). I will store the total file size in the &lt;code&gt;user_data_64&lt;/code&gt; field and the filename length in the &lt;code&gt;user_data_32&lt;/code&gt; field (to simplify decoding). And my Accounts will have the nice property that &lt;code&gt;credits_posted&lt;/code&gt; will contain the actual number of bytes written, so I can detect failed uploads and resume them (a future TODO).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def create_a_file(filename, size):
    # Check the encoded length: the id field holds at most 16 bytes
    if len(filename.encode()) &amp;gt; 16:
        raise ValueError("Invalid filename, more than 16 bytes")
    account = tb.Account(
        id=int.from_bytes(filename.encode(), "big"),
        user_data_64=size,
        user_data_32=len(filename),
        ledger=FILE,
        code=FILE,
    )
    errors = client.create_accounts([account])
    if errors:
        raise ValueError(errors[0])
    return account

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
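&lt;p&gt;Reading the file back later means recovering the filename from the &lt;code&gt;id&lt;/code&gt;. Here is a minimal plain-Python sketch of the packing and its inverse, no TigerBeetle client needed; the explicit &lt;code&gt;"big"&lt;/code&gt; byte order is an assumption that matches the &lt;code&gt;int.from_bytes&lt;/code&gt; default in Python 3.11+ and keeps the code working on older versions:&lt;/p&gt;

```python
def encode_filename(filename: str) -> tuple[int, int]:
    """Pack a filename into a 128-bit Account id plus its byte length."""
    data = filename.encode()
    if len(data) > 16:
        raise ValueError("Invalid filename, more than 16 bytes")
    return int.from_bytes(data, "big"), len(data)


def decode_filename(account_id: int, name_len: int) -> str:
    """Recover the filename from the id using the length in user_data_32."""
    return account_id.to_bytes(name_len, "big").decode()
```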



&lt;p&gt;What makes me unhappy is that I have not found a good use for &lt;code&gt;user_data_128&lt;/code&gt; on the Account record. Such a waste of resources!&lt;/p&gt;

&lt;p&gt;I will store the actual bytes in the Transfer &lt;code&gt;user_data_128&lt;/code&gt;, &lt;code&gt;user_data_64&lt;/code&gt;, and &lt;code&gt;user_data_32&lt;/code&gt; fields. That gives a total of 28 bytes per Transfer, and the Transfer &lt;code&gt;amount&lt;/code&gt; will contain the number of bytes used in the Transfer: 28 for all Transfers except the last one, which holds the remaining bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
            transfers.append(
                tb.Transfer(
                    id=tb.id(),
                    debit_account_id=system_id,
                    credit_account_id=file_id,
                    amount=len(block),
                    user_data_128=int.from_bytes(block[:16], "big"),
                    user_data_64=int.from_bytes(block[16:24], "big"),
                    user_data_32=int.from_bytes(block[24:], "big"),
                    ledger=FILE,
                    code=FILE,
                )
            )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
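&lt;p&gt;The subtle part of this 28-byte packing is the short final block: decoding has to use the segment lengths derived from &lt;code&gt;amount&lt;/code&gt;, or a partial block comes back left-padded with zeros. A plain-Python sketch of the scheme and its inverse (no TigerBeetle client; &lt;code&gt;"big"&lt;/code&gt; byte order assumed, as above):&lt;/p&gt;

```python
def pack_block(block: bytes) -> tuple[int, int, int, int]:
    """Mirror the Transfer fields: user_data_128, user_data_64, user_data_32, amount."""
    if len(block) > 28:
        raise ValueError("a Transfer carries at most 28 bytes")
    return (
        int.from_bytes(block[:16], "big"),
        int.from_bytes(block[16:24], "big"),
        int.from_bytes(block[24:], "big"),
        len(block),
    )


def unpack_block(u128: int, u64: int, u32: int, amount: int) -> bytes:
    """Rebuild the original bytes; segment lengths come from amount so a
    short final block is not left-padded with zeros."""
    n1 = min(amount, 16)
    n2 = min(max(amount - 16, 0), 8)
    n3 = min(max(amount - 24, 0), 4)
    return u128.to_bytes(n1, "big") + u64.to_bytes(n2, "big") + u32.to_bytes(n3, "big")
```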



&lt;p&gt;Because TigerBeetle uses double-entry bookkeeping, I will transfer all bytes from a system file “.” (debit side) to the desired file (credit side). This is extremely useful for audit purposes: &lt;code&gt;debits_posted&lt;/code&gt; on the system file Account must equal the sum of &lt;code&gt;credits_posted&lt;/code&gt; across all file Account records.&lt;/p&gt;

&lt;p&gt;As for getting data out of TigerBeetle, I can retrieve all credit Transfers for a specific Account. They are always correctly ordered by the &lt;code&gt;timestamp&lt;/code&gt; field, as &lt;a href="https://docs.tigerbeetle.com/coding/time/#timestamps-are-totally-ordered" rel="noopener noreferrer"&gt;guaranteed by TigerBeetle&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
        timestamp_min = 0

        while True:
            transfers = client.get_account_transfers(
                tb.AccountFilter(
                    account_id=file_id, flags=tb.AccountFilterFlags.CREDITS, limit=BULK, timestamp_min=timestamp_min
                )
            )
            for transfer in transfers:
                timestamp_min = transfer.timestamp
                ...

            if len(transfers) &amp;lt; BULK:
                break
            timestamp_min += 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With all that put together, I ran tests on some of the most valuable files I never wanted to lose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
(venv)  du -b ~/Downloads/homework.mp4
104718755 /home/aivarsk/Downloads/homework.mp4
(venv)  time ./tbcp ~/Downloads/homework.mp4 tb:backup.mp4

real 2m3.697s
user 1m4.408s
sys 0m1.568s

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, you can store your files durably at speeds of roughly 850 kB/s (104,718,755 bytes in 123.7 seconds). Now, let’s retrieve the file and store it on the disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
(venv)  time ./tbcp tb:backup.mp4 copy.mp4

real 0m47.588s
user 0m27.027s
sys 0m0.553s

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Downloading is considerably faster at roughly 2,200 kB/s! And of course, I verified that not a single bit was lost during the round trip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
(venv)  sha256sum ~/Downloads/homework.mp4
4ee75486c7c65a5c158f7f6b2ca6458195aa25b155b0688173b4b52583ce4cac /home/aivarsk/Downloads/homework.mp4
(venv)  sha256sum copy.mp4
4ee75486c7c65a5c158f7f6b2ca6458195aa25b155b0688173b4b52583ce4cac copy.mp4
(venv) 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to store your valuable files while guaranteeing durability even in the face of network, machine, and storage faults, &lt;a href="https://gist.github.com/aivarsk/2b26854c956e36fdfd73349586f2b168" rel="noopener noreferrer"&gt;here is the full source code&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tigerbeetle</category>
      <category>gl</category>
      <category>accounting</category>
    </item>
    <item>
      <title>Running TigerBeetle without a control plane database. Part one.</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Sat, 06 Dec 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/running-tigerbeetle-without-a-control-plane-database-part-one-16h4</link>
      <guid>https://forem.com/aivarsk/running-tigerbeetle-without-a-control-plane-database-part-one-16h4</guid>
      <description>&lt;p&gt;TigerBeetle is a database built for financial accounting, and the only record types available are &lt;a href="https://docs.tigerbeetle.com/reference/account/" rel="noopener noreferrer"&gt;Accounts&lt;/a&gt; and &lt;a href="https://docs.tigerbeetle.com/reference/transfer/" rel="noopener noreferrer"&gt;Transfers&lt;/a&gt;. That might be enough for the simplest accounting setup, but not for any realistic financial product.&lt;/p&gt;

&lt;p&gt;The way &lt;a href="https://docs.tigerbeetle.com/coding/system-architecture/" rel="noopener noreferrer"&gt;TigerBeetle solves that&lt;/a&gt; is by requiring an Online General Purpose (OLGP) database in the control plane that stores metadata and the mapping between TigerBeetle’s identifiers and the identifiers used by the rest of your systems. This can be done, and the documentation does a really good job of guiding you, but… what about the dual-write problem?&lt;/p&gt;

&lt;p&gt;Here’s an idea: what about running TigerBeetle without the control plane database? I am not saying you should do it, but I wanted to find out whether it is possible and what the best ways to do it are. This is a work in progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge
&lt;/h3&gt;

&lt;p&gt;It depends on the banking and payment card system, but often your account or card is not just a single physical account record. It is a “product” and a “product agreement” that link together multiple accounts, conditions, metadata, and other types of records, just to tell you what the current balance is. Years ago, I worked with something we called “analytical accounting.” We had separate accounts for purchases, cashout, refunds, credit/debit transfers, interest, and different kinds of fees. It was similar to what you can achieve with analytical systems, except it ran inside the accounting system with the same consistency guarantees.&lt;/p&gt;

&lt;p&gt;Accounts and Transfers in TigerBeetle are immutable. You can change credit and debit amounts by posting transactions. You can also update Account &lt;code&gt;flags&lt;/code&gt; and set or unset &lt;code&gt;AccountFlags.CLOSED&lt;/code&gt; by &lt;a href="https://docs.tigerbeetle.com/coding/recipes/close-account/" rel="noopener noreferrer"&gt;posting a pending transfer or voiding it&lt;/a&gt;. In practice, you often have more statuses for accounts to accept credits while rejecting debit operations, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;TigerBeetle gives us three fields where we can store arbitrary information for both Accounts and Transfers: &lt;code&gt;user_data_128&lt;/code&gt;, &lt;code&gt;user_data_64&lt;/code&gt;, and &lt;code&gt;user_data_32&lt;/code&gt;, holding 16, 8, and 4 bytes of information. There are other fields, like &lt;code&gt;ledger&lt;/code&gt; and &lt;code&gt;code&lt;/code&gt;, that could be repurposed, but generally &lt;code&gt;ledger&lt;/code&gt; should represent the ledger (and the currency) and &lt;code&gt;code&lt;/code&gt; should distinguish different account and transfer types. And there is one more field: the &lt;code&gt;id&lt;/code&gt; field itself (128 bits or 16 bytes), which gives us uniqueness checks out of the box.&lt;/p&gt;

&lt;p&gt;When your systems use integer IDs, it is a straightforward task. Most likely, you will have UUIDs, and it is easy to convert them to integers and back using Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; user_data_128 = uuid.uuid4().int
&amp;gt;&amp;gt;&amp;gt; uuid.UUID(int=user_data_128)
UUID('a1833b6a-a185-47aa-90ac-2f78979df3be')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have textual codes and identifiers, you can even store those with some encoding scheme:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; user_data_32 = int.from_bytes("EUR".encode())
&amp;gt;&amp;gt;&amp;gt; user_data_32.to_bytes(3).decode()
'EUR'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further, TigerBeetle provides &lt;a href="https://docs.tigerbeetle.com/reference/requests/lookup_accounts/" rel="noopener noreferrer"&gt;&lt;code&gt;lookup_accounts&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.tigerbeetle.com/reference/requests/lookup_transfers/" rel="noopener noreferrer"&gt;&lt;code&gt;lookup_transfers&lt;/code&gt;&lt;/a&gt; to retrieve Accounts and Transfers by the &lt;code&gt;id&lt;/code&gt; field. And there are &lt;a href="https://docs.tigerbeetle.com/reference/requests/query_accounts/" rel="noopener noreferrer"&gt;&lt;code&gt;query_accounts&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.tigerbeetle.com/reference/requests/query_transfers/" rel="noopener noreferrer"&gt;&lt;code&gt;query_transfers&lt;/code&gt;&lt;/a&gt; to query Accounts and Transfers by a combination of &lt;code&gt;user_data_128&lt;/code&gt;, &lt;code&gt;user_data_64&lt;/code&gt;, &lt;code&gt;user_data_32&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;Dealing with 1:1 relations is easy: just put our UUID in the &lt;code&gt;id&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;1:n relations are harder. First, let’s use &lt;code&gt;user_data_128&lt;/code&gt; to store our UUID and link together multiple accounts by having the same value in that field. Second, there might be multiple cards and accounts for the same client UUID. For that, you can store the counter per client in the &lt;code&gt;user_data_32&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;You can find all cards and accounts of a client by running a query on the &lt;code&gt;user_data_128&lt;/code&gt; field. After choosing the desired card/account, you can retrieve the whole TigerBeetle account set for the card/account product agreement by running a query on &lt;code&gt;user_data_128&lt;/code&gt; and &lt;code&gt;user_data_32&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import uuid
from enum import IntEnum

import tigerbeetle as tb

class Code(IntEnum):
    MAIN_ACCOUNT = 1
    LIMIT_ACCOUNT = 2
    EUR = 978
    STATUS = 9000

def register_new_account_agreement(client_id: uuid.UUID):
    with tb.ClientSync(cluster_id=0, replica_addresses=os.getenv("TB_ADDRESS", "3000")) as client:
        existing = client.query_accounts(
            tb.QueryFilter(user_data_128=client_id.int, ledger=Code.EUR, code=Code.MAIN_ACCOUNT, limit=100)
        )

        seq = (max(account.user_data_32 for account in existing) + 1) if existing else 1

        accounts = [
            tb.Account(
                id=tb.id(),
                user_data_128=client_id.int,
                user_data_32=seq,
                ledger=Code.EUR,
                code=Code.MAIN_ACCOUNT,
                flags=tb.AccountFlags.LINKED,
            ),
            tb.Account(
                id=tb.id(),
                user_data_128=client_id.int,
                user_data_32=seq,
                ledger=Code.EUR,
                code=Code.LIMIT_ACCOUNT,
            ),
        ]
        account_errors = client.create_accounts(accounts)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a race condition between finding the highest value of the sequence number and assigning it to an account. But that can be solved by running account creation from a single thread or by serialisation through locks in Redis or somewhere else.&lt;/p&gt;
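&lt;p&gt;As a minimal illustration of the single-writer idea, here is a process-local lock around sequence allocation. It is only a stand-in for a distributed lock in Redis, and the class and method names are made up for this sketch:&lt;/p&gt;

```python
import threading


class SequenceAllocator:
    """Serializes per-client sequence numbers so two concurrent
    registrations cannot pick the same seq (toy, single-process version)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = {}

    def allocate(self, client_id) -> int:
        # Holding the lock closes the gap between reading the highest
        # sequence number and assigning the next one.
        with self._lock:
            seq = self._next.get(client_id, 0) + 1
            self._next[client_id] = seq
            return seq
```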

&lt;p&gt;This solves the creation of agreements/account sets. But what about updates?&lt;/p&gt;

&lt;h3&gt;
  
  
  If all you have is a hammer, everything looks like a nail
&lt;/h3&gt;

&lt;p&gt;You can’t modify any of the Account fields, but you can post a Transfer with an &lt;code&gt;amount&lt;/code&gt; of &lt;code&gt;0&lt;/code&gt;, which has no financial impact, and store information in any of the Transfer’s user data fields. And to read back the current value, you can ask for the last Transfer of a specific type (&lt;code&gt;code&lt;/code&gt; field) by using &lt;code&gt;limit=1&lt;/code&gt; and the &lt;code&gt;tb.AccountFilterFlags.REVERSED&lt;/code&gt; flag. Like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        main, limit = accounts[0], accounts[1]
        transfer_errors = client.create_transfers(
            [
                tb.Transfer(
                    id=tb.id(),
                    debit_account_id=limit.id,
                    credit_account_id=main.id,
                    amount=0,
                    ledger=main.ledger,
                    user_data_128=0b1001001,
                    code=Code.STATUS,
                )
            ]
        )
        ...

        transfers = client.get_account_transfers(
            tb.AccountFilter(
                account_id=main.id,
                limit=1,
                code=Code.STATUS,
                flags=tb.AccountFilterFlags.CREDITS | tb.AccountFilterFlags.REVERSED,
            ),
        )
        print(transfers[0].user_data_128)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
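&lt;p&gt;The &lt;code&gt;0b1001001&lt;/code&gt; payload above is just an integer, so a bit-flag enum decodes it nicely. A sketch with made-up status bits — the names are assumptions for illustration, not part of TigerBeetle:&lt;/p&gt;

```python
from enum import IntFlag


class AccountStatus(IntFlag):
    # Hypothetical status bits packed into user_data_128 of a 0-amount Transfer
    ACTIVE = 0b0000001
    CREDITS_ALLOWED = 0b0001000
    DEBITS_BLOCKED = 0b1000000


# Decode the value posted in the example above
status = AccountStatus(0b1001001)
```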



&lt;p&gt;When one Transfer has too few fields, you can always post two, three, or more and retrieve the same number of Transfers to reconstruct the information. Not that you should do it, but if two extra fields are the only reason to introduce an OLGP database, you might choose to abuse TigerBeetle to achieve the same.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To be continued about building a transaction out of multiple Transfers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tigerbeetle</category>
      <category>accounting</category>
      <category>gl</category>
    </item>
    <item>
      <title>The lost art of semaphores</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Thu, 09 Oct 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/the-lost-art-of-semaphores-gl8</link>
      <guid>https://forem.com/aivarsk/the-lost-art-of-semaphores-gl8</guid>
      <description>&lt;p&gt;I am a huge fan of &lt;a href="https://man7.org/linux/man-pages/man7/svipc.7.html" rel="noopener noreferrer"&gt;System V Inter Process Communication primitives&lt;/a&gt;. There is some rawness and UNIX spirit to them. There is a newer and kinda “improved” version of those primitives named POSIX IPC. While there are a few things in POSIX IPC that can’t be done with System V IPC, most of the time it’s the other way around. Primarily due to the rawness of System V IPC. Let’s check the &lt;a href="https://man7.org/linux/man-pages/man7/sem_overview.7.html" rel="noopener noreferrer"&gt;POSIX semaphores&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sem_post&lt;/code&gt; can be used to release a semaphore (increment by 1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sem_wait&lt;/code&gt; can be used to acquire a semaphore (and decrement by 1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;System V IPC has a single call for that: &lt;a href="https://man7.org/linux/man-pages/man2/semop.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;semop&lt;/code&gt;&lt;/a&gt;. It can increment or decrement a semaphore by an arbitrary value. It also has the operation flags for each operation. And there, within flags, you can find one of the pearls of System V IPC - the &lt;code&gt;SEM_UNDO&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;What the &lt;code&gt;SEM_UNDO&lt;/code&gt; flag does is add the operation to an &lt;a href="https://github.com/torvalds/linux/blob/50c19e20ed2ef359cf155a39c8462b0a6351b9fa/ipc/sem.c#L2415" rel="noopener noreferrer"&gt;“undo list” within the kernel&lt;/a&gt;. Whenever the process terminates, whether from natural causes or brutally killed by &lt;code&gt;SIGTERM&lt;/code&gt;, &lt;code&gt;SIGKILL&lt;/code&gt;, the out-of-memory killer, or anything else, the kernel will revert the semaphore operation. Think about it: if your process acquires a semaphore and gets killed while holding it, it prevents other processes from ever acquiring the semaphore. With &lt;code&gt;SEM_UNDO&lt;/code&gt;, you choose what happens: if you used the semaphore as a counting semaphore, you can ask the kernel to release it automatically; if you acquired the semaphore to modify some shared resource, you can leave the semaphore stuck. It’s all up to you.&lt;/p&gt;

&lt;p&gt;Which brings me back to a previous topic of &lt;a href="https://aivarsk.com/2025/08/26/gunicorn-busy-workers/" rel="noopener noreferrer"&gt;tracking Gunicorn’s busy worker count&lt;/a&gt;. I used a semaphore there as a “reverse counting semaphore”: I released the semaphore (increment by 1) every time a process started and acquired the semaphore (decrement by 1) every time a process stopped. But Python’s &lt;code&gt;multiprocessing.Semaphore&lt;/code&gt; is a POSIX semaphore. When a worker gets killed by the OOM killer or dies, the semaphore is not decremented, and the worker count is incorrect.&lt;/p&gt;

&lt;p&gt;So I decided to build my own Python wrappers around &lt;a href="https://github.com/aivarsk/sysvipc-python" rel="noopener noreferrer"&gt;System V IPC&lt;/a&gt; to fix this issue and also make my other System V IPC projects more enjoyable. It’s more fun to use Python for quick tests than C++ code. With the library, here’s how you count Gunicorn’s busy workers.&lt;/p&gt;

&lt;p&gt;First, we have to create a new semaphore. It’s just a number that can be shared with others. The downside: you have to clean up manually by scheduling the removal of the semaphore when the main process terminates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import atexit
import sysvipc

sem = sysvipc.semget(sysvipc.IPC_PRIVATE, 1, 0o600)
atexit.register(lambda: sysvipc.semctl(sem, 0, sysvipc.IPC_RMID))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can read the current value of the semaphore at any time with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curval = sysvipc.semctl(sem, 0, sysvipc.GETVAL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then each process can increment the semaphore without confusing the reader with strange names like “unlock” or “post”. I also specify the &lt;code&gt;SEM_UNDO&lt;/code&gt; flag, and the kernel will apply &lt;code&gt;-1&lt;/code&gt; to the semaphore when the process terminates for any reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sysvipc.semop(sem, [(0, 1, sysvipc.SEM_UNDO)])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the process is done with the work, I decrement the semaphore. Again - no confusing names like “lock” or “wait”. The &lt;code&gt;SEM_UNDO&lt;/code&gt; will add &lt;code&gt;+1&lt;/code&gt; to the kernel’s semaphore adjustments and make the total adjustment &lt;code&gt;0&lt;/code&gt;. Past this point, when a process terminates, nothing will be subtracted from the semaphore value, and it will correctly represent the number of active workers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sysvipc.semop(sem, [(0, -1, sysvipc.SEM_UNDO)])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
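&lt;p&gt;To make the adjustment arithmetic concrete, here is a toy model of the kernel’s undo bookkeeping, in plain Python rather than real System V IPC: every &lt;code&gt;SEM_UNDO&lt;/code&gt; operation records the opposite delta per process, and the accumulated total is applied when the process dies.&lt;/p&gt;

```python
class UndoSemaphore:
    """Toy model of a System V semaphore with SEM_UNDO bookkeeping."""

    def __init__(self):
        self.value = 0
        self.semadj = {}  # per-process adjustment kept by the "kernel"

    def semop(self, pid, delta, undo=False):
        self.value += delta
        if undo:
            # SEM_UNDO: remember the opposite of the operation
            self.semadj[pid] = self.semadj.get(pid, 0) - delta

    def terminate(self, pid):
        # On process exit the kernel applies the accumulated adjustment
        self.value += self.semadj.pop(pid, 0)
```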



&lt;p&gt;And this is just the beginning: I need to write more &lt;a href="http://pybind11.com/" rel="noopener noreferrer"&gt;pybind11&lt;/a&gt; wrappers for System V IPC to unlock more goodies in Python.&lt;/p&gt;

</description>
      <category>python</category>
      <category>unix</category>
      <category>semaphore</category>
      <category>gunicorn</category>
    </item>
    <item>
      <title>Talking to payment cards over NFC</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Sun, 28 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/talking-to-payment-cards-over-nfc-56i7</link>
      <guid>https://forem.com/aivarsk/talking-to-payment-cards-over-nfc-56i7</guid>
      <description>&lt;p&gt;I had a great experience speaking about contactless payment cards at &lt;a href="https://bsideskrakow.pl/" rel="noopener noreferrer"&gt;BSides Krakow&lt;/a&gt;. For those who want to get their hands dirty:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/file/d/1sgm-PzaBobW9gHJ9kOBxmA0oexW6LaxW/view?usp=sharing" rel="noopener noreferrer"&gt;Slides are here&lt;/a&gt; and &lt;a href="https://gist.github.com/aivarsk/4eb5d1756b36989cde2c38ac4b95c050" rel="noopener noreferrer"&gt;here are the code snippets&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cards</category>
      <category>nfc</category>
      <category>emv</category>
      <category>hce</category>
    </item>
    <item>
      <title>Ring buffer in the database</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Tue, 23 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/ring-buffer-in-the-database-4n81</link>
      <guid>https://forem.com/aivarsk/ring-buffer-in-the-database-4n81</guid>
      <description>&lt;p&gt;We had a requirement to display the last N transactions on the ATM screen (”mini-statement”). The simplest solution is to keep the list of transactions, order them by date, and take the newest N transactions. But it gets tricky once you realize there are active customers making several transactions per day and inactive ones who use the card occasionally and might make a transaction every couple of months. Which means you have to preserve transaction records for a long period, and queries run more slowly. For a larger customer base, even half a year of data might be challenging to query in real time.&lt;/p&gt;

&lt;p&gt;Now we start thinking of keeping only the last N transactions per payment card and doing an &lt;code&gt;INSERT&lt;/code&gt; followed by a clever &lt;code&gt;DELETE&lt;/code&gt; that discards all extra transactions. In a regular programming language, a better solution would be to have a &lt;a href="https://en.wikipedia.org/wiki/Circular_buffer" rel="noopener noreferrer"&gt;ring buffer&lt;/a&gt; with a fixed capacity and a “write pointer” that points to where the newest record should be stored, overwriting the oldest one. To implement something similar in the database, we would need locking to prevent concurrent updates between retrieving the “write pointer”, storing the record, and advancing the “write pointer”.&lt;/p&gt;

&lt;p&gt;Years ago, I came up with a solution that I still have mixed feelings about. It is nice because it can be done with a single SQL statement, requires no explicit locking, and it works (in production as well). On the other hand, it is a bit fugly, my internal DBA demon complains about the size of the WAL records, and it’s weird. Old colleagues called that “shifter” because it works like the bit-shifting operations &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; and &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First, you create a model with as many message/transaction fields as you want to keep:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class History(models.Model):
    ...
    message1 = models.TextField(blank=True, null=True)
    message2 = models.TextField(blank=True, null=True)
    message3 = models.TextField(blank=True, null=True)
    message4 = models.TextField(blank=True, null=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time you want to store a new message, you assign it to the field &lt;code&gt;message1&lt;/code&gt;. Whatever value was in &lt;code&gt;message1&lt;/code&gt; you store in &lt;code&gt;message2&lt;/code&gt;, whatever value was in &lt;code&gt;message2&lt;/code&gt; you store in &lt;code&gt;message3&lt;/code&gt;, etc. The value of the last field, &lt;code&gt;message4&lt;/code&gt; in this case, is forgotten.&lt;/p&gt;
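&lt;p&gt;The shift works as a single &lt;code&gt;UPDATE&lt;/code&gt; because, in standard SQL, every right-hand side reads the row’s pre-update values. A self-contained sketch with SQLite, with table and column names mirroring the model above:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (id INTEGER PRIMARY KEY, "
    "message1 TEXT, message2 TEXT, message3 TEXT, message4 TEXT)"
)
conn.execute("INSERT INTO history (id) VALUES (1)")


def push(conn, msg):
    # All assignments see the old row values, so this shifts atomically
    # with no explicit locking.
    conn.execute(
        "UPDATE history SET message4 = message3, message3 = message2, "
        "message2 = message1, message1 = ? WHERE id = 1",
        (msg,),
    )


for m in ["first", "second", "third", "fourth", "fifth"]:
    push(conn, m)

row = conn.execute(
    "SELECT message1, message2, message3, message4 FROM history WHERE id = 1"
).fetchone()
# "first" has been shifted out of the four-slot buffer
```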

&lt;p&gt;And here’s how it works. We start with an empty ring buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; History.objects.get(pk=1). __dict__
{'_state': &amp;lt;django.db.models.base.ModelState object at 0x721f90e8bf50&amp;gt;, 'id': 1, 'message1': None, 'message2': None, 'message3': None, 'message4': None}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We add the first message and verify it is stored correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; History.objects.filter(pk=1).update(message4=F("message3"), message3=F("message2"), message2=F("message1"), message1="first message")
&amp;gt;&amp;gt;&amp;gt; History.objects.get(pk=1).__dict__
{'_state': &amp;lt;django.db.models.base.ModelState object at 0x721f90e8bfe0&amp;gt;, 'id': 1, 'message1': 'first message', 'message2': None, 'message3': None, 'message4': None}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We add the second message and verify it is stored correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; History.objects.filter(pk=1).update(message4=F("message3"), message3=F("message2"), message2=F("message1"), message1="second message")
&amp;gt;&amp;gt;&amp;gt; History.objects.get(pk=1).__dict__
{'_state': &amp;lt;django.db.models.base.ModelState object at 0x721f90e8bfb0&amp;gt;, 'id': 1, 'message1': 'second message', 'message2': 'first message', 'message3': None, 'message4': None}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We add the third message and verify it is stored correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; History.objects.filter(pk=1).update(message4=F("message3"), message3=F("message2"), message2=F("message1"), message1="third message")
&amp;gt;&amp;gt;&amp;gt; History.objects.get(pk=1).__dict__
{'_state': &amp;lt;django.db.models.base.ModelState object at 0x721f90ea8110&amp;gt;, 'id': 1, 'message1': 'third message', 'message2': 'second message', 'message3': 'first message', 'message4': None}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We add the fourth message and verify it is stored correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; History.objects.filter(pk=1).update(message4=F("message3"), message3=F("message2"), message2=F("message1"), message1="fourth message")
&amp;gt;&amp;gt;&amp;gt; History.objects.get(pk=1).__dict__
{'_state': &amp;lt;django.db.models.base.ModelState object at 0x721f90ea81a0&amp;gt;, 'id': 1, 'message1': 'fourth message', 'message2': 'third message', 'message3': 'second message', 'message4': 'first message'}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We add the fifth message and verify it is stored correctly and the first message has been discarded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; History.objects.filter(pk=1).update(message4=F("message3"), message3=F("message2"), message2=F("message1"), message1="fifth message")
&amp;gt;&amp;gt;&amp;gt; History.objects.get(pk=1).__dict__
{'_state': &amp;lt;django.db.models.base.ModelState object at 0x721f90ea82f0&amp;gt;, 'id': 1, 'message1': 'fifth message', 'message2': 'fourth message', 'message3': 'third message', 'message4': 'second message'}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
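&lt;p&gt;The single-statement rotation works because, on most databases (MySQL is a notable exception that evaluates &lt;code&gt;SET&lt;/code&gt; clauses left to right), every &lt;code&gt;SET&lt;/code&gt; expression is evaluated against the pre-update row, so all the &lt;code&gt;F()&lt;/code&gt; references read the old values. A plain-Python sketch of those semantics (the &lt;code&gt;shift_history&lt;/code&gt; helper is illustrative, not part of the model above):&lt;/p&gt;

```python
def shift_history(row, new_message):
    """Emulate the single UPDATE: every right-hand side reads the
    pre-update row, so the columns rotate instead of clobbering."""
    old = dict(row)  # snapshot, like SQL evaluating all SET clauses against the old row
    return {
        "message1": new_message,
        "message2": old["message1"],
        "message3": old["message2"],
        "message4": old["message3"],  # the old message4 falls off the end
    }

row = {"message1": "fourth message", "message2": "third message",
       "message3": "second message", "message4": "first message"}
shifted = shift_history(row, "fifth message")
print(shifted["message1"], "/", shifted["message4"])
```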



&lt;p&gt;So? Is this stupid or smart? After all these years I still have mixed feelings about it.&lt;/p&gt;

</description>
      <category>orm</category>
      <category>django</category>
      <category>database</category>
    </item>
    <item>
      <title>Tracking Gunicorn's busy worker count</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Tue, 26 Aug 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/tracking-gunicorns-busy-worker-count-3g0i</link>
      <guid>https://forem.com/aivarsk/tracking-gunicorns-busy-worker-count-3g0i</guid>
      <description>&lt;p&gt;I was investigating performance issues of a Django application running with Gunicorn behind a Nginx server. First, I added more timing information to Nginx access.log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log_format timing '$remote_addr - $remote_user [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_referer" "$http_user_agent" rt="$request_time" uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';

access_log /var/log/nginx/access.log timing;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the Nginx reload, it started to report the total request time and the time spent waiting for a response from Gunicorn. I also checked the timing in Chrome developer tools. All the times matched, which meant neither network latency nor Nginx was to blame.&lt;/p&gt;

&lt;p&gt;However, for a specific URL, response times ranged from 3 to 22 seconds. The Gunicorn access log already contained the time spent processing each request, thanks to &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#access-log-format" rel="noopener noreferrer"&gt;a custom access log format string&lt;/a&gt;. And within Gunicorn, those requests took less than a second. It was clear that requests get buffered between Nginx and Gunicorn in the &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#backlog" rel="noopener noreferrer"&gt;connection backlog&lt;/a&gt;, and Gunicorn needs &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#worker-processes" rel="noopener noreferrer"&gt;more workers to process the requests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But how do you find out how many Gunicorn workers are being used, how many are idle, and how do you monitor that? I did not find a good answer. However, Gunicorn has hooks that are called &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#pre-request" rel="noopener noreferrer"&gt;before a request is processed by the worker&lt;/a&gt; and &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#post-request" rel="noopener noreferrer"&gt;after it has been processed&lt;/a&gt;. I could use those to maintain the total number of active requests. But how to do that across multiple processes and avoid lost updates?&lt;/p&gt;

&lt;p&gt;Meet the &lt;a href="https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Semaphore" rel="noopener noreferrer"&gt;multiprocessing Semaphore&lt;/a&gt;! It is a number living in OS kernel memory: acquiring a semaphore decrements its value, releasing a semaphore increments it, and a semaphore can’t become negative (acquiring will block). Normally it is used for synchronization, but I will use it as an atomic gauge: release to increase it and acquire to decrease it.&lt;/p&gt;
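&lt;p&gt;The gauge trick in isolation looks like this (a minimal sketch; note that &lt;code&gt;get_value()&lt;/code&gt; is not implemented on macOS):&lt;/p&gt;

```python
from multiprocessing import Semaphore

busy = Semaphore(0)           # gauge starts at zero busy workers

busy.release()                # a request was picked up -> gauge goes to 1
busy.release()                # another request         -> gauge goes to 2
in_flight = busy.get_value()  # atomic read of the gauge (not implemented on macOS)
busy.acquire()                # a request finished      -> gauge back to 1

print(in_flight, busy.get_value())
```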

&lt;p&gt;Another trick I discovered: the Gunicorn access log can log request headers. So instead of adding custom logs, I added a new HTTP header to the request and stored the counter value in it. Here is the complete solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from multiprocessing import Semaphore

accesslog = "-"
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" rt=%(L)s busy=%({x-busy}i)s'

busy = Semaphore(0)

def pre_request(worker, req):
    busy.release()
    req.headers.append(("x-busy", str(busy.get_value())))

def post_request(worker, req, environ, resp):
    busy.acquire()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>gunicorn</category>
      <category>django</category>
    </item>
    <item>
      <title>Monolith First</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Tue, 08 Jul 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/monolith-first-15ph</link>
      <guid>https://forem.com/aivarsk/monolith-first-15ph</guid>
      <description>&lt;p&gt;Many go to Martin Fowler for microservice architecture, distributed systems, micro frontends, event sourcing, and other fancy ideas about architecture, but &lt;a href="https://martinfowler.com/bliki/MonolithFirst.html" rel="noopener noreferrer"&gt;a few have noticed the advice to do a monolith first&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As I hear stories about teams using a microservices architecture, I’ve noticed a common pattern.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Almost all the successful microservice stories have started with a monolith that got too big and was broken up&lt;/li&gt;
&lt;li&gt;Almost all the cases where I’ve heard of a system that was built as a microservice system from scratch, it has ended up in serious trouble.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pattern has led many of my colleagues to argue that &lt;strong&gt;you shouldn’t start a new project with microservices, even if you’re sure your application will be big enough to make it worthwhile.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>monolith</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Transactional task outbox in Django with django-taskq</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/transactional-task-outbox-in-django-with-django-taskq-4801</link>
      <guid>https://forem.com/aivarsk/transactional-task-outbox-in-django-with-django-taskq-4801</guid>
      <description>&lt;p&gt;We have given up on distributed transactions (2PC) but have not given up working with multiple resources like the database, message brokers, and queues. Instead, everybody tries to build their own atomic operations over multiple resources. Some do code&amp;amp;pray, and others try to have a database as a source of truth and with &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;the transactional outbox pattern&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/django-taskq/" rel="noopener noreferrer"&gt;django-taskq&lt;/a&gt; uses the Django database as the one and only backend for storing tasks (function calls with parameters). Because it uses the Django ORM under the hood, it also obeys Django transactions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django_taskq.celery import shared_task

@shared_task(queue="kafka-events", autoretry_for=(Exception,))
def something_happened(*, key: str, value: str):
    ...

with transaction.atomic():
    model1.save()
    model2.save()
    something_happened.delay(key=str(uuid.uud4()), value=payload.model_dump_json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Either all models are updated and a new task is scheduled, or none of it happens. And when the task runs, the model changes are already in the database. I guess all of us have a Celery story about similar code with tasks executing before the changes were committed and visible in the database. At that point everybody starts using &lt;code&gt;transaction.on_commit&lt;/code&gt;, which works most of the time, as long as the broker keeps running, the network to the broker is reliable, and neither Redis nor the application gets killed by the OOM killer.&lt;/p&gt;

&lt;p&gt;It still does not prevent failures while the task is executing, nor does it make the task idempotent. But at least the model changes and the task are recorded atomically, and the task’s failure or success will be recorded in the database.&lt;/p&gt;
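&lt;p&gt;The atomicity here is nothing more than a database transaction. A toy sketch with sqlite3 standing in for the Django database (the &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;outbox&lt;/code&gt; tables are made up for illustration):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, task TEXT)")

try:
    with conn:  # one transaction: the model change and the task row commit together
        conn.execute("INSERT INTO orders (payload) VALUES (?)", ("order-1",))
        conn.execute("INSERT INTO outbox (task) VALUES (?)", ("something_happened",))
        raise RuntimeError("crash before commit")
except RuntimeError:
    pass

# The crash rolled back both inserts: no order without a task, no task without an order.
orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
tasks = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(orders, tasks)
```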

</description>
    </item>
    <item>
      <title>A Philosophy of Software Design</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/a-philosophy-of-software-design-2o5p</link>
      <guid>https://forem.com/aivarsk/a-philosophy-of-software-design-2o5p</guid>
      <description>&lt;p&gt;In a world full of Uncles Bob &lt;a href="https://www.youtube.com/watch?v=lz451zUlF-k" rel="noopener noreferrer"&gt;be John Ousterhout&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/johnousterhout/aposd-vs-clean-code" rel="noopener noreferrer"&gt;More of method length, comment and TDD discussion&lt;/a&gt; and &lt;em&gt;A Philosophy of Software Design&lt;/em&gt; should be on every bookshelf. It has &lt;a href="https://www.youtube.com/watch?v=4xqkI953K6Y" rel="noopener noreferrer"&gt;some critcism&lt;/a&gt; but so does everything.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Serializable isolation level and transaction processing</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Wed, 25 Jun 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/serializable-isolation-level-and-transaction-processing-4pm</link>
      <guid>https://forem.com/aivarsk/serializable-isolation-level-and-transaction-processing-4pm</guid>
      <description>&lt;p&gt;While on the topic of &lt;a href="https://www.tpc.org/tpcc/results/tpcc_results5.asp" rel="noopener noreferrer"&gt;On-Line Transaction Processing Benchmarks&lt;/a&gt;, it’s interesting to observe the strategies companies employ to achieve optimal results. All code for both the transaction monitor and database is available in the PDF report. Let’s look at the &lt;a href="https://www.tpc.org/results/fdr/tpcc/Oracle_SPARC_SuperCluster_with_T3-4s_TPC-C_FDR_120210.pdf" rel="noopener noreferrer"&gt;Oracle one that uses Oracle Tuxedo and the Oracle database&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s a lot of cryptic C code using the OCI interface and PL/SQL code blocks. But you won’t find any signs of pessimistic (&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;) or optimistic (&lt;code&gt;UPDATE ... WHERE version=:version&lt;/code&gt;) locking. How so? How can they achieve correctness without explicit locking?&lt;/p&gt;

&lt;p&gt;Oracle decided they could achieve the best results by using a serializable transaction isolation level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER SESSION SET ISOLATION_LEVEL = SERIALIZABLE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/cncpt/data-concurrency-and-consistency.html#GUID-8DA9A191-4CA3-4B1A-995F-4B17471C2738" rel="noopener noreferrer"&gt;Serializable isolation level&lt;/a&gt; in Oracle database means database will detect concurrent changes to table rows and return “ORA-08177: Cannot serialize access for this transaction”. This is a bit similar to optimistic locking except the database doing it for you. And then you can add retries in your code that repeats all updates until success. Like this payment code from TPC-C:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE /* payz */
    not_serializable EXCEPTION;
    PRAGMA EXCEPTION_INIT(not_serializable,-8177);
    deadlock EXCEPTION;
    PRAGMA EXCEPTION_INIT(deadlock,-60);
    snapshot_too_old EXCEPTION;
    PRAGMA EXCEPTION_INIT(snapshot_too_old,-1555);
BEGIN
    LOOP
        BEGIN
            UPDATE cust
                SET c_balance = c_balance - :h_amount,
                c_ytd_payment = c_ytd_payment + :h_amount,
                c_payment_cnt = c_payment_cnt + 1
                WHERE ...;
            UPDATE dist
                SET d_ytd = d_ytd + :h_amount
                WHERE ...;
            EXIT;
        EXCEPTION WHEN not_serializable OR deadlock OR snapshot_too_old THEN
            ROLLBACK;
            :retry := :retry + 1;
        END;
    END LOOP;
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I used this approach in an early version of an accounting system. But performance tests on hot accounts were so bad that I had to give up. The reasons are similar to what &lt;a href="https://youtu.be/GEkeOHw87Sg?si=urUjkVKqeoaIscu7&amp;amp;t=944" rel="noopener noreferrer"&gt;Cliff Click shared about his experience with Hardware Transactional Memory&lt;/a&gt; and “perf counters” / “mod counters” leading to transaction retries.&lt;/p&gt;

&lt;p&gt;With a serializable isolation level, the database is ignorant of what kind of changes your application makes. All updates are equally important to the database. But from the application’s point of view, many changes are commutative, and you don’t care in what order they happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;incrementing transaction count&lt;/li&gt;
&lt;li&gt;increasing account balance&lt;/li&gt;
&lt;li&gt;decreasing account balance as long as it does not become negative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the database is not aware of that. When your transaction starts with a count of 42 and some other transaction manages to increment it before you, you get &lt;code&gt;ORA-08177&lt;/code&gt; and have to retry. The database sees that some bits have changed, and that’s all it cares about. But I don’t care whether the transaction count is 43, 44, or 89 after I increment it, as long as the current (whatever) value is incremented by 1.&lt;/p&gt;

&lt;p&gt;The serializable transaction isolation level can only be faster in TPC-C tests when contention is relatively low. I settled on relative updates, and sometimes optimistic locking for accounts where the balance was decreased or was used to calculate fees and interest based on the exact current value. The system was doing several entries per authorization, along with cryptography, card checks, and network messages, at &lt;a href="https://blogs.oracle.com/solaris/post/tieto-card-suite-optimized-on-oracle-supercluster" rel="noopener noreferrer"&gt;5,000 authorizations per second&lt;/a&gt;.&lt;/p&gt;
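&lt;p&gt;A relative update pushes the arithmetic into the database, so the current value never has to be read first. A sketch with sqlite3 for illustration (the real system used Oracle; table and column names are invented):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, tx_count INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100, 0)")

# Commutative updates: increment/decrement relative to whatever the current
# value is, with a guard so the balance cannot go negative.
ok = conn.execute(
    "UPDATE account SET balance = balance - ?, tx_count = tx_count + 1 "
    "WHERE id = ? AND balance >= ?", (30, 1, 30)).rowcount
rejected = conn.execute(
    "UPDATE account SET balance = balance - ?, tx_count = tx_count + 1 "
    "WHERE id = ? AND balance >= ?", (200, 1, 200)).rowcount

balance, tx_count = conn.execute(
    "SELECT balance, tx_count FROM account WHERE id = 1").fetchone()
print(ok, rejected, balance, tx_count)  # 1 0 70 1
```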

</description>
    </item>
    <item>
      <title>A broker-less distributed messaging system from the previous century</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Sun, 22 Jun 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/a-broker-less-distributed-messaging-system-from-the-previous-century-2ebg</link>
      <guid>https://forem.com/aivarsk/a-broker-less-distributed-messaging-system-from-the-previous-century-2ebg</guid>
      <description>&lt;p&gt;When examining the &lt;a href="https://www.tpc.org/tpcc/results/tpcc_results5.asp" rel="noopener noreferrer"&gt;On-Line Transaction Processing Benchmark&lt;/a&gt;, most people focus on the performance numbers and the database software. But there is another column named “TP Monitor” that lists the transaction monitor software. Before cloud-scale systems took over, the best performance numbers were achieved with Oracle Tuxedo (or BEA Tuxedo, before Oracle acquired it). The good results of Oracle database and Tuxedo led my previous company to choose them as the basis for payment card software in the late 1990s and early 2000s.&lt;/p&gt;

&lt;p&gt;While Oracle Tuxedo is proprietary software, &lt;a href="https://pubs.opengroup.org/onlinepubs/009649399/toc.pdf" rel="noopener noreferrer"&gt;The XATMI specification&lt;/a&gt; is public. The main building block is an RPC call: you call a service by its name (&lt;code&gt;svc&lt;/code&gt;), pass some data to it (&lt;code&gt;idata&lt;/code&gt;, &lt;code&gt;ilen&lt;/code&gt;), and receive data back (&lt;code&gt;odata&lt;/code&gt;, &lt;code&gt;olen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int tpcall(char *svc, char *idata, long ilen, char **odata, long *olen, long flags);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, it’s all just a wrapper around two API calls: one for sending a request to the service (&lt;code&gt;tpacall&lt;/code&gt;) and the other one for waiting for a response (&lt;code&gt;tpgetrply&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int tpacall(char *svc, char *data, long len, long flags);
int tpgetrply(int *cd, char **data, long *len, long flags);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And yes - the creators decided to skip the letter ‘e’ in &lt;code&gt;tpgetrply&lt;/code&gt; while still having longer API names like &lt;code&gt;tpadvertise&lt;/code&gt;, &lt;code&gt;tpconnect&lt;/code&gt; and &lt;code&gt;tpunadvertise&lt;/code&gt;. But unlike later specifications such as &lt;a href="https://en.wikipedia.org/wiki/Common_Object_Request_Broker_Architecture" rel="noopener noreferrer"&gt;CORBA&lt;/a&gt;, it made it very clear which calls were RPC, and it forced you to handle failures and timeouts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt;, the API itself is not that interesting. You can implement the API but still have a shitty XATMI implementation. Or you can do it like Tuxedo did and have something simple and elegant. So let’s look into that, and maybe you can take some design lessons out of it.&lt;/p&gt;

&lt;p&gt;Tuxedo was developed by AT&amp;amp;T along with UNIX, so it used &lt;a href="https://en.wikipedia.org/wiki/UNIX_System_V#SVR1" rel="noopener noreferrer"&gt;System V inter-process message queues, semaphores, and shared memory&lt;/a&gt;. Message queues are the foundation of Tuxedo, and the most important function calls are &lt;code&gt;msgsnd&lt;/code&gt;, which just copies data into kernel space, and &lt;code&gt;msgrcv&lt;/code&gt;, which copies it back. &lt;a href="https://www.tuhs.org/cgi-bin/utree.pl?file=pdp11v/usr/src/uts/pdp11/os/msg.c" rel="noopener noreferrer"&gt;It’s really that simple&lt;/a&gt;. But because the message queue lives in the kernel, messages live there as long as the kernel is running. Senders and receivers can come and go, and the messages will stay. There is no “broker” process, as we expect nowadays, that has to keep running or persist messages to storage. The kernel is the broker.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://docs.nats.io/nats-concepts/core-nats/reqreply" rel="noopener noreferrer"&gt;request-reply pattern&lt;/a&gt; is built by the caller having its own response queue and the callee having a request queue. Each request message includes the response queue identifier where the response should be stored. Tuxedo implements timeouts waiting for the response, handles (ignores) late responses, and does other housekeeping.&lt;/p&gt;

&lt;p&gt;Now, queues are identified by a number, but Tuxedo works with nice service names. To map service names to queue identifiers, Tuxedo uses shared memory. Again, the shared memory is kept alive by the kernel and outlives all processes, while every process can access it to look up a service name’s queue identifier. Like a serverless name server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffduukeaop69lou9wbr4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffduukeaop69lou9wbr4n.png" alt="Local Tuxedo" width="659" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To put it all together: the caller, process#1, looks up the service name “FOO”. When “FOO” is not found or the queue does not exist, you get an error. When the queue is full, you can either wait or fail, depending on the call mode. Once the request message is added to the queue, process#1 proceeds to wait for a response. When the response is not received within the timeout, you get an error. On the callee side, process#2 polls requests from queue#3. Once a request is received, it does the work and puts the response into the reply queue named in the request (queue#4).&lt;/p&gt;

&lt;p&gt;Now what about the “distributed” part? Instead of adding a new transport for the service call, Tuxedo introduces gateways that connect multiple machines. On the caller machine, the gateway says it provides the “FOO” service. Once it receives the request, it forwards it using whatever transport protocol to the gateway on the other machine. On the other machine, the gateway acts as a caller and calls process#2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhamv5c7uvcdmn9jdtx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhamv5c7uvcdmn9jdtx2.png" alt="Distributed Tuxedo" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple and nice, isn’t it?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Friday's "old man yells at cloud" moment</title>
      <dc:creator>Aivars Kalvāns</dc:creator>
      <pubDate>Fri, 06 Jun 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/aivarsk/my-fridays-old-man-yells-at-cloud-moment-4joe</link>
      <guid>https://forem.com/aivarsk/my-fridays-old-man-yells-at-cloud-moment-4joe</guid>
      <description>&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=0WYgKc00J8s&amp;amp;t=2096s" rel="noopener noreferrer"&gt;Casey Muratori had this to say&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You should never take library design advice from anyone who hasn’t had to make a living selling a library in a competitive arena.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I will rephrase it slightly and put it on the wall:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“You should never take software design advice from anyone who hasn’t had to make a living selling software in a competitive arena.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Never take” might be too strong, but at least be skeptical. Too much advice comes from people who get paid hourly, or whose project takes as long as it takes and costs as much as it does. Or who move on to the next shiny project as soon as it’s done and never look back. In a way, their assumptions about development, maintenance, costs, and extensibility never get challenged or measured. But but but the “developer performance”… compared to what?&lt;/p&gt;

&lt;p&gt;Selling software or running it for years is the reality check. Do your methodology and “tech stack” leave room for a profit margin? Do your state machines, event logs, and databases allow you to recover from bugs and incidents? Why did extensive testing miss those bugs? How often do you change the database, the XML/JSON parser, or the (G)UI? Or the things you put into configuration files because you might want to modify them later? How easy is it to modify and implement new features in the clean, unit-tested, OCP, micro-service, multi-AZ code?&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
