<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex B</title>
    <description>The latest articles on Forem by Alex B (@buchanae).</description>
    <link>https://forem.com/buchanae</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3695040%2F575bc9e2-53d5-4ef8-be9f-bc24f78b5f25.png</url>
      <title>Forem: Alex B</title>
      <link>https://forem.com/buchanae</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/buchanae"/>
    <language>en</language>
    <item>
      <title>Adding an outbox, mail, tokens, CSRF, and more</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Thu, 05 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/adding-an-outbox-mail-tokens-csrf-and-more-43g</link>
      <guid>https://forem.com/buchanae/adding-an-outbox-mail-tokens-csrf-and-more-43g</guid>
      <description>&lt;p&gt;This post is a grabbag of updates from the last couple weeks, including a background task system, sending email, cross-origin request forgery protection, one-time tokens, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  outbox
&lt;/h3&gt;

&lt;p&gt;Applications usually need a way to do work in the background. One reason for this is that it's helpful to avoid lots of work within the lifecycle of an HTTP request, especially when the work to be done depends on external services. For example, if registering a new user requires writing records to the database, sending an email, setting up a billing account, fetching/generating an avatar image, and provisioning some other resources, a failure in any one of those makes for very complex and fragile error handling – it works most of the time, until it doesn't, and then you can have a real mess on your hands.&lt;/p&gt;

&lt;p&gt;That's why I prefer to have HTTP requests do as little work as possible, interacting with as few dependencies as possible, ideally just the database. The extra work of integrating with other dependencies is offloaded to a background processing system.&lt;/p&gt;

&lt;p&gt;There are lots of ways to get that work submitted to the background system, and lots of different job queue options to choose from.&lt;/p&gt;

&lt;p&gt;For atlas9, I'm starting with a simple &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;outbox&lt;/a&gt; pattern, where records are written to the database and to an "outbox" within the same database transaction. For example, I use this when registering a new user to create the user record, write a &lt;code&gt;user.registered&lt;/code&gt; event, and create a &lt;code&gt;send verification email&lt;/code&gt; task, all within the same database transaction. Having all the information recorded in the same transaction makes reasoning about failures easy – if anything fails, everything fails.&lt;/p&gt;

&lt;p&gt;A worker runs in the background, picks up the &lt;code&gt;send verification email&lt;/code&gt; task from the database, and sends the email. If emails were failing to send for some reason, it wouldn't affect user registration, and the task could be retried. The system supports multiple workers concurrently consuming the tasks, handling retries and failures, leases, idempotency keys, partition keys, and more.&lt;/p&gt;

&lt;p&gt;The outbox interfaces look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// (pseudocode: proper error handling omitted from this code for brevity) 

type Emitter interface {
    Emit(ctx context.Context, eventType, partitionKey string, payload []byte, opts *EmitOptions) error
}

func RegisterUser(ctx context.Context, email, password string) {

  // Start a writeable database transaction
  dbi.ReadWrite(ctx, s.DB, func(tx dbi.DBI) (core.ID, error) {

    users := s.Users(tx)
    passwords := s.Passwords(tx)
    outbox := s.Outbox(tx)
    user := &amp;amp;iam.User{Email: email}
    created := users.Save(ctx, user)

    payload := json.Marshal(UserRegisteredEvent{userID, email})
    outbox.Emit(ctx, "user.registered", userID.String(), payload, nil)
  })
}

type UserRegisteredEvent struct {
    UserID core.ID
    Email string
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That writes an event to the &lt;code&gt;outbox&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqlite&amp;gt; select seq, event_type, partition_key, payload from outbox;
seq event_type partition_key payload
--- --------------- -------------------------- -----------------------------------------------------------------
1 user.registered 06EBYDDXKFEYMKQY7CE8DF2VHM {"UserID":"06EBYDDXKFEYMKQY7CE8DF2VHM","Email":"test@eabuc.com"}
2 user.registered 06EBYDNG5PHC4E8BP741DQ5TWC {"UserID":"06EBYDNG5PHC4E8BP741DQ5TWC","Email":"test2@eabuc.com"}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Events record what happened. Events may be consumed multiple times by different consumers. This allows the "user.registered" event to be processed independently by multiple systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Consumer interface {
    Claim(ctx context.Context, consumer string, leaseDuration time.Duration, limit int) ([]Claimed, error)

    Complete(ctx context.Context, seq int64, consumer string) error

    // Nack requeues a task for retry.
    // If retryAfter is non-nil, the task won't be claimed again until after that time.
    Nack(ctx context.Context, seq int64, consumer string, retryAfter *time.Time) error

    // Fail permanently marks a task as failed, stopping all future attempts.
    Fail(ctx context.Context, seq int64, consumer string) error
}

// Claimed pairs an event with its task.
type Claimed struct {
    Event Event
    Task Task
}

// Task represents an event assigned to a consumer.
type Task struct {
    Seq int64
    Consumer string
    PartitionKey string
    Status Status
    ProcessAfter *time.Time
    LeaseUntil *time.Time
    CompletedAt *time.Time
    Attempts int
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going back to the earlier example, there may be independent consumers of the &lt;code&gt;user.registered&lt;/code&gt; event for "sending verification email", "setting up billing", "fetching avatar image", "provisioning resources", etc.&lt;/p&gt;
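
&lt;p&gt;To make the consumer side concrete, here's a rough sketch of a worker loop built on these interfaces, in the same pseudocode spirit as above – &lt;code&gt;sendWelcomeEmail&lt;/code&gt; is a hypothetical handler, and the real worker also handles backoff, shutdown, error handling, etc.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// (sketch: error handling and shutdown omitted)
for {
    // Claim up to 10 tasks for this consumer, leased for 30 seconds
    claimed, _ := consumer.Claim(ctx, "welcome_email", 30*time.Second, 10)

    for _, c := range claimed {
        // sendWelcomeEmail is a hypothetical handler for the event payload
        if err := sendWelcomeEmail(ctx, c.Event); err != nil {
            // requeue with a delay; the task won't be claimed again until then
            retryAfter := time.Now().Add(time.Minute)
            consumer.Nack(ctx, c.Task.Seq, "welcome_email", &amp;amp;retryAfter)
            continue
        }
        consumer.Complete(ctx, c.Task.Seq, "welcome_email")
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;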

&lt;p&gt;When an event is recognized by a consumer, a "task" is created, which tracks that consumer's progress on the event: its status, attempts, lease, and completion time.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;outbox_deliveries&lt;/code&gt; table maps events to consumers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqlite&amp;gt; select * from outbox_deliveries ;
seq consumer partition_key status process_after lease_until completed_at attempts
--- ------------- -------------------------- --------- ------------- ------------------- ------------------- --------
1 welcome_email 06EBYDDXKFEYMKQY7CE8DF2VHM completed 2026-03-05 18:15:28 2026-03-05 18:14:58 1
2 welcome_email 06EBYDNG5PHC4E8BP741DQ5TWC completed 2026-03-05 18:16:28 2026-03-05 18:15:58 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a decent first draft that captures some of the design elements I'm looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;write to an event log&lt;/li&gt;
&lt;li&gt;multiple consumers&lt;/li&gt;
&lt;li&gt;retries, leases, delayed execution (ProcessAfter)&lt;/li&gt;
&lt;li&gt;partitions (isolate work by tenant, for example)&lt;/li&gt;
&lt;li&gt;FIFO within a partition&lt;/li&gt;
&lt;li&gt;fairness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are some of the things I've typically needed/wanted from a message queue – things that make it easy to reason about behavior, overload, failures, ordering, etc.&lt;/p&gt;

&lt;p&gt;There are plenty of rough edges here to figure out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event log and task cleanup (delete old rows)&lt;/li&gt;
&lt;li&gt;rename some things (e.g. &lt;code&gt;outbox_deliveries&lt;/code&gt; should be &lt;code&gt;tasks&lt;/code&gt; probably)&lt;/li&gt;
&lt;li&gt;rethink the consumer, worker, and task interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://atlas9.dev/src/core/outbox/" rel="noopener noreferrer"&gt;atlas9.dev/core/outbox/&lt;/a&gt; package contains implementations for sqlite and postgres.&lt;/p&gt;

&lt;h3&gt;
  
  
  mail
&lt;/h3&gt;

&lt;p&gt;Speaking of email, &lt;code&gt;atlas9.dev/c/mail&lt;/code&gt; contains code for sending mail in various ways. It has an SMTP implementation, and an AWS Simple Email Service (SES) wrapper.&lt;/p&gt;

&lt;p&gt;It's tempting to think, "AWS SES is super reliable, I don't need a background job", and you'd be mostly right – SES is unlikely to fail (although I'm sure it does) – but failures come from all sorts of directions: networking, config, bugs in code, etc. Having work captured in a task that can be retried helps avoid having to clean up messes when things go haywire.&lt;/p&gt;

&lt;p&gt;This isn't always necessary, of course. For example, in the demo app, the "resend verification email" and "request password reset" actions don't bother with a background task, they just send the email directly – if these fail, the user can just click the button again.&lt;/p&gt;

&lt;p&gt;The demo app has gained some features: a verification email, password reset, user profiles, and an (untested) GitHub OAuth provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  tokens
&lt;/h3&gt;

&lt;p&gt;There seem to be a lot of different uses for short-lived, single-use tokens: verify email, reset password, OIDC auth, magic link login, multifactor auth, etc.&lt;/p&gt;

&lt;p&gt;So I added the &lt;a href="https://atlas9.dev/src/core/tokens/" rel="noopener noreferrer"&gt;atlas9.dev/c/core/tokens&lt;/a&gt; package to help create, store, and retrieve such tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Token[T any] struct {
    Key string
    ExpiresAt time.Time
    Data T
}

type Store[T any] interface {
    // Put stores data under the given key with a configured expiration.
    // If the key already exists, it is replaced (upsert).
    Put(ctx context.Context, key string, data T) (*Token[T], error)

    // Get retrieves a token by key without consuming it.
    // Returns core.ErrNotFound if the key doesn't exist or has expired.
    Get(ctx context.Context, key string) (*Token[T], error)

    // Delete removes a token by key.
    Delete(ctx context.Context, key string) error
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One interesting thing I learned while building this is about timing attacks: comparing two strings can take different amounts of time. If two 32-character strings differ at the second character, the comparison stops after two comparisons, but if the strings differ only at the last character, the comparison takes slightly longer because it needs to do 31 comparisons. Apparently, even with the noisy nature of networks, and the noisy load on databases and servers, with enough data points it's possible to derive a token value by measuring the time it takes to check (invalid) tokens (which indirectly measures the time it takes for, say, postgres to look up a string in a btree index). Pretty wild!&lt;/p&gt;

&lt;p&gt;To protect against that, you separate the token key from the token code. When checking a token, you look up the payload using the key (which is subject to timing analysis) but you &lt;em&gt;verify&lt;/em&gt; the token against the code using a constant-time comparison function like Go's &lt;a href="https://pkg.go.dev/crypto/subtle#ConstantTimeCompare" rel="noopener noreferrer"&gt;crypto/subtle.ConstantTimeCompare()&lt;/a&gt;.&lt;/p&gt;
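
&lt;p&gt;Sketched roughly (illustrative, not the actual atlas9 code – here the stored token data is assumed to hold a hash of the secret code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func Verify(ctx context.Context, store Store[[32]byte], key, code string) error {
    // look up by key; timing here can only leak information about the key,
    // not the secret code
    tok, err := store.Get(ctx, key)
    if err != nil {
        return err
    }

    // hash the presented code and compare in constant time
    sum := sha256.Sum256([]byte(code))
    if subtle.ConstantTimeCompare(sum[:], tok.Data[:]) != 1 {
        return core.ErrNotFound
    }
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;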

&lt;p&gt;The details are in &lt;a href="https://atlas9.dev/src/core/tokens/secure.go" rel="noopener noreferrer"&gt;https://atlas9.dev/src/core/tokens/secure.go&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  dropping the clock
&lt;/h3&gt;

&lt;p&gt;I thought having a &lt;code&gt;Clock&lt;/code&gt; interface would be a good way to remind myself to write testable code that depends on time (instead of using &lt;code&gt;time.Now()&lt;/code&gt; directly), but the Go team recently solved that much more thoroughly with &lt;a href="https://go.dev/blog/testing-time" rel="noopener noreferrer"&gt;https://go.dev/blog/testing-time&lt;/a&gt;.&lt;/p&gt;
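
&lt;p&gt;With &lt;code&gt;testing/synctest&lt;/code&gt;, code that calls &lt;code&gt;time.Now()&lt;/code&gt; directly can be tested against a fake clock – no &lt;code&gt;Clock&lt;/code&gt; interface needed. A rough sketch (&lt;code&gt;newToken&lt;/code&gt; is a hypothetical helper):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func TestTokenExpiry(t *testing.T) {
    // inside the synctest "bubble", time is virtual: Sleep advances
    // the fake clock instantly instead of actually waiting
    synctest.Test(t, func(t *testing.T) {
        tok := newToken(5 * time.Minute)

        time.Sleep(6 * time.Minute) // returns immediately

        if !tok.ExpiresAt.Before(time.Now()) {
            t.Fatal("expected token to be expired")
        }
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;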

&lt;h3&gt;
  
  
  csrf protection
&lt;/h3&gt;

&lt;p&gt;I spent some time setting up protection against &lt;a href="https://owasp.org/www-community/attacks/csrf" rel="noopener noreferrer"&gt;cross-site request forgery&lt;/a&gt; using a variety of tricks (cookies, headers, hidden form tokens, etc), but then I discovered that the Go team, again, solved this much better than I could with: &lt;a href="https://github.com/golang/go/issues/73626" rel="noopener noreferrer"&gt;https://github.com/golang/go/issues/73626&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alex Edwards does a great job explaining the situation here: &lt;a href="https://www.alexedwards.net/blog/preventing-csrf-in-go" rel="noopener noreferrer"&gt;https://www.alexedwards.net/blog/preventing-csrf-in-go&lt;/a&gt;. I learned a &lt;em&gt;ton&lt;/em&gt; about cross-origin request security, why it matters, why the old approaches to CSRF protection are insufficient, what browsers have done about it, and why it's actually CORF :)&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>security</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Publishing the core</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/publishing-the-core-5216</link>
      <guid>https://forem.com/buchanae/publishing-the-core-5216</guid>
      <description>&lt;p&gt;After many weeks of hemming and hawing, going to and fro, I decided it's time to put some actual code out into the world.&lt;/p&gt;

&lt;p&gt;A lot of code has been written, lots of experiments and ideas, notes, scribbles, thoughts, questions, and hundreds of millions of tokens (226M apparently!) worth of conversation with Claude. It's been fun and educational to explore topics in depth, reflect on all I've learned over the last many years, all the struggles, and be reminded of how little I know still. Claude has been a great partner on this journey, even though I rewrote almost all of the code it produced (I'll write more about that experience at the end of this post).&lt;/p&gt;

&lt;p&gt;After all that, I needed to climb out of the rabbit hole, create an empty directory, and pull out the pieces that seemed reasonable.&lt;/p&gt;

&lt;p&gt;Here's what's published:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a &lt;code&gt;core&lt;/code&gt; Go module at &lt;a href="https://atlas9.dev/c/core" rel="noopener noreferrer"&gt;atlas9.dev/c/core&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;non-core modules at:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Browse the source at &lt;a href="https://atlas9.dev/src" rel="noopener noreferrer"&gt;atlas9.dev/src&lt;/a&gt; and clone the source with &lt;code&gt;git clone https://git.atlas9.dev/atlas9.git&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There's also a demo app (&lt;a href="https://atlas9.dev/src/apps/demo" rel="noopener noreferrer"&gt;src&lt;/a&gt;) that pulls the pieces together to provide a simple app with login, registration, passwords, user profiles, and OIDC login with Google and Apple. I'll be expanding this demo app as I add more features to the atlas9 modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core
&lt;/h2&gt;

&lt;p&gt;Let's look at some parts of the core.&lt;/p&gt;

&lt;p&gt;There's an ID type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id := core.NewID()
println(id.String()) // 04JE0MSNG09VXY4ZRMCMZF244G
core.ID{}.IsZero() == true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ID type implements various interfaces that make it easy to (un)marshal from json, text, sql databases, etc. I wrote about designing this ID previously &lt;a href="https://atlas9.dev/blog/id-type.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
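
&lt;p&gt;For example, an ID can be embedded in a struct and round-tripped through JSON without any extra code (a sketch, relying on the standard &lt;code&gt;encoding/json&lt;/code&gt; behavior):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Doc struct {
    ID core.ID `json:"id"`
}

// ID marshals to its string form automatically
b, _ := json.Marshal(Doc{ID: core.NewID()})
println(string(b)) // {"id":"04JE0MSNG09VXY4ZRMCMZF244G"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;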

&lt;p&gt;There's a Clock interface, which exists to remind us to make &lt;code&gt;time.Now()&lt;/code&gt; testable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Clock interface {
    Now() time.Time
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a couple of small types related to pagination, which I hope help keep pagination consistent and remind us to implement pagination from the start. This might move into a generic "crud" helper in the future.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Page[T any] struct {
    Items []T
    Cursor string
}

type PageReq struct {
    Limit int
    Cursor string
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
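
&lt;p&gt;Consuming a paginated list then follows one consistent shape – pass the cursor back until it's empty (a sketch; &lt;code&gt;ListDocs&lt;/code&gt; is a hypothetical store method):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;req := core.PageReq{Limit: 50}
for {
    page, _ := store.ListDocs(ctx, req)
    for _, doc := range page.Items {
        // process doc
    }
    // an empty cursor means there are no more pages
    if page.Cursor == "" {
        break
    }
    req.Cursor = page.Cursor
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;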



&lt;h4&gt;
  
  
  Path, Guard, and access control
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;Path&lt;/code&gt; type represents the path to a resource. Most applications have some hierarchy of resources – you might have an "organization", and an org can have multiple "team" containers. So if the "dev" team in the "acme" org stores a document, that might be represented as a &lt;code&gt;(Path, ID)&lt;/code&gt; tuple like &lt;code&gt;("acme.dev", "04JE0MSNG09VXY4ZRMCMZF244G")&lt;/code&gt;. Or maybe you have a deeper hierarchy like &lt;code&gt;"acme.eng.dev.contractors"&lt;/code&gt;. A path is a dot-separated string that describes the location of a resource in the application's hierarchy. Paths are used for determining access – you might grant Bob access to all resources in the "acme.eng.dev" path, for example.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Path&lt;/code&gt; isn't used in the demo app yet and needs more development.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;core/iam.Guard&lt;/code&gt; is an interface that represents an access check.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Guard interface {
    Check(context.Context, Action, core.Path) error
}

// the context would hold information about the current principal (user)
ctx := context.Background() 

// a real application would implement the guard interface for its access control model,
// or use an implementation provided by atlas9.
g := iam.AllowNone{}

// the Action type denotes that this string is an action used in an access control check.
act := iam.Action("example action")

// guard checks operate on paths – access is granted based on paths.
p := core.Path("acme.eng.dev")

g.Check(ctx, act, p) == iam.ErrForbidden

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ideal convention is to do guard checks at lower levels – in the &lt;code&gt;Store&lt;/code&gt; implementations, close to where the database is accessed – to help ensure that requests don't accidentally gain access, via references, to resources they shouldn't have access to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type DocStore struct {
  db dbi.DBI
  guard iam.Guard
}

var ActionGetDoc iam.Action = "get doc"

func (ds *DocStore) GetDoc(ctx context.Context, id core.ID) (*Doc, error) {
  var doc Doc
  err := dbi.Get(ctx, ds.db, &amp;amp;doc, `SELECT ... FROM documents ...`, id)
  if err != nil {
    return nil, err
  }
  if err := ds.guard.Check(ctx, ActionGetDoc, doc.Path); err != nil {
    return nil, err
  }
  return &amp;amp;doc, nil
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Path, Guard, and friends need more thought and work. Access control is such a broad, complex topic, and applications have a wide variety of needs, that I've had a very hard time landing on something I feel good about. I don't love the design yet.&lt;/p&gt;

&lt;p&gt;Guard, at the very least, will need to support list/query filtering, so it might get a &lt;code&gt;Filter(ctx, action, path) Filters&lt;/code&gt; method in the future.&lt;/p&gt;

&lt;p&gt;I'm certain that whatever I come up with won't work for some use cases, and that's ok, it should be easy to ignore and leave unused. I hope to find something that works nicely for common application designs though.&lt;/p&gt;

&lt;h3&gt;
  
  
  Users, Principals, Sessions, Passwords, OIDC
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;core/iam&lt;/code&gt; package has types and interfaces for common identity concepts like users (represents a human user with a login, email, etc), principal (the identity of a caller, could be a user, a machine, etc), and sessions. It has some helpers for creating and verifying passwords.&lt;/p&gt;

&lt;p&gt;Many applications have a "sign in with Google/Apple/Github/etc" feature, so &lt;code&gt;core/iam&lt;/code&gt; provides an interface for storing identity providers for users, and the &lt;code&gt;iam/oidc_provider&lt;/code&gt; module provides implementations for common identity providers.&lt;/p&gt;

&lt;p&gt;In general, I've tried to keep the core light – core defines the concepts, the types and interfaces, and mostly leaves concrete implementations and their dependencies to other modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Routes
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;core/routes&lt;/code&gt; contains my preferred approach to building APIs – RPC instead of REST. My RPC design uses &lt;code&gt;POST&lt;/code&gt; requests for everything, with no information encoded into URLs. I like the consistency and constraint of RPC compared to REST, so that I don't have to think about which request method to use, what information to put into the URL or query parameters, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Route struct {
    Pattern string
    Handler http.Handler
    ReqType any
    ResType any
}

// RPC endpoint
func RPC[T, U any](path string, handler RpcHandler[T, U]) Route

// Standard HTTP handler
func HTTP(pattern string, handler http.HandlerFunc) Route

// Static files
func Static(mux *http.ServeMux, filesys fs.FS, pattern, filepath string)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go generics and &lt;code&gt;routes.RPC&lt;/code&gt; allow me to write an API endpoint more like a standard function, with the request and response types clearly visible in the function signature. &lt;code&gt;routes.Route&lt;/code&gt; provides routes as data – I have some unpublished work that can turn these routes into OpenAPI specs, generate clients, terraform, etc.&lt;/p&gt;
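
&lt;p&gt;As a sketch of what that looks like in practice (illustrative names, not copied from the atlas9 source):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type CreateDocReq struct {
    Title string
}

type CreateDocRes struct {
    ID core.ID
}

// the handler reads like a plain function; request decoding and
// response encoding are handled by routes.RPC
func createDoc(ctx context.Context, req CreateDocReq) (CreateDocRes, error) {
    // ... validate, guard check, store ...
    return CreateDocRes{ID: core.NewID()}, nil
}

route := routes.RPC("/api/docs.create", createDoc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;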

&lt;p&gt;Unfortunately, the demo app doesn't demonstrate routes yet, because the app uses forms and url-encoded bodies instead of javascript and RPCs. So lots more work to do in this area. I'll write more about this in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Database interface (DBI)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;core/dbi&lt;/code&gt; grew out of the need to start read-write and read-only transactions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;userID, err := dbi.ReadOnly(ctx, p.DB, func(tx dbi.DBI) (core.ID, error) {
    users := p.Users(tx)
    passwords := p.Passwords(tx)

    user, err := users.GetByEmail(ctx, email)
    ...
})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it will likely include a few helpers for common patterns, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (s *SqlitePasswordHashStore) GetHash(ctx context.Context, userID core.ID) (iam.PasswordHash, error) {
    var hash iam.PasswordHash
    err := dbi.Get(ctx, s.db, &amp;amp;hash, `SELECT hash FROM password_hashes WHERE user_id = $1`, userID)
    return hash, err
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dbi.Get&lt;/code&gt; handles the common dance of running a query, scanning the result into a struct, handling errors, etc.&lt;/p&gt;

&lt;p&gt;One design decision here worth pointing out is that "Stores" (the code that contains the database access) are created per-transaction by convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dbi.ReadOnly(ctx, p.DB, func(tx dbi.DBI) (core.ID, error) {
    users := p.Users(tx)
    passwords := p.Passwords(tx)
})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows multiple orthogonal application domains to be used within one transaction, which helps keep the store interfaces and implementations small and focused. The tradeoff is a small amount of code and a "factory" concept for creating the store instances.&lt;/p&gt;

&lt;p&gt;This also fits well with &lt;code&gt;Guard&lt;/code&gt; or any other request-scoped information that a store would need.&lt;/p&gt;
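
&lt;p&gt;A minimal sketch of that factory convention (illustrative – the actual atlas9 code may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// the "factory" holds the root DB handle plus request-scoped dependencies
type Stores struct {
    DB    dbi.DBI
    Guard iam.Guard
}

// each store is constructed against the current transaction
func (s *Stores) Users(tx dbi.DBI) *UserStore {
    return &amp;amp;UserStore{db: tx, guard: s.Guard}
}

func (s *Stores) Passwords(tx dbi.DBI) *PasswordStore {
    return &amp;amp;PasswordStore{db: tx}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;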

&lt;h2&gt;
  
  
  On hosting code
&lt;/h2&gt;

&lt;p&gt;GitHub is great, but I wanted to do something different this time. It's not simpler or easier or (arguably) better. It took much more effort to figure out and set up, the file browsing is bare bones, etc.&lt;/p&gt;

&lt;p&gt;It does feel somewhat empowering to have everything on my server, under my domain. I had fun setting it all up. I learned a lot. It (barely) provides what I need right now and no more. I don't need Actions, Discussions, Issues, Pull Requests, Releases, Insights, etc. It's nice to get back to the roots of the internet – have a server, put files up, configure things, &lt;strong&gt;own it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Perhaps it's just that, &lt;strong&gt;I can&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why do you want to climb Mt Everest? Because it's there. – George Mallory&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  On building with Claude
&lt;/h2&gt;

&lt;p&gt;There's a lot of talk and turmoil about the role of LLMs in software (and everything else). There are those who love it, those who hate it, and those who worry about it. So I thought I'd share my experience.&lt;/p&gt;

&lt;p&gt;Claude has been a great partner on this atlas9 journey, but not because it's good at writing code. It's actually written mostly code that I was unhappy with, which is somewhat surprising to me. I've rewritten most code and docs Claude has written. Aside from code, I've also had long conversations where I needed to refine the initial plan proposed by Claude. If I had just "vibe coded" it and let Claude go off and do whatever it came up with first, I expect things would be quite messy.&lt;/p&gt;

&lt;p&gt;Claude often seems to overdo it. It will add bits and pieces that are unnecessary or were not discussed, but that may be common in software in general. The result doesn't feel refined. I'm sure this is, in part, due to how I'm using Claude. It's great for some code though – the &lt;a href="https://atlas9.dev/src/apps/demo/pages.go" rel="noopener noreferrer"&gt;Pages&lt;/a&gt; code in the demo is a good example of a few hundred lines of tedious code I'd rather not type out (pulling form values, accessing storage, handling errors, passing back to render, etc), and that doesn't need to be particularly refined.&lt;/p&gt;

&lt;p&gt;Claude is fantastic for learning new things. I have spent many hours deeply learning about postgres, access control, etc. The thought of learning those things by browsing the internet, with its messy pages and bad writing, sifting through SEO and clickbait garbage, dealing with popups and cookie prompts...ugh. It makes me cringe. Claude can almost instantly answer challenging questions I have on complex topics in a direct and concise way, without all the noise.&lt;/p&gt;

&lt;p&gt;Claude is, for me, best for working through ideas. I can write a few dozen words about an idea I have, and then I can iterate on that idea for hours. It can be really hard to get past an initial idea sometimes – like a writer getting past the blank page. Having Claude sketch out some detail allows me to work with something less abstract, figure out what feels good and bad, reject parts and refine it. I've also loved asking Claude, "what else?" – after hours of pondering an idea, when my brain is drained and I'm deep down the rabbit hole with tunnel vision, Claude can give fresh perspective and summarize any gaps or tradeoffs.&lt;/p&gt;

&lt;p&gt;Maybe LLMs will have another leap in the future and they'll write better code and design better applications or architectures than I could with no guidance from me, but so far that hasn't been my experience. They still need a driver. Perhaps software engineers have a perspective that is the distillation of years of experience, and perhaps LLMs haven't quite reached that level of &lt;em&gt;distilled&lt;/em&gt; knowledge yet. They probably will though, which is crazy and scary and exciting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I'm happy to finally get some real code published. Atlas9 feels slightly more like a real thing now.&lt;/p&gt;

&lt;p&gt;And if you actually read all this and made it to the end here, thank you! If you have thoughts or want to chat, email me at &lt;code&gt;blog@atlas9.dev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Until next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devjournal</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>My setup for integration tests in Go with embedded-postgres</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/my-setup-for-integration-tests-in-go-with-embedded-postgres-51o7</link>
      <guid>https://forem.com/buchanae/my-setup-for-integration-tests-in-go-with-embedded-postgres-51o7</guid>
      <description>&lt;p&gt;Today, I set out to write some integration tests. Since I'm using Postgres, one hurdle people often run into is how to get a smooth, fast experience while running tests against a test database.&lt;/p&gt;

&lt;p&gt;A lot of people like testcontainers or running a long-lived Postgres container in Docker. I personally don't like the overhead of this extra machinery – it seems simple, but in my experience it ends up being fragile somehow, and needs tending and troubleshooting.&lt;/p&gt;

&lt;p&gt;So, I want to try &lt;a href="https://github.com/fergusstrange/embedded-postgres" rel="noopener noreferrer"&gt;embedded-postgres&lt;/a&gt;, which is as close to our (sqlite) dream as we get in the Postgres world. (side note, I still &lt;em&gt;often&lt;/em&gt; dream of a more official, lightweight, embedded Postgres solution). This project works by downloading Postgres binaries (from Maven, apparently) and giving you some Go functions to start/stop a Postgres server.&lt;/p&gt;

&lt;p&gt;The nice thing about this approach is you can (ideally) open some Go code for the first time in VSCode, find a test, click the green "run" arrow, and it should just work – no docs to read, no make/just/bash scripts to run, no docker to install/manage, etc.&lt;/p&gt;

&lt;p&gt;And it did work for me on the first try, but it was slow. If I was iterating on a single test, running that test frequently after making changes, I was seeing ~20 seconds per test run. That felt unbearably slow.&lt;/p&gt;

&lt;p&gt;At first, I thought it was Postgres setup time. I attempted to tackle that by setting a persistent &lt;code&gt;data&lt;/code&gt; directory, so that Postgres wouldn't spend time in &lt;a href="https://www.postgresql.org/docs/current/app-initdb.html" rel="noopener noreferrer"&gt;initdb&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;testDB = embeddedpostgres.NewDatabase(embeddedpostgres.DefaultConfig().
            Port(port).
            DataPath("./data/pg-embedded").
            Database("test_db"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was surprised when that wasn't the answer – it helped a little, but I was still seeing ~18 seconds per run.&lt;/p&gt;

&lt;p&gt;After digging around (and with plenty of help from Claude, of course), I found that most of the time was spent extracting the compressed Postgres binaries. Apparently embedded-postgres caches the downloaded files, but it will extract the archive on every run (github issue is &lt;a href="https://github.com/fergusstrange/embedded-postgres/issues/154" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;You can change this behavior by setting a &lt;code&gt;BinariesPath&lt;/code&gt; config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;testDB = embeddedpostgres.NewDatabase(embeddedpostgres.DefaultConfig().
            Port(port).
            DataPath("./data/pg-embedded").
            BinariesPath("./data/pg-embedded-bin").
            Database("test_db"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, my test time dropped to ~1 second. The first test pays the cost of setting up the database, but after that I reuse the connection, so tests drop to ~0.1 seconds each. I'm running the migrations once on startup.&lt;/p&gt;
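
&lt;p&gt;The reuse is nothing fancy – just &lt;code&gt;sync.Once&lt;/code&gt;. Here's a simplified sketch of the pattern (a counter stands in for the real work of starting embedded Postgres and running migrations; the names are illustrative, not my exact code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var (
    setupOnce  sync.Once
    setupCount int
)

// setupDB stands in for the expensive part: starting the embedded
// Postgres server, opening the connection, and running migrations.
func setupDB() {
    setupCount++
}

// testConn is called by every test; only the first caller pays the
// setup cost, and the rest reuse the shared connection.
func testConn() {
    setupOnce.Do(setupDB)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;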

&lt;p&gt;I added some extra configuration to remove logs and other overhead from Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        testDB = embeddedpostgres.NewDatabase(embeddedpostgres.DefaultConfig().
            Port(port).
            StartParameters(map[string]string{
                "fsync": "off",
                "synchronous_commit": "off",
                "full_page_writes": "off",
                "checkpoint_timeout": "86400", // once per day, effectively never
                "log_checkpoints": "off",
                "log_connections": "off",
                "log_disconnections": "off",
                "autovacuum": "off",
                "log_min_messages": "PANIC", // suppress almost all log output
            }).
            DataPath("./data/pg-embedded").
            BinariesPath("./data/pg-embedded-bin").
            Database("test_db"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I don't have this in CI yet – some extra gymnastics will be required to keep the cached+extracted binaries available in CI, as usual.&lt;/p&gt;

&lt;p&gt;Personally, I love high-level testing against the actual API and database – it gives me confidence that the feature &lt;em&gt;actually&lt;/em&gt; works when all the pieces are put together, and I find it useful to be able to reproduce a reported issue by writing a test that matches the high-level steps to reproduce from the report.&lt;/p&gt;

</description>
      <category>database</category>
      <category>go</category>
      <category>postgres</category>
      <category>testing</category>
    </item>
    <item>
      <title>Digging into UUID, ULID, and implementing my own</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/digging-into-uuid-ulid-and-implementing-my-own-5801</link>
      <guid>https://forem.com/buchanae/digging-into-uuid-ulid-and-implementing-my-own-5801</guid>
      <description>&lt;p&gt;I needed to figure out what ID type to use for atlas9. Initially, I reached for UUIDv7, which seemed like the obvious choice: it's the new version of UUID that incorporates years of experience and wishes. It's sortable, which seems nice, and probably plays nicely with databases –  I'm guessing that if your UUID is a primary key, new IDs will be inserted closer to each other, rather than spread randomly over the index, which probably helps with caching and pages and that kind of stuff.&lt;/p&gt;

&lt;p&gt;But then I started to &lt;a href="https://atlas9.dev/blog/iam-hierarchy.html" rel="noopener noreferrer"&gt;experiment with ltree&lt;/a&gt;. ltree doesn't support hyphens in labels, so I would need to convert/remove them somehow. It wouldn't be terrible if the ltree path didn't match the IDs exactly, but it's not ideal either. I started wondering if maybe, for consistency, I should just remove hyphens from all my IDs, that way IDs in the database, logs, etc always look the same. That was the slippery slope that sent me way down this rabbit hole.&lt;/p&gt;

&lt;p&gt;So, if I wanted consistent ID strings, I could just remove the hyphens. UUIDs are 128-bit (16 byte) numbers. In their string form, they're represented by 36 characters (including hyphens, 32 without). Postgres has a &lt;code&gt;uuid&lt;/code&gt; type that can store the 16 byte version, which saves some space per ID compared to the string form.&lt;/p&gt;

&lt;p&gt;I started to wonder if there was a more compact string form. I came across some suggestions to use base58 or base32 or other bases. Base58 encodes a UUID into 21 characters. I discovered &lt;a href="https://www.crockford.com/base32.html" rel="noopener noreferrer"&gt;Crockford's Base32&lt;/a&gt;, which is interesting because it's URL-friendly and also tries to be human-friendly: it excludes characters that might &lt;em&gt;look&lt;/em&gt; the same, like l and 1, and aims to "Be pronounceable. Humans should be able to accurately transmit the symbols to other humans using a telephone." I don't think anyone is reading these IDs character by character (especially over the phone), but it's a fun find nonetheless.&lt;/p&gt;

&lt;p&gt;Along the way I found other ID specifications like &lt;a href="https://github.com/ulid/javascript" rel="noopener noreferrer"&gt;ULID&lt;/a&gt;. These have similar goals: sortable, compact, URL-friendly. Some of them have had bugs – for example, some were designed to be case-sensitive, which created issues with sorting in Postgres due to collation settings.&lt;/p&gt;

&lt;p&gt;I also dug into &lt;a href="https://github.com/google/uuid/blob/v1.6.0/version7.go" rel="noopener noreferrer"&gt;the implementation of UUIDv7 in the google/uuid package&lt;/a&gt; and the &lt;a href="https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#name-uuid-version-7" rel="noopener noreferrer"&gt;UUIDv7 spec itself&lt;/a&gt;, and I learned some interesting things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;they generate random data in blocks, perhaps because generating random data with crypto/rand is expensive enough to make it worth it&lt;/li&gt;
&lt;li&gt;the UUID format contains fields that specify the version and variant&lt;/li&gt;
&lt;li&gt;the time precision (because UUIDv7 is time-sortable) is variable: the spec defines at least 1 millisecond resolution, but implementations can extend this. The google/uuid library uses fractional milliseconds: &lt;code&gt;getV7Time returns the time in milliseconds and nanoseconds / 256&lt;/code&gt;, and has some &lt;a href="https://github.com/google/uuid/blob/0f11ee6918f41a04c201eceeadf612a377bc7fbc/version7.go#L94-L102" rel="noopener noreferrer"&gt;extra logic&lt;/a&gt; to ensure IDs are monotonic.&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;crypto/rand.Read&lt;/code&gt; function never returns an error and always fills the buffer entirely (&lt;a href="https://pkg.go.dev/crypto/rand#Read" rel="noopener noreferrer"&gt;docs&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
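
&lt;p&gt;To make the fractional-millisecond point concrete, here's a tiny sketch (my illustration, not the google/uuid code) of splitting a nanosecond timestamp into whole milliseconds plus a 12-bit sub-millisecond fraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// splitV7Time splits a unix nanosecond timestamp into the whole
// milliseconds stored in unix_ts_ms, plus the leftover nanoseconds
// divided by 256. The largest leftover is 999999 / 256 = 3906, which
// fits in the 12-bit rand_a field.
func splitV7Time(unixNano int64) (ms, frac int64) {
    ms = unixNano / 1_000_000
    frac = (unixNano % 1_000_000) / 256
    return ms, frac
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;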

&lt;p&gt;The UUIDv7 layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/*
         0 1 2 3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        | unix_ts_ms |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        | unix_ts_ms | ver | rand_a (12 bit seq) |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |var| rand_b |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        | rand_b |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    */

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent a few hours going down this rabbit hole, learning all the things people have considered when designing these ID formats. At some point, I started writing my own implementation into atlas9. Initially, I just wanted to avoid the dependency on google/uuid – the code is fairly small and straightforward, so avoiding an external dependency seemed reasonable.&lt;/p&gt;

&lt;p&gt;As I implemented ID generation, I realized I don't really need some of these features: monotonicity, introspectable IDs (version and variant).&lt;/p&gt;

&lt;p&gt;I don't need IDs to be monotonic. Libraries put in extra effort to ensure that IDs generated by the same instance are monotonic (always increasing). Maybe I'm missing something, but I don't see the point. You most likely have multiple instances of a service spread across multiple servers, and there's no guarantee &lt;em&gt;they&lt;/em&gt; are monotonic, so I don't quite understand why it's important that IDs from a single instance are strictly monotonic. I suppose it matters if you want to rely on IDs for ordering, don't want to pass along a high-resolution timestamp, and are ok with relying on system clocks for ordering guarantees (which is bold, perhaps).&lt;/p&gt;

&lt;p&gt;So anyway, I decided to keep the code dead simple and drop monotonicity. I also dropped the bits used for variant and version from the UUID spec, because I don't need IDs to be introspectable the way UUIDs are. That frees a few more bits for random data (or could be traded for a higher-resolution timestamp). The current implementation uses 1 millisecond resolution (just a unix timestamp), and IDs are encoded with Crockford's Base32, producing a 26-character string.&lt;/p&gt;

&lt;p&gt;Initially, I stored the ID as a slice of bytes (&lt;code&gt;type ID []byte&lt;/code&gt;), but it seemed silly to be constantly encoding and decoding that to and from a string (which happened all over the codebase: APIs, database, access checks, logging, etc), so I ended up just storing it as a &lt;code&gt;type ID string&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I'm a bit sad that I lost the compact, 16-byte representation in Postgres. I did look into the &lt;a href="https://www.postgresql.org/docs/current/sql-createdomain.html" rel="noopener noreferrer"&gt;CREATE DOMAIN&lt;/a&gt; and &lt;a href="https://www.postgresql.org/docs/current/sql-createtype.html" rel="noopener noreferrer"&gt;CREATE TYPE&lt;/a&gt; statements in Postgres – seems like you can define your own types with tons of detail – but it started to feel like too much, and was over my head, and also very Postgres-specific. Fascinating features though!&lt;/p&gt;

&lt;p&gt;I'm not super worried about compact byte size anyway – with 1 million rows, it's what, like 10 megabytes? Probably more with indexes and such, but 🤷, if that many bytes are truly a problem, maybe UUIDs are the wrong choice anyway (you'd probably reach for int64 instead).&lt;/p&gt;

&lt;p&gt;So, after all that, I'm not certain what I made is &lt;em&gt;better&lt;/em&gt;, but I think it will work.&lt;/p&gt;

&lt;p&gt;Sometimes engineering is like driving to the grocery store – you've been there hundreds of times, you know exactly where you're going and what's needed. Sometimes, like this time, it's more like Lewis and Clark – venturing out west somewhere, with a vague destination, looking for lots of help from others along the way, until you end up &lt;em&gt;somewhere&lt;/em&gt; and it's all ok.&lt;/p&gt;

&lt;p&gt;Here's the full implementation of generation (encoding is implemented elsewhere):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; NewID()
06E2MKE0YP14ENMYC5X2Q3E34C

func NewID() ID {
    var data [16]byte
    t := time.Now().UnixMilli()

    data[0] = byte(t &amp;gt;&amp;gt; 40)
    data[1] = byte(t &amp;gt;&amp;gt; 32)
    data[2] = byte(t &amp;gt;&amp;gt; 24)
    data[3] = byte(t &amp;gt;&amp;gt; 16)
    data[4] = byte(t &amp;gt;&amp;gt; 8)
    data[5] = byte(t)

    poolMtx.Lock()
    if poolPos == poolSize {
        // https://pkg.go.dev/crypto/rand#Read
        // Read fills b with cryptographically secure random bytes.
        // It never returns an error, and always fills b entirely.
        // Read calls io.ReadFull on Reader and crashes the program irrecoverably
        // if an error is returned. The default Reader uses operating system APIs
        // that are documented to never return an error on all but legacy Linux systems.
        _, err := rand.Read(pool[:])
        if err != nil {
            poolMtx.Unlock()
            panic(err)
        }
        poolPos = 0
    }
    copy(data[6:], pool[poolPos:poolPos+10])
    poolPos += 10
    poolMtx.Unlock()
    return ID{str: encodeBase32(data)}
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
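

&lt;p&gt;The encoding is implemented elsewhere in the codebase; for completeness, here's a minimal sketch (not the exact atlas9 code) of a Crockford Base32 encoder matching the format described above – 16 bytes in, 26 characters out, with the 128 bits treated as if two zero bits were prepended so they divide evenly into 5-bit groups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const crockford = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// encodeBase32 packs 16 bytes into 26 Crockford Base32 characters.
// 26 * 5 = 130 bits, so the 128-bit value is padded with two leading
// zero bits, and the first character only ranges over 0-7.
func encodeBase32(data [16]byte) string {
    var out [26]byte
    for i := 0; i &amp;lt; 26; i++ {
        v := 0
        for b := 0; b &amp;lt; 5; b++ {
            p := i*5 + b - 2 // bit position in data; p &amp;lt; 0 is padding
            v &amp;lt;&amp;lt;= 1
            if p &amp;gt;= 0 {
                v |= int(data[p/8] &amp;gt;&amp;gt; (7 - p%8) &amp;amp; 1)
            }
        }
        out[i] = crockford[v]
    }
    return string(out)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;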



</description>
      <category>architecture</category>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Modeling identity and access hierarchy in Postgres with ltree</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/modeling-identity-and-access-hierarchy-in-postgres-with-ltree-45im</link>
      <guid>https://forem.com/buchanae/modeling-identity-and-access-hierarchy-in-postgres-with-ltree-45im</guid>
      <description>&lt;p&gt;Can Alice edit this blog post? Can Bob comment on that document? Can product executives modify the draft quarterly report? How do applications answer these questions about who can do what?&lt;/p&gt;

&lt;p&gt;Essentially, the question is &lt;code&gt;can user U take action A on resource R?&lt;/code&gt; – can Alice (user) edit (action) this blog post (resource)?&lt;/p&gt;

&lt;p&gt;You could imagine keeping a &lt;code&gt;grants&lt;/code&gt; table in your app that has a row for every granted permission. Each grant is a tuple of &lt;code&gt;(user, action, resource)&lt;/code&gt;, so &lt;code&gt;(alice, edit, this blog post)&lt;/code&gt; grants Alice permission to edit this post.&lt;/p&gt;

&lt;p&gt;When there are lots of users, actions and resources, you might want to group them. Groups of actions might be called roles. Users might be grouped into teams, orgs, or divisions. Resources might be grouped into workspaces or accounts. With groups, you can grant access in bulk: the marketing team can edit all documents in workspace 1.&lt;/p&gt;

&lt;p&gt;You might even want groups of groups: the DX team and the SRE team are both Infra teams, and Infra teams have admin access to account 123. Some teams and users have multiple parents: "the principal engineers are on both the Infra team and the DX team" or "Pete is on the Sales team but also reports directly to the CEO".&lt;/p&gt;

&lt;p&gt;What if you want &lt;em&gt;dynamic&lt;/em&gt; groups? Can users define their own teams, roles, or workspaces? Grouping users is common – any medium-sized organization has some internal data system for organizing people into groups (LDAP, Active Directory, Okta, etc). It's nice when those can be synced into applications.&lt;/p&gt;

&lt;p&gt;Some applications support flexible, nested groups of resources (e.g. Google Drive), but most get by with shallow, predefined, coarse groups like "workspace" or "organization". And that can work just fine – take GitHub for example, it does a lot with "organization" and "repository".&lt;/p&gt;

&lt;p&gt;User-defined roles seem less common – most applications get by with predefined, coarse groups of actions. Although, using GitHub as an example again, they have slowly introduced more flexible, fine-grained control over time. In my experience, inflexibility in roles is a common pain point for both users and product designers.&lt;/p&gt;

&lt;p&gt;While pondering how to organize identity and access in atlas9, I've been thinking about whether it can (or should) provide support for flexible, fine-grained, dynamic groups of all three: users, actions, and resources. I have sometimes heard users and product designers wish they had more fine-grained control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I wish I could grant this token only these very specific actions, instead of a general editor role"&lt;/li&gt;
&lt;li&gt;"I wish I could grant contractors access only to edit alert configurations"&lt;/li&gt;
&lt;li&gt;"These resources are in account 123 and I can't move them, but I want to grant team X access to only a subset"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, how do I model nested groups of users, actions, and resources? How do I execute access checks against all that information? How do I organize all that data in Postgres? Is it efficient enough?&lt;/p&gt;

&lt;h3&gt;
  
  
  ltree
&lt;/h3&gt;

&lt;p&gt;Postgres has a cool, core extension called &lt;code&gt;ltree&lt;/code&gt; - &lt;a href="https://www.postgresql.org/docs/current/ltree.html" rel="noopener noreferrer"&gt;hierarchical tree-like data type&lt;/a&gt;. There are a lot of great blog posts and websites that talk about hierarchical data and ltree (and alternatives) in better detail than I can write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://leonardqmarcq.com/posts/modeling-hierarchical-tree-data" rel="noopener noreferrer"&gt;https://leonardqmarcq.com/posts/modeling-hierarchical-tree-data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://leonardqmarcq.com/posts/dos-and-donts-of-modeling-hierarchical-trees-in-postgres" rel="noopener noreferrer"&gt;https://leonardqmarcq.com/posts/dos-and-donts-of-modeling-hierarchical-trees-in-postgres&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://leonardqmarcq.com/posts/dos-and-donts-of-modeling-hierarchical-trees-in-postgres" rel="noopener noreferrer"&gt;https://tudborg.com/posts/2022-02-04-postgres-hierarchical-data-with-ltree/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neon.com/docs/extensions/ltree" rel="noopener noreferrer"&gt;https://neon.com/docs/extensions/ltree&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/ltree.html" rel="noopener noreferrer"&gt;https://www.postgresql.org/docs/current/ltree.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ltree provides an &lt;code&gt;ltree&lt;/code&gt; data type which efficiently stores a path in a tree-like structure – if Sally is on the Marketing team in the GTM org, you could store that path as &lt;code&gt;gtm.marketing.sally&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION ltree;

CREATE TABLE teams (path ltree, name text);
INSERT INTO teams (path, name) VALUES ('gtm.marketing.sally'::ltree, 'sally');
-- sally is also on the product design team
INSERT INTO teams (path, name) VALUES ('product.design.sally'::ltree, 'sally');
INSERT INTO teams (path, name) VALUES ('product.design.sam'::ltree, 'sam');
INSERT INTO teams (path, name) VALUES ('gtm.marketing.tom'::ltree, 'tom');
INSERT INTO teams (path, name) VALUES ('eng.infra.bob'::ltree, 'bob');
INSERT INTO teams (path, name) VALUES ('eng.infra.rob'::ltree, 'rob');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get everyone in the GTM org:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; select * from teams where path &amp;lt;@ 'gtm';
        path         | name
---------------------+-------
 gtm.marketing.sally | sally
 gtm.marketing.tom   | tom

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if you wanted to get all the teams Sally is on, you could query for &lt;code&gt;name = 'sally'&lt;/code&gt; and then split the paths and determine that Sally is in &lt;code&gt;gtm, gtm.marketing, product, product.design&lt;/code&gt;. Imagine that each of those levels could have different access to different resources.&lt;/p&gt;
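
&lt;p&gt;That path-splitting step is simple enough; a sketch in Go (the function name is mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// groupsFromPath expands a team path like "gtm.marketing.sally" into
// the groups it implies for the trailing member: gtm and gtm.marketing.
func groupsFromPath(path string) []string {
    labels := strings.Split(path, ".")
    groups := make([]string, 0, len(labels)-1)
    for i := 1; i &amp;lt; len(labels); i++ {
        groups = append(groups, strings.Join(labels[:i], "."))
    }
    return groups
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;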

&lt;h3&gt;
  
  
  Multiple parents, materialized paths, and modifications
&lt;/h3&gt;

&lt;p&gt;I won't go into too much detail here, but note that in a hierarchy where nodes can have multiple parents (a DAG), you're materializing the full path to all leaf nodes. You can see Sally has rows for both &lt;code&gt;gtm.marketing.sally&lt;/code&gt; and &lt;code&gt;product.design.sally&lt;/code&gt;. Modifications require extra care to maintain consistency across all rows.&lt;/p&gt;

&lt;p&gt;For example, if you remove Sally entirely, you have to find all rows related to Sally. If you remove Sally from the product design team, you have to remove only that path. If you were ever to move the &lt;code&gt;design&lt;/code&gt; team to the &lt;code&gt;ux&lt;/code&gt; org, you'd have to rewrite the relevant rows so that &lt;code&gt;product.design.sally&lt;/code&gt; becomes &lt;code&gt;ux.design.sally&lt;/code&gt; and &lt;code&gt;product.design.sam&lt;/code&gt; becomes &lt;code&gt;ux.design.sam&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you had a large company, some changes could update a lot of rows. Even without multiple parents, if you were modeling a large filesystem for example, moving a directory could require rewriting a lot of paths.&lt;/p&gt;

&lt;p&gt;So, ltree is great for &lt;em&gt;querying&lt;/em&gt; hierarchical data, and in some cases, less great for updating large subtrees (e.g. compared to an adjacency list).&lt;/p&gt;

&lt;h3&gt;
  
  
  Directory, groups, grants, and roles
&lt;/h3&gt;

&lt;p&gt;In the current prototype, resources are organized in a filesystem fashion, and backed by a table called &lt;code&gt;directory&lt;/code&gt;, which keeps a path for all resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; select * from directory;

 id  |         path         |    name
-----+----------------------+-------------
 bp1 | posts.gtm.marketing  | Blog Post 1
 bp2 | posts.product.design | Blog Post 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently, resources are not allowed to have multiple parents, because I'm not sure yet how that would relate to access control and other filesystem operations.&lt;/p&gt;

&lt;p&gt;Roles are stored in the database, and each role is an array of actions. Roles have no hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; select * from roles;

   id   |   actions
--------+-------------
 editor | {view,edit}
 viewer | {view}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Groups are stored in hierarchy using ltree paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; select * from groups;

      path      | name
----------------+-------
 gtm.marketing  | bob
 gtm.marketing  | sally
 product.design | sam

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, the &lt;code&gt;grants&lt;/code&gt; table connects all three. Each row grants a principal (user or group) a role (list of actions) on a resource (directory node, could be a "folder" or a resource):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; select * from grants;

   principal   |  role  | resource
---------------+--------+----------
 gtm.marketing | editor | bp1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access checks use all this information to make a decision:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When a request comes in, the user is loaded from the session.&lt;/li&gt;
&lt;li&gt;The user's groups are loaded.&lt;/li&gt;
&lt;li&gt;The directory tree for the requested resource is loaded.&lt;/li&gt;
&lt;li&gt;Grants related to both the resource's directory tree and the user's groups are loaded.&lt;/li&gt;
&lt;li&gt;Roles related to the relevant grants are loaded.&lt;/li&gt;
&lt;li&gt;An authorizer checks whether the user or any of their groups has the requested action on the requested resource or any of its parent directories.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bob wants to edit blog post 1: &lt;code&gt;action: edit, resource: bp1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Blog post 1 is stored in the &lt;code&gt;posts.gtm.marketing&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Bob on the marketing team: &lt;code&gt;user.groups: gtm, gtm.marketing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The marketing team has been granted the editor role on the marketing posts directory: &lt;code&gt;grants: (marketing, editor, posts.gtm.marketing)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The editor role includes the edit action: &lt;code&gt;editor: view, edit, etc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The authorizer gets the request &lt;code&gt;(bob, edit, bp1)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The authorizer pulls all the information together to see that Bob is on the Marketing team, which has been granted the editor role on a parent directory of the blog post, and the role contains the edit action. All good.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cedar
&lt;/h3&gt;

&lt;p&gt;I wrote about Cedar &lt;a href="https://atlas9.dev/blog/access-with-cedar.html" rel="noopener noreferrer"&gt;previously&lt;/a&gt;, and I'm still interested in using it to capture access rules. In this design, Cedar is the "authorizer" – the application queries all the relevant information (user, groups, grants, directory tree, etc), which it passes off to Cedar to make the final decision.&lt;/p&gt;

&lt;p&gt;I spent some time wondering whether Cedar needs to play a role at all. After all, the application and the structure of the information do a lot of the work. You could imagine an authorizer that just flattens all the actions from the relevant grants+roles and searches for the requested action. In fact, that's roughly how most homegrown auth code I've seen works (although with more hard-coded information and less coming from the database). And in the current design, the authorizer is an interface, so the implementation could be swapped out to do just that.&lt;/p&gt;
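
&lt;p&gt;A flattening authorizer like that could be sketched as follows (hypothetical types and names, assuming the user's groups, the resource's directory tree, and the relevant grants and roles have already been loaded as described above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Grant struct {
    Principal string // a user or group path
    Role      string
    Resource  string // a directory path
}

// allowed reports whether any of the user's groups holds a grant, on
// the resource or one of its parent directories, whose role contains
// the requested action.
func allowed(userGroups, resourceTree []string, grants []Grant, roles map[string][]string, action string) bool {
    inGroups := make(map[string]bool)
    for _, g := range userGroups {
        inGroups[g] = true
    }
    onTree := make(map[string]bool)
    for _, r := range resourceTree {
        onTree[r] = true
    }
    for _, g := range grants {
        if !inGroups[g.Principal] || !onTree[g.Resource] {
            continue
        }
        for _, a := range roles[g.Role] {
            if a == action {
                return true
            }
        }
    }
    return false
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;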

&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; Cedar might still be useful though. There are other rules which fall outside of the structure laid out above, rules like "the owner of a resource has all permissions" or "viewers cannot view draft documents" or "anonymous users can view published posts", etc.&lt;/p&gt;

&lt;p&gt;Those rules can be defined in Cedar and linked to the application using Cedar's &lt;a href="https://docs.cedarpolicy.com/policies/templates.html" rel="noopener noreferrer"&gt;policy templates&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(
        principal, 
        action in BlogPost::allActions, 
        resource is BlogPost
)
when { principal == resource.owner }

permit(
        principal in ?principal,
        action == BlogPost::view,
        resource in ?resource
)
unless { resource.draft }

permit(
        principal == User::anonymous,
        action == BlogPost::view,
        resource is BlogPost
)
when { resource.published }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But, those rules could also just be encoded in the application itself. I'm not entirely sure yet whether it will be worthwhile to involve Cedar. Writing these rules in Cedar requires loading the relevant attributes anyway (e.g. loading &lt;code&gt;published&lt;/code&gt; and &lt;code&gt;draft&lt;/code&gt; into Cedar for processing), so at this time, Cedar feels like an unnecessary indirection. More experimentation is needed, and I need to build some real apps with this to see whether it adds value in the long run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Onward
&lt;/h3&gt;

&lt;p&gt;I've been trying hard to make this identity and access control foundation in atlas9 robust and flexible, because it underlies everything else – all operations need to do lots of different access checks to keep data safe, and features and users (I think) want the flexibility to organize and control access in ways that make sense to them, in ways that suit their specific needs.&lt;/p&gt;

&lt;p&gt;I hope this design isn't overly flexible to the point of being complicated and/or inefficient. Time to build some features with it and see!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>postgres</category>
      <category>security</category>
    </item>
    <item>
      <title>The challenges of soft delete</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/the-challenges-of-soft-delete-dgl</link>
      <guid>https://forem.com/buchanae/the-challenges-of-soft-delete-dgl</guid>
      <description>&lt;p&gt;Software projects often implement "soft delete", maybe with a &lt;code&gt;deleted&lt;/code&gt; boolean or an &lt;code&gt;archived_at&lt;/code&gt; timestamp column. If customers accidentally delete their data, they can recover it, which makes work easier for customer support teams. Perhaps archived records are even required for compliance or audit reasons.&lt;/p&gt;

&lt;p&gt;I've run into some trouble with soft delete designs. I'll cover those, and ponder ideas for how I'd build this in the future.&lt;/p&gt;

&lt;p&gt;Adding an &lt;code&gt;archived_at&lt;/code&gt; column seems to ooze complexity out into queries, operations, and applications. Recovering deleted records does happen, but 99% of archived records are never going to be read.&lt;/p&gt;

&lt;p&gt;So, the database tables will have a lot of dead data. Depending on access patterns, that might even be a significant amount of data. I've seen APIs that didn't work well with Terraform, so Terraform would delete + recreate records on every run, and over time that led to millions of dead rows. Your database can probably handle the extra bytes, and storage is fairly cheap, so it's not necessarily a problem, at first.&lt;/p&gt;

&lt;p&gt;Hopefully, the project decided on a retention period in the beginning, and set up a periodic job to clean up those rows. Unfortunately, I'd bet that a significant percentage of projects did neither – it's really easy to ignore the archived data for a long time.&lt;/p&gt;

&lt;p&gt;At some point, someone might want to restore a database backup. Hopefully that's for fun and profit and not because you lost the production database at 11 am. If your project is popular, you might have a giant database full of dead data that takes a long time to recreate from a dump file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;archived_at&lt;/code&gt; columns also complicate queries, operations, and application code. Applications need to make sure they always avoid the archived data that's sitting right next to the live data. Indexes need to be careful to avoid archived rows. Manual queries run for debugging or analytics are longer and more complicated. There's always a risk that archived data accidentally leaks in when it's not wanted. The complexity grows when there are mapping tables involved.&lt;/p&gt;

&lt;p&gt;Migrations have to deal with archived data too. Migrations may involve more than just schema changes – perhaps you need to fix a mistake with default values, or add a new column and backfill values. Is that going to work on records from 2 years ago? I've done migrations where these questions were not trivial to answer.&lt;/p&gt;

&lt;p&gt;Restoring an archived record is not always as simple as running &lt;code&gt;SET archived_at = null&lt;/code&gt; – creating a record may involve making calls to external systems as well. I've seen complex restoration code that was always a buggy, partial implementation of the "create" API endpoint. In the end, we removed the specialized restoration code and required all restoration to go through the standard APIs. That simplified the server implementation, and ensured that old data that had since become invalid could not be restored incorrectly, because restoration has to pass the current validation rules.&lt;/p&gt;

&lt;p&gt;I'm not a fan of the &lt;code&gt;archived_at&lt;/code&gt; column approach. It's simple at first, but in my experience, it's full of pitfalls down the line.&lt;/p&gt;

&lt;p&gt;Let's look at some alternatives (in PostgreSQL): application events, triggers, and logical replication.&lt;/p&gt;

&lt;p&gt;All these approaches store archived data separately from live data – that may be a separate database table, a separate database, object storage, etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Application level archiving
&lt;/h1&gt;

&lt;p&gt;One team I worked with took the approach of emitting an event at the application layer when a record was deleted. The event was sent to SQS, and another service would archive that object to S3 (among other things).&lt;/p&gt;

&lt;p&gt;This had a few big benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The primary database and application code were substantially simpler.&lt;/li&gt;
&lt;li&gt;Deleting a resource involved cleaning up resources in various external systems. Handling this in an async background system improved performance and reliability.&lt;/li&gt;
&lt;li&gt;The record and all its related records can be serialized to JSON in an application-friendly layout, rather than a serialized database table layout, so it's easier to work with.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bugs are more likely in application-level archiving code, and indeed this happened more than once – archived records were lost, and manual cleanup of external resources was necessary.&lt;/li&gt;
&lt;li&gt;It's more infrastructure to understand and operate: multiple services, a message queue, etc.&lt;/li&gt;
&lt;li&gt;Archived objects in S3 were not easy to query – finding records to restore required extra tooling from the customer support teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Triggers
&lt;/h1&gt;

&lt;p&gt;A trigger can copy a row to an archive table before it's deleted. The archive table can be a single, generic table that stores JSON blobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE archive (
    id UUID PRIMARY KEY,
    table_name TEXT NOT NULL,
    record_id TEXT NOT NULL,
    data JSONB NOT NULL,
    archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    caused_by_table TEXT,
    caused_by_id TEXT
);

CREATE INDEX idx_archive_table_record ON archive(table_name, record_id);
CREATE INDEX idx_archive_archived_at ON archive(archived_at);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trigger function converts the deleted row to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO archive (id, table_name, record_id, data)
    VALUES (
        gen_random_uuid(),
        TG_TABLE_NAME,
        OLD.id::TEXT,
        to_jsonb(OLD)
    );
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach this trigger to any table you want to archive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TRIGGER archive_users
    BEFORE DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION archive_on_delete();

CREATE TRIGGER archive_documents
    BEFORE DELETE ON documents
    FOR EACH ROW EXECUTE FUNCTION archive_on_delete();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
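
&lt;p&gt;The PostgreSQL specifics above (JSONB, &lt;code&gt;plpgsql&lt;/code&gt;, &lt;code&gt;TG_TABLE_NAME&lt;/code&gt;) don't translate to every database, but the mechanics are easy to demo end-to-end. Here's a minimal, hypothetical version using Python's bundled SQLite – a per-table archive instead of a generic JSON one, just to show the trigger firing:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);

-- Archived rows live in their own table, not next to live data.
CREATE TABLE users_archive (
    id INTEGER,
    name TEXT,
    archived_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- Copy the row into the archive before it is deleted.
CREATE TRIGGER archive_users BEFORE DELETE ON users
BEGIN
    INSERT INTO users_archive (id, name) VALUES (OLD.id, OLD.name);
END;

INSERT INTO users (name) VALUES ('alice');
DELETE FROM users;
""")

remaining = conn.execute("SELECT count(*) FROM users").fetchone()[0]
archived = conn.execute("SELECT name FROM users_archive").fetchall()
```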



&lt;h2&gt;
  
  
  Handling foreign key cascades
&lt;/h2&gt;

&lt;p&gt;When a parent record is deleted, PostgreSQL cascades the delete to child records. These child deletes also fire triggers, but in the context of a cascade, you often want to know &lt;em&gt;why&lt;/em&gt; a record was deleted.&lt;/p&gt;

&lt;p&gt;One approach is to use a session variable to track the root cause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
DECLARE
    cause_table TEXT;
    cause_id TEXT;
BEGIN
    -- Check if we're in a cascade context
    cause_table := current_setting('archive.cause_table', true);
    cause_id := current_setting('archive.cause_id', true);

    -- If this is a top-level delete, set ourselves as the cause.
    -- (current_setting can return '' rather than NULL for a custom
    -- variable that was set and reset earlier in the session.)
    IF cause_table IS NULL OR cause_table = '' THEN
        PERFORM set_config('archive.cause_table', TG_TABLE_NAME, true);
        PERFORM set_config('archive.cause_id', OLD.id::TEXT, true);
        cause_table := TG_TABLE_NAME;
        cause_id := OLD.id::TEXT;
    END IF;

    INSERT INTO archive (id, table_name, record_id, data, caused_by_table, caused_by_id)
    VALUES (
        gen_random_uuid(),
        TG_TABLE_NAME,
        OLD.id::TEXT,
        to_jsonb(OLD),
        cause_table,
        cause_id
    );
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when you delete a user, you can see which archived documents were deleted because of that user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM archive
WHERE caused_by_table = 'users'
AND caused_by_id = '123e4567-e89b-12d3-a456-426614174000';

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Triggers add some overhead to deletes, and the archive table will grow. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live tables stay clean – no &lt;code&gt;archived_at&lt;/code&gt; columns, no dead rows&lt;/li&gt;
&lt;li&gt;Cleaning up the archive table is trivial with &lt;code&gt;WHERE archived_at &amp;lt; NOW() - INTERVAL '90 days'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Queries don't need to filter out archived records&lt;/li&gt;
&lt;li&gt;Indexes stay efficient&lt;/li&gt;
&lt;li&gt;Applications and migrations only deal with live data&lt;/li&gt;
&lt;li&gt;Backups of the main tables are smaller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The archive table can even live in a separate tablespace or be partitioned by time if it grows large.&lt;/p&gt;

&lt;h1&gt;
  
  
  WAL-based Change Data Capture
&lt;/h1&gt;

&lt;p&gt;PostgreSQL's write-ahead log (WAL) records every change to the database. Change data capture (CDC) tools can read the WAL and stream those changes to external systems. For archiving, you'd filter for DELETE events and write the deleted records to another datastore.&lt;/p&gt;

&lt;p&gt;Debezium is the most well-known tool for this. It connects to PostgreSQL's logical replication slot, reads changes, and publishes them to Kafka. From there, a consumer writes the data wherever you want – S3, Elasticsearch, another database, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL → Debezium → Kafka → Consumer → Archive Storage

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For simpler setups, there are lighter-weight alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pgstream&lt;/strong&gt; – streams WAL changes directly to webhooks or message queues without Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wal2json&lt;/strong&gt; – a PostgreSQL plugin that outputs WAL changes as JSON, which you can consume with a custom script&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pg_recvlogical&lt;/strong&gt; – PostgreSQL's built-in tool for reading logical replication streams&lt;/li&gt;
&lt;/ul&gt;
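
&lt;p&gt;As a sketch of the consumer side: a wal2json message (format v1) is a JSON document with a &lt;code&gt;change&lt;/code&gt; array, where each entry has a &lt;code&gt;kind&lt;/code&gt; and, for deletes, an &lt;code&gt;oldkeys&lt;/code&gt; object. A hypothetical filter that keeps only the deletes might look like this (field names follow wal2json's documented v1 output – verify against your version):&lt;/p&gt;

```python
import json

def deleted_rows(payload):
    """Pull DELETE events out of a wal2json (format v1) message."""
    msg = json.loads(payload)
    deletes = []
    for change in msg.get("change", []):
        if change.get("kind") == "delete":
            # oldkeys carries the replica-identity columns of the deleted row.
            keys = change.get("oldkeys", {})
            deletes.append({
                "table": change.get("table"),
                "keys": dict(zip(keys.get("keynames", []), keys.get("keyvalues", []))),
            })
    return deletes

# Illustrative payload shaped like wal2json v1 output.
sample = json.dumps({"change": [
    {"kind": "insert", "table": "users"},
    {"kind": "delete", "table": "users",
     "oldkeys": {"keynames": ["id"], "keyvalues": [42]}},
]})
```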

&lt;h2&gt;
  
  
  Operational complexity
&lt;/h2&gt;

&lt;p&gt;The main downside is operational overhead. You're running additional services that need to be monitored, maintained, and made fault-tolerant. Debezium with Kafka is a significant infrastructure investment – Kafka alone requires careful tuning and monitoring.&lt;/p&gt;

&lt;p&gt;The lighter-weight alternatives reduce this burden but shift reliability concerns to your custom code. If your consumer crashes or falls behind, you need to handle that gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  WAL retention and max_wal_size
&lt;/h2&gt;

&lt;p&gt;The critical concern here is WAL retention. PostgreSQL retains WAL segments until every replication slot has consumed them – &lt;code&gt;max_wal_size&lt;/code&gt; is only a checkpoint target and does not cap slot retention. If your CDC consumer stops processing – due to a bug, network issue, or downstream failure – WAL segments accumulate on the primary.&lt;/p&gt;

&lt;p&gt;If this continues unchecked, the primary database can run out of disk space and crash.&lt;/p&gt;

&lt;p&gt;PostgreSQL 13+ has &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; to limit how much WAL a slot can retain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a slot falls too far behind, PostgreSQL invalidates it rather than filling the disk. This protects the primary but means your CDC pipeline loses data and needs to be re-synced from a snapshot.&lt;/p&gt;

&lt;p&gt;You need monitoring and alerting on replication slot lag. If a slot starts falling behind, you want to know before it becomes a crisis.&lt;/p&gt;
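
&lt;p&gt;A sketch of what that monitoring might check – the SQL uses only built-in PostgreSQL views and functions, while the surrounding Python and the alert threshold are hypothetical:&lt;/p&gt;

```python
def slots_over_limit(rows, alert_bytes):
    """rows: (slot_name, lag_bytes) tuples from the query below.
    Returns the names of slots whose retained WAL exceeds alert_bytes."""
    return [name for name, lag in rows
            if lag is not None and lag > alert_bytes]

# Lag per slot, in bytes. pg_replication_slots and pg_wal_lsn_diff are
# built into PostgreSQL; no extension required.
SLOT_LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
"""
```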

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;The strengths of WAL-based CDC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures all changes without modifying application code or adding triggers&lt;/li&gt;
&lt;li&gt;Can stream to any destination (object storage, data warehouses, search indexes)&lt;/li&gt;
&lt;li&gt;The primary database has no additional query load – it just writes WAL as normal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significant operational complexity, especially with Kafka-based setups&lt;/li&gt;
&lt;li&gt;Risk to primary database stability if consumers fall behind&lt;/li&gt;
&lt;li&gt;Schema changes require careful coordination between source and consumers&lt;/li&gt;
&lt;li&gt;More infrastructure to understand, deploy, and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach makes the most sense when you already have Kafka or similar infrastructure, or when you need to stream changes to multiple destinations beyond just archiving.&lt;/p&gt;

&lt;h1&gt;
  
  
  Replica that doesn't process deletes
&lt;/h1&gt;

&lt;p&gt;This is an idea I had never considered until I wrote this post – I haven't tested this, it's just an idea.&lt;/p&gt;

&lt;p&gt;What if you kept a PostgreSQL replica (e.g. using logical replication) that just didn't process DELETE queries? Would it effectively accumulate records and updates without conflict over time?&lt;/p&gt;

&lt;p&gt;One potential benefit of this is that the archive can be easily queried, so finding old data is simple.&lt;/p&gt;

&lt;p&gt;Would the replica have &lt;em&gt;any&lt;/em&gt; information about deletes? Could it separate live from deleted data? Would you be able to find a record that was "deleted 2 hours ago in account 123" for a customer? Perhaps instead of ignoring DELETE queries entirely, you could have a specialized replica that transforms DELETE events into an &lt;code&gt;archived_at&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;One potential pitfall here could be schema migrations – would the archive run into difficulty applying migrations over time?&lt;/p&gt;

&lt;p&gt;Another downside might be cost – running a full replica and retaining all of that storage carries a non-trivial price tag, on top of the operational overhead of another database.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping up
&lt;/h1&gt;

&lt;p&gt;If I were starting a new project today and needed soft delete, I'd reach for the trigger-based approach first. It's simple to set up, keeps live tables clean, and doesn't require extra infrastructure. The archive table is easy to query when you need it, and easy to ignore when you don't.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>sql</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Building an access framework using Cedar</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/building-an-access-framework-using-cedar-2lk9</link>
      <guid>https://forem.com/buchanae/building-an-access-framework-using-cedar-2lk9</guid>
      <description>&lt;p&gt;I’ve been pondering how to provide a robust foundation for access control in atlas9. By “access control”, I mean answering questions like “can Alice edit document Y”.&lt;/p&gt;

&lt;p&gt;Access control can be tricky – it’s simple at first, but the complexity reveals itself slowly over time, and one day the system is complicated, poorly understood, fragile, and changes become difficult, slow, and risky.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I wish we would collectively forget the terms authentication and authorization. We need one of those machines from &lt;a href="https://en.wikipedia.org/wiki/Eternal_Sunshine_of_the_Spotless_Mind" rel="noopener noreferrer"&gt;Eternal Sunshine of the Spotless Mind&lt;/a&gt; to erase them from our memories. They’re too close in spelling, which causes confusion and complicates discussion. I recommend using “identity” and “access” instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here are some of the factors to consider when thinking about access control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Users and groups.&lt;/strong&gt; In the beginning, projects may get away with granting access to specific users, but inevitably there are enough users and resources that users need to be organized into groups. Groups may even contain other groups. This creates permission inheritance (user A is in Group B which is in Group C, and each level has permissions attached).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Groups of resources.&lt;/strong&gt; Similarly, projects may get away with having all resources in one namespace, but eventually they need to be organized into groups, and maybe even a hierarchy of groups. Think of folders of files, or account→organization→workspace hierarchy, etc. This also creates permission inheritance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mapping roles to permissions.&lt;/strong&gt; Imagine an issue tracking system. Projects may start with “Admin, Editor, Viewer” roles. And then they add a “Metadata Editor” role which gives access to edit only issue metadata but not content, or a “Commenter” role that gives access to comment, but not edit. And then the system adds a “Discussion” feature, or a “Report” feature. Does “Editor” apply to discussion and reports? Is it possible to rename “Editor”?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Querying for permissions.&lt;/strong&gt; The UI for a project usually needs to know what access a user has, so it can hide a button, show a message, give hints. What about lists? Are you using object-level permissions, or even just showing a table of resources from multiple workspaces, with a “delete” button for each row? Is that going to be efficient?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoiding work.&lt;/strong&gt; I usually try to do as little work as possible before access checks, so that unauthorized callers are not using up server and database resources. In the worst case, this could be considered an attack vector by malicious users. How does this work in systems with a hierarchy of resources? Is the “organization ID” or “workspace ID” present in the request, or do you need to look it up?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Default deny.&lt;/strong&gt; Preferably, requests should be denied unless explicitly allowed. The benefit is that it’s harder to forget to add access checks. It can be difficult to apply this in practice. You can have a policy engine that defaults to deny, but you still have to apply access checks in all the right places – servers and databases are “default allow” by design, so it’s up to us to put checks in all the right places.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiple services? Multiple languages?&lt;/strong&gt; Lots of systems have multiple backend services, sometimes written in a mix of programming languages. How will the access system scale across all this? Can anyone truly understand, learn, audit, or test an access system with rules and implementations spread across all this territory?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Object-level permissions are tempting, and tricky.&lt;/strong&gt; In my experience, most systems get pretty far by managing permissions for broad scopes – they grant access at the “workspace” level, and that access applies to all resources in that workspace. Over time, there’s a constant temptation to add object-level permissions – e.g. “we need contractors to have edit access to the design documents for project foo”. Object-level permissions are a natural fit for so many use cases, and it’s unfortunate when a project can’t easily provide them. The biggest hurdle is usually solving the “search/filter” use case, where a user can use a “list” endpoint with various filters and &lt;em&gt;efficiently&lt;/em&gt; get only the resources they have permission to view.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing.&lt;/strong&gt; Testing access control in a &lt;em&gt;meaningful&lt;/em&gt; way is very difficult. If you mock parts of the system, you risk mocking out access checks. And even a simple system with a few resource types and roles can create a combinatoric explosion of test cases that most people don’t want to manually write and maintain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checking referenced resources.&lt;/strong&gt; I think the focus of access checks is often on API endpoints – the &lt;code&gt;update(id)&lt;/code&gt; handler checks whether the user has access to update the given resource. But what if the resource references another resource? Does it check access to that resource? What if there’s a &lt;code&gt;bulkImport()&lt;/code&gt; handler? Does it reuse the same code to check access to all resources and all referenced resources? Projects need a reliable pattern for organizing this code so that it’s hard to miss access checks, especially for referenced resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
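
&lt;p&gt;The “default deny” point above is worth sketching. The shape below is hypothetical (a hard-coded grant set standing in for a policy engine like Cedar), but the pattern is the one that matters: the check runs before any work happens, and anything not explicitly granted raises:&lt;/p&gt;

```python
class AccessDenied(Exception):
    pass

# Hypothetical policy data: explicit (principal, action, resource) grants.
# A real system would call a policy engine here instead of a set lookup.
GRANTS = {("alice", "viewReport", "r1")}

def check_access(principal, action, resource_id):
    # Default deny: anything not explicitly granted raises.
    if (principal, action, resource_id) not in GRANTS:
        raise AccessDenied(principal + " may not " + action + " " + resource_id)

def view_report(principal, report_id):
    check_access(principal, "viewReport", report_id)  # check before doing work
    return {"id": report_id}
```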

&lt;p&gt;So, in summary, what do I want for atlas9?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;easily support users, groups of users, and groups of groups&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;easily support nested groups of resources (like a filesystem tree)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;support object-level permissions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;check early, check often&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;a default-deny policy engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;easy testing, perhaps even automated&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cedar: defining and evaluating policies
&lt;/h3&gt;

&lt;p&gt;With a vague understanding of what I wanted, I set out to search for libraries or systems that would help me build this foundation. I had come across a few projects previously: &lt;a href="https://openfga.dev/" rel="noopener noreferrer"&gt;OpenFGA&lt;/a&gt;, &lt;a href="https://www.openpolicyagent.org/docs/policy-language" rel="noopener noreferrer"&gt;OPA/Rego&lt;/a&gt;, and &lt;a href="https://www.cedarpolicy.com/en" rel="noopener noreferrer"&gt;Cedar&lt;/a&gt;. I liked the look of Cedar – the policy language is easy to read, it’s implemented as a library, it’s fast, it’s created by AWS, thoroughly tested and verified, and more.&lt;/p&gt;

&lt;p&gt;Let’s design a “report” feature using Cedar. Users build reports to tell a story with data – a document with text and embedded charts that query data.&lt;/p&gt;

&lt;p&gt;We’ll start simple – the report owner can take any action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(principal, action, resource is Report)
when { resource.owner == principal };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reports can be published publicly, allowing anonymous read access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(
  principal == User::"anonymous", 
  action == Action::"viewReport", 
  resource is Report
)
when { resource.public };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application code might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func viewReport(user, reportId) {
  report = loadReport(reportId)
  checkAccess(user, "viewReport", report)
  return report
}

func editReport(user, report) {
  checkAccess(user, "editReport", report)
  saveReport(report)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s add the ability to grant access to other users.&lt;/p&gt;

&lt;p&gt;There are multiple ways to approach this (see &lt;a href="https://docs.cedarpolicy.com/bestpractices/bp-relationship-representation.html" rel="noopener noreferrer"&gt;Representing Relationships&lt;/a&gt;). We’re going to use the “attribute-based relationships” approach here, because it’s great for clearly demonstrating relationships in policies, but we’ll talk about “template-based relationships” and Policy Templates later in the post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(principal, action == Action::"viewReport", resource is Report)
when { principal in resource.viewers };

permit(principal, action == Action::"editReport", resource is Report)
when { principal in resource.editors };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Relationships and Hierarchy
&lt;/h4&gt;

&lt;p&gt;If there are more than a few users, it quickly becomes impractical to manage access for every individual user, so users belong to groups, and access can be granted to groups.&lt;/p&gt;

&lt;p&gt;We don’t need to change the policies to support groups. The &lt;a href="https://docs.cedarpolicy.com/policies/syntax-operators.html#operator-in" rel="noopener noreferrer"&gt;“in” operator&lt;/a&gt; tests hierarchy membership. All we need to do is tell Cedar about our parent-child relationships. That brings us to “&lt;a href="https://docs.cedarpolicy.com/policies/syntax-entity.html" rel="noopener noreferrer"&gt;entities&lt;/a&gt;”, which is how we tell Cedar about our principal and resource data. For example, some of the entities in our report system might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;instance Group::"marketing" = {};
instance Group::"contractors" = {};
instance User::"alice" in [Group::"marketing"] = {};
instance User::"mark" in [Group::"contractors"] = {};

instance Report::"1" = {
  editors: [User::"alice", Group::"marketing"],
  viewers: [Group::"contractors"]
};

permit(principal, action == Action::"viewReport", resource is Report)
when { principal in resource.viewers };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;I’m using syntax for entity definitions that is not yet implemented, because it’s easier to read. See &lt;a href="https://github.com/benkehoe/cedar-rfcs/blob/entity-literal-syntax/text/0000-entity-literal-syntax.md" rel="noopener noreferrer"&gt;this RFC&lt;/a&gt;. Entities are usually described by JSON. In fact, &lt;em&gt;everything&lt;/em&gt; in Cedar can be described with JSON: entities, policies, schema, etc.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When “mark” views the report, Cedar can work out that “mark” is in the “contractors” group, which is in the “viewers” set.&lt;/p&gt;

&lt;p&gt;This already supports nested groups, so you could imagine groups matching an org chart, for example. It’s just up to our application data model to describe the relationships to Cedar.&lt;/p&gt;

&lt;p&gt;Similarly, users might want to organize reports into groups. You might do that by traversing the directory tree from the report to the root, getting the set of users/groups that have viewer/editor access at each level, and adding all that information to the “report.editors” and “report.viewers” sets.&lt;/p&gt;
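
&lt;p&gt;That flattening step can be sketched as a simple walk up the tree. The &lt;code&gt;PARENTS&lt;/code&gt; and &lt;code&gt;VIEWERS&lt;/code&gt; structures here are hypothetical application data, not Cedar concepts:&lt;/p&gt;

```python
# Hypothetical app data: report folders form a tree via PARENTS, and
# viewer access can be granted at any level via VIEWERS.
PARENTS = {"q3-report": "finance", "finance": "root"}
VIEWERS = {
    "root": {"Group::admins"},
    "finance": {"Group::accounting"},
    "q3-report": {"User::mark"},
}

def effective_viewers(node):
    """Union of viewer grants from node up to the root, suitable for
    handing to the policy engine as the resource's viewers set."""
    out = set()
    while node is not None:
        out = out | VIEWERS.get(node, set())
        node = PARENTS.get(node)
    return out
```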

&lt;h4&gt;
  
  
  Checking References
&lt;/h4&gt;

&lt;p&gt;Let’s add a wrinkle: reports use datasets to provide charts, but some datasets are sensitive and need to be protected. A user can’t create a report using datasets they don’t have access to, and similarly, a user can’t view a report if it uses data they don’t have access to.&lt;/p&gt;

&lt;p&gt;This wrinkle demonstrates &lt;em&gt;references&lt;/em&gt;, which is one of those things that might be easy to miss in access control.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(principal, action == Action::"viewDataset", resource is Dataset)
when { principal in resource.viewers };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll cover this more later, but note that while Cedar is great, you still need to ensure you execute the proper access checks at the application layer. So we’ll need to update our application code to check these dataset references:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func viewReport(user, reportId) {
  report = loadReport(reportId)
  checkAccess(user, "viewReport", report)

  for ds in report.datasets {
    checkAccess(user, "viewDataset", ds)
  }

  return report
}

func editReport(user, report) {
  checkAccess(user, "editReport", report)

  for ds in report.datasets {
    checkAccess(user, "viewDataset", ds)
  }

  saveReport(report)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cedar seems limited in this respect – it can’t model this relationship, as far as I can tell. It seems like an important characteristic of real world data models. I haven’t done deep research on other policy frameworks like OpenFGA, SpiceDB, OPA/Rego yet. Claude tells me it’s possible to model this in those systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// SpiceDB
definition user {}

definition dataset {
    relation viewer: user
    permission view = viewer
}

definition report {
    relation datasets: dataset
    relation direct_viewer: user

    // User can view report if they can view ALL referenced datasets
    permission view = direct_viewer &amp;amp; datasets.all(view)
}

// OpenFGA
type user
type dataset
  relations
    define viewer: [user]

type report
  relations
    define datasets: [dataset]
    define viewer: [user] or viewer from datasets

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perhaps I need to do another deep dive on OpenFGA, OPA/Rego, or SpiceDB. If you want to see that, let me know in the comments.&lt;/p&gt;

&lt;p&gt;One item on my TODO list is to come up with a way to model this generically across an application’s data model in Cedar, so that it’s harder to forget to check access to references like this.&lt;/p&gt;

&lt;h4&gt;
  
  
  Listing Reports
&lt;/h4&gt;

&lt;p&gt;Now for something harder: we’ve implemented object-level access control, but we haven’t implemented a “list” function. Solving this problem &lt;em&gt;efficiently&lt;/em&gt; can be tricky.&lt;/p&gt;

&lt;p&gt;The brute-force approach is to execute the list query, and check the access for each result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func listReports(principal, listRequest) {
  results = []
  q = buildReportsQueryWithFilters(listRequest.filters)
  for item in executeQuery(q) {
    if cedar.checkAccess(principal, "viewReport", item) {
      results.append(item)
    }
  }
  return results
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s an incomplete solution – pagination adds significant complexity. You might need to run multiple queries to get a full page of results. Cursor-based pagination might be easier to implement than offset-based pagination. You might have to think carefully about the potential overhead of multiple queries – you might be filtering out lots of rows that the user doesn’t have access to. You might need to add some required filters to reduce the potential scope. &lt;a href="https://openfga.dev/docs/interacting/search-with-permissions" rel="noopener noreferrer"&gt;OpenFGA has some docs&lt;/a&gt; with more detail.&lt;/p&gt;
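
&lt;p&gt;A hypothetical sketch of cursor-based pagination with post-filtering – keep fetching pages until enough authorized rows are collected (&lt;code&gt;can_view&lt;/code&gt; and &lt;code&gt;fetch_page&lt;/code&gt; stand in for the policy check and the database query):&lt;/p&gt;

```python
def list_reports(principal, can_view, fetch_page, page_size):
    """Keep fetching DB pages until page_size authorized rows are
    collected, or the data runs out. fetch_page(cursor) returns
    (rows, next_cursor); next_cursor is None on the last page."""
    results = []
    cursor = 0
    while page_size > len(results):
        rows, cursor = fetch_page(cursor)
        # Per-row policy check filters out unauthorized rows.
        results.extend(r for r in rows if can_view(principal, r))
        if cursor is None:
            break
    return results[:page_size]
```

Note the worst case: if the user can see almost nothing, this scans far past the requested page – exactly the overhead concern described above.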

&lt;p&gt;In our hypothetical system, we don’t need to worry about the overhead of Cedar itself – it can easily handle doing lots of policy evaluations, and it’s (hypothetically) running in-process, so there’s no network overhead or batching to worry about (although that could be a real world concern, depending on how you deploy Cedar).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I left a bug in the &lt;code&gt;listReports()&lt;/code&gt; code. Can you see it? Imagine you’re reviewing my PR. We just talked about it. I forgot to check that the user can view the referenced datasets! I can &lt;em&gt;easily&lt;/em&gt; imagine this bug happening in the real world – someone implements &lt;code&gt;viewReport()&lt;/code&gt;, and then 2 months later someone else implements (or rewrites, or duplicates + tweaks, etc.) &lt;code&gt;listReports()&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Partial Evaluation
&lt;/h4&gt;

&lt;p&gt;Cedar has a fascinating capability called &lt;em&gt;&lt;a href="https://www.cedarpolicy.com/blog/tpe" rel="noopener noreferrer"&gt;partial evaluation&lt;/a&gt;.&lt;/em&gt; Partial evaluation allows you to evaluate policies with partial data, and Cedar will return Residuals that describe the missing parts. I can’t do a better job explaining it than &lt;a href="https://www.cedarpolicy.com/blog/tpe" rel="noopener noreferrer"&gt;the Cedar blog post,&lt;/a&gt; so I highly recommend checking it out.&lt;/p&gt;

&lt;p&gt;The most interesting bit is that you can use partial evaluation for listing which resources a principal has access to, or which users have access to a given resource. This could give us a more efficient implementation of the &lt;code&gt;listReports()&lt;/code&gt; function above – we’d convert the residuals into a SQL query, add the filters from the user’s request, and execute a SQL query that will return only the reports the user has access to. Theoretically, that’s a much more efficient implementation of &lt;code&gt;listReports()&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Policy Templates for Generic Access Control
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://docs.cedarpolicy.com/policies/templates.html" rel="noopener noreferrer"&gt;Policy templates&lt;/a&gt; allow us to easily fill in policy details with data from our database. Instead of using the attribute-based relationships we described above, we could instead store all permissions in a single table, and link the data with policies at evalution time. We’ll store the access grants in the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Table: report_permissions

principal      | report_id | action
---------------+-----------+-------
User/alice     | 1         | edit
User/bob       | 1         | view
User/anonymous | 1         | view

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we load this data when we check access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func viewReport(user, reportId) {
  perms = loadReportPerms(user)
  checkAccess(user, "viewReport", perms)
  report = loadReport(reportId)
  return report
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
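&lt;p&gt;For reference, a template itself looks like a normal policy with &lt;code&gt;?principal&lt;/code&gt; and &lt;code&gt;?resource&lt;/code&gt; slots that get filled in when the template is linked to concrete entities. A sketch matching the table above (Cedar only allows slots in the principal and resource positions, so each action gets its own template):&lt;/p&gt;

```
// Linked once per row of report_permissions with action = "view".
permit(
  principal == ?principal,
  action == Action::"view",
  resource == ?resource
);
```

&lt;p&gt;Each row in the table becomes one template link, so granting or revoking access is just inserting or deleting a row.&lt;/p&gt;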



&lt;h2&gt;
  
  
  Access Control is Tricky
&lt;/h2&gt;

&lt;p&gt;Perhaps the moral of this story is: access control is tricky, even at small scales. Take a good, hard look before you leap. Don’t back into it, unaware of the complexity ahead. Maybe Cedar will help you (and an atlas9 framework will help with that), maybe OpenFGA/OPA/etc., or maybe you write your own version, but keep it organized and robust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;I’ll do a follow-up post in the future to walk through an actual implementation of all the concepts discussed above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: Feature Flags, Entitlements, and more
&lt;/h2&gt;

&lt;p&gt;Many projects make use of feature flags to control the rollout of a feature, to provide long-lived controls for operators (e.g. manually disable expensive actions during overload), to provide entitlements (i.e. access to special features you have to pay for), to run A/B tests, and more.&lt;/p&gt;

&lt;p&gt;Many projects pay for a service like &lt;a href="https://launchdarkly.com/" rel="noopener noreferrer"&gt;LaunchDarkly&lt;/a&gt;, or maybe they build their own solution. I wonder if Cedar could provide a good foundation for implementing feature flags.&lt;/p&gt;

&lt;p&gt;Perhaps feature flags could be implemented as resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(
  principal,
  action == Action::Flags::"evaluate",
  resource is Flag
)
when {
  resource.deployed ||
  principal in resource.allow ||
  context.account_percentage &amp;lt; resource.percentage_enabled
}
unless {
  resource.blocked ||
  principal in resource.deny
};

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
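&lt;p&gt;The &lt;code&gt;context.account_percentage&lt;/code&gt; value has to come from somewhere. A common approach (sketched below; the function name is mine) is to hash the account ID into a stable bucket from 0 to 99, so a flag with &lt;code&gt;percentage_enabled = 25&lt;/code&gt; turns on for a consistent 25% of accounts:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// accountPercentage deterministically maps an account ID to a bucket
// in the range 0-99. Any stable hash works; FNV-1a is just cheap and
// built into the standard library.
func accountPercentage(accountID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return h.Sum32() % 100
}

func main() {
	percentageEnabled := uint32(25)
	for _, id := range []string{"acct-1", "acct-2", "acct-3"} {
		bucket := accountPercentage(id)
		// The flag is on when the account's bucket falls under the
		// enabled percentage; the same account always gets the same
		// answer, so the rollout doesn't flicker between requests.
		enabled := percentageEnabled > bucket
		fmt.Println(id, bucket, enabled)
	}
}
```

&lt;p&gt;The caller would pass that bucket as &lt;code&gt;context.account_percentage&lt;/code&gt; when asking the policy engine to evaluate the flag.&lt;/p&gt;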



&lt;p&gt;Perhaps entitlements could be implemented as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permit(
  principal,
  action in Action::"SSO_Actions",
  resource
)
when {
  Features::"SSO" in principal.plan
};

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An “enterprise” plan can use the entity hierarchy so that the SSO feature is in the plan, or the SSO feature could be put directly into the principal’s “plan” set, for example, if SSO can be sold as an individual add-on feature.&lt;/p&gt;
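&lt;p&gt;To sketch the hierarchy version in Cedar’s JSON entity format (the type and attribute names here just follow the policy above), making the SSO feature a child of the enterprise plan means the entitlement check passes for anyone whose plan set contains that plan:&lt;/p&gt;

```
[
  {
    "uid": { "type": "Features", "id": "SSO" },
    "attrs": {},
    "parents": [ { "type": "Plan", "id": "enterprise" } ]
  },
  {
    "uid": { "type": "User", "id": "alice" },
    "attrs": {
      "plan": [ { "__entity": { "type": "Plan", "id": "enterprise" } } ]
    },
    "parents": []
  }
]
```

&lt;p&gt;Selling SSO as an add-on would instead put the SSO feature entity directly into the user’s plan set, with no parent relationship required.&lt;/p&gt;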

&lt;h2&gt;
  
  
  More Resources
&lt;/h2&gt;

&lt;p&gt;There’s a ton of content out there on this topic. Here are some links if you’re interested in diving deeper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://slack.engineering/role-management-at-slack/" rel="noopener noreferrer"&gt;https://slack.engineering/role-management-at-slack/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/intuit-engineering/authz-intuits-unified-dynamic-authorization-system-bea554d18f91" rel="noopener noreferrer"&gt;https://medium.com/intuit-engineering/authz-intuits-unified-dynamic-authorization-system-bea554d18f91&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.osohq.com/post/why-authorization-is-hard" rel="noopener noreferrer"&gt;https://www.osohq.com/post/why-authorization-is-hard&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/building-carta/authz-cartas-highly-scalable-permissions-system-782a7f2c840f" rel="noopener noreferrer"&gt;https://medium.com/building-carta/authz-cartas-highly-scalable-permissions-system-782a7f2c840f&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/airbnb-engineering/himeji-a-scalable-centralized-system-for-authorization-at-airbnb-341664924574" rel="noopener noreferrer"&gt;https://medium.com/airbnb-engineering/himeji-a-scalable-centralized-system-for-authorization-at-airbnb-341664924574&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://authzed.com/blog/casbin" rel="noopener noreferrer"&gt;https://authzed.com/blog/casbin&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/alex-ac-r/9-access-control-and-permission-management-for-modern-web-app-j6k"&gt;https://dev.to/alex-ac-r/9-access-control-and-permission-management-for-modern-web-app-j6k&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>backend</category>
      <category>security</category>
    </item>
    <item>
      <title>Building a better software experience</title>
      <dc:creator>Alex B</dc:creator>
      <pubDate>Mon, 05 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/buchanae/building-a-better-software-experience-5977</link>
      <guid>https://forem.com/buchanae/building-a-better-software-experience-5977</guid>
      <description>&lt;p&gt;atlas9 is about an itch I need to scratch.&lt;/p&gt;

&lt;p&gt;There is so much &lt;strong&gt;stuff&lt;/strong&gt; to figure out in software projects: API design, databases, builds, deploys, async tasks, auth, workflows, releases, docs, flags, config, tests, telemetry, monitoring, logs, infrastructure-as-code, and on and on.&lt;/p&gt;

&lt;p&gt;Large amounts of energy (time, emotion, money) are spent on this stuff. And when teams get this stuff wrong, even &lt;em&gt;more&lt;/em&gt; energy is spent either correcting it, or &lt;em&gt;living with it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I want to build something that takes care of the common stuff, so that software teams are happier and more efficient. But, what exactly should I build? Who exactly am I building it for?&lt;/p&gt;

&lt;p&gt;I have thoughts and questions and some opinions, and I’m starting this blog to write those down, in the hope that they will resonate with others and that discussion and community will come to life.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Batteries not included. Glue required.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Part of the struggle is bringing technologies together.&lt;/p&gt;

&lt;p&gt;There’s no shortage of incredible tools at our disposal – PostgreSQL, Kubernetes, S3, Dynamo, Docker, Django, Rails, GitHub, Terraform, OpenTelemetry, Datadog, Splunk, AWS, GCP, Azure, etc.&lt;/p&gt;

&lt;p&gt;And yet, it feels harder than ever to pull it all together. Kubernetes and AWS are super powerful, and overwhelming. PostgreSQL feels easy and powerful, until the query planner changes its mind at 1am, or you run out of connections, or your vacuum isn’t running fast enough, or you need to upgrade. Terraform works great until it doesn’t. For all the words written about observability, I still struggle to set it all up and get the data I need (and I worked for an observability company for years!). Even in a sophisticated, mature web application framework, teams struggle with common things like API error messages, validation, defaults, and database transactions or replicas.&lt;/p&gt;

&lt;p&gt;And each piece makes it harder to have a smooth developer experience, or one that usefully matches real-world production behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  The devil in the details.
&lt;/h3&gt;

&lt;p&gt;I’m very fortunate to have always worked with great people. My favorite thing about work is always the people – they’re smart, experienced, fun, passionate, interesting, and I learn from them non-stop.&lt;/p&gt;

&lt;p&gt;And yet, collectively, we (myself included) get so many seemingly simple details wrong. Like keeping fields consistently snake_case or camelCase in an API. Or applying defaults and validation to API resources in a consistent way. Or using database transactions correctly. Or deciding where to set defaults in the many layers of config. I believe we all agree that by the year 2026, it should be really hard to get these seemingly simple details wrong.&lt;/p&gt;

&lt;p&gt;It points to the fact that building software systems, especially with more than a couple people, is still hard. I hope atlas9 can be a tool that takes care of more of these details.&lt;/p&gt;

&lt;p&gt;At the bottom of this post, I’ve included a long list of things that I’ve seen teams struggle with, or that I have struggled with myself. It’s a long list! And I bet it could be much longer if I A) had a better memory, B) didn’t block out bad memories, C) thought harder about it, and D) asked others to contribute their struggles. I think a lot of these topics are very common in software organizations of almost any size. It’s mind-boggling how much we have to figure out every time, and how much opportunity for missteps that creates.&lt;/p&gt;

&lt;h3&gt;
  
  
  A blurry vision of atlas9
&lt;/h3&gt;

&lt;p&gt;My experience is mostly in web applications and distributed systems, so atlas9 is focused on the issues I’ve encountered there.&lt;/p&gt;

&lt;p&gt;atlas9 starts with some thoughtful writing (and hopefully discussion) about various patterns and tradeoffs. That’s why I’m starting this blog.&lt;/p&gt;

&lt;p&gt;atlas9 will most likely include an application framework, probably written in Go. There are a lot of existing frameworks out there, most of which I’m not yet very familiar with, so I have some research to do. My experience with frameworks is that they’re flexible, and that flexibility leaves a lot of room for decisions to be made. Those decisions can take a lot of effort, and each decision has some risk of being a misstep. Also, honestly, I’ve become fairly annoyed with some of the patterns encouraged by popular languages and frameworks. Over the many years of my career I’ve developed my preferred style, and I expect that will be reflected in atlas9 to some degree.&lt;/p&gt;

&lt;p&gt;I suspect atlas9 will need to be fairly opinionated in some ways. I suspect that “taking care of the common stuff” will mean making decisions about which common patterns to use. The tradeoff might be that atlas9 doesn’t support every way of doing things, it doesn’t fit into every box, it doesn’t work for everyone.&lt;/p&gt;

&lt;p&gt;I suspect that atlas9 doesn’t &lt;em&gt;need&lt;/em&gt; to be just one design. I’ve been pondering the gap between small, medium, and large projects. A small project could be fine with a single server and a sqlite database. A medium project might want Kubernetes and multiple services. A different medium (or even large) project might be fine on a few servers with a single large postgres database. Can atlas9 be useful at all these scales? Which one am I designing for first?&lt;/p&gt;

&lt;h3&gt;
  
  
  This is not new.
&lt;/h3&gt;

&lt;p&gt;People have probably been feeling this way about technology forever. “It should be better than this” is perhaps a great driver of innovation in all areas of technology and knowledge.&lt;/p&gt;

&lt;p&gt;And there are people out there making it better, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Railway&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Render&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fly.io&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenShift&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Supabase&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;many others I’m sure (send me a comment or message with others)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these seem like really cool companies and products. I recognize that atlas9 is covering some of the same ground. That’s ok. There are probably still gaps to fill, and many people aren’t able or don’t want to use these hosted services. Figuring out what atlas9 can do that complements these projects is on the long list of questions I have in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Interested?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you’re interested in this idea, if you want to follow along or be part of the discussion, please let me know! You can subscribe, you can comment, or you can send me a message directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Brainstorm&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the end of the post. Below is a big, messy list of common topics that I’ve seen software teams encounter.&lt;/p&gt;

&lt;p&gt;Leave a comment or send me a message to add to this list. Let me know which common issues or struggles are important to you, and in the future I’ll publish an updated list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Docs and discussion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Development Environment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building and Shipping&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;APIs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Databases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caching and Performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Messaging and Async&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LLMs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Environments and Config&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Resilience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Search and Storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frontend&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Internationalization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-tenancy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Billing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Project Management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teams&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support and Users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compliance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Costs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
