Forem: Arnaud Dagnelies

Cloudflare Workers performance: an experiment with Astro and worldwide latencies

Arnaud Dagnelies — Mon, 19 Jan 2026 23:31:23 +0000

Why use Cloudflare Workers?

Cloudflare Workers let you host pages and run code without managing servers. Unlike traditional servers placed in one or a few locations, the deployed static assets and code are mirrored around the globe in the data centers shown as blue dots below. Naturally, this offers better latencies, scalability and robustness.

Their developer platform also extends beyond “Workers” (the compute part) and includes storage, databases, queues, AI and lots of other developer tooling. All of this comes with a generous free tier and reasonable pricing beyond that.

Why am I writing this? I had a good experience with it and find it fairly good, which is why I present it here. This article is not sponsored in any way. I just think developers have somewhat of a responsibility to talk about the tools they use in order to keep their ecosystem lively. I’ve seen too much good stuff getting abandoned because there was no “buzz”.

The benefits of using Cloudflare Workers are:

Great latencies worldwide
Unlimited scalability
No servers to take care of
Further tooling for data, files, AI, etc.
GitHub pull requests preview URLs
Free tier good enough for most hobby projects

When not to use it

Like every tool, it has use cases for which it shines and others it is not suited for. This is important to grasp, and understanding the underlying technology helps tremendously. Basically, it loads your whole app bundled as a script and evaluates it on the fly. It’s fast and works wonderfully if your API and the frameworks you use are slim and minimalistic. However, it would be ill-advised in the following use cases:

Large complex apps

The cost of evaluating your API / SSR script will grow as your app grows. The larger it becomes, the more inefficient each invocation will become. There are also limits on how large your “script” can be. Although that limit has been raised multiple times in the past, the inherent inefficiency of evaluating a large script will always remain. Thus, be careful when picking dependencies/frameworks since they can quickly bloat your codebase.

Heavy resource consumption

Due to its nature, it is not suited for computations requiring large amounts of CPU/RAM/time, like statistical models or scientific computation. Large caches are problematic too. Waiting for long-running async server-side requests is OK though: the execution is suspended in between and does not count towards execution time.

Long-lived connections

That’s also problematic. You should rather use polling than keeping connections open.

In other words: “The slimmer, the better!”

It’s kind of difficult to say what’s small enough and when it becomes too large. Workers are best suited for small, self-contained microservices. Even debugging with breakpoints might turn out to be challenging. For larger applications, traditional server deployments would be more suited.

What will we build?

A “Quote of the Day” Web application.

The purpose is not to build something big, but rather a simple proof-of-concept. The quotes will be stored in a KV store and fetched client-side. That way, we can measure how fast the whole thing works and whether it lives up to expectations.

The default version of https://quoted.day is available in two flavours:

https://quoted.day/spa: a static page, fetching the quote text/author asynchronously
https://quoted.day/ssr: Server-Side-Rendering, rendering the page with the quote on the server

I swapped which one is the default from time to time to perform experiments. Performance (latency) may vary depending on where you are located and whether what you fetch is “hot” or “cold”. Before we delve into details on how to build such an app, let’s take a look at the performance we can expect.

Benchmarking latencies worldwide

Unlike the internal Cloudflare latency measures, measured “inside” the worker and therefore quite optimistic, we will look at the “real” external latency thanks to the great tool https://www.openstatus.dev/play/checker .

Thanks to that, we can obtain a pretty good idea of the overall latencies that can be observed all over the world. Note however that Australia, Asia and Africa may have rather erratic latencies that “jump” sometimes.

We will also benchmark multiple things separately:

Static assets
Stateless functions
Hot KV read
Cold KV read
KV writes

Also, every case will get “two passes”, to hopefully fill caches on the way, and only record the second one.
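
The “two passes” idea can be sketched roughly as follows. This is only an illustration, not how the checker tool works internally. The name measureSecondPass and its parameters are hypothetical, and the request is passed in as a function so that the sketch does not depend on any particular HTTP client.

```typescript
// Hypothetical helper: run a request twice and only report the second timing,
// so that the first pass can fill caches along the way.
async function measureSecondPass(
  doRequest: () => Promise<unknown>,
  now: () => number = () => performance.now(),
): Promise<number> {
  await doRequest(); // first pass: result and timing are discarded

  const start = now();
  await doRequest(); // second pass: this is the one that gets recorded
  return now() - start; // elapsed milliseconds
}

// Possible usage against one of the endpoints of this article:
// const ms = await measureSecondPass(() => fetch('https://quoted.day/api/time'));
```

Note that such a measurement only reflects the location it runs from, which is why an external checker with probes around the world is used here.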

Static assets

This was obtained by fetching the main page at https://quoted.day/spa

Region	Latency
🇩🇪 fra Frankfurt, Germany	30ms
🇩🇪 koyeb_fra Frankfurt, Germany	31ms
🇫🇷 cdg Paris, France	33ms
🇳🇱 railway_europe-west4-drams3a Amsterdam, Netherlands	33ms
🇬🇧 lhr London, United Kingdom	31ms
🇸🇪 arn Stockholm, Sweden	32ms
🇫🇷 koyeb_par Paris, France	31ms
🇳🇱 ams Amsterdam, Netherlands	54ms
🇺🇸 ewr Secaucus, New Jersey, USA	32ms
🇺🇸 iad Ashburn, Virginia, USA	36ms
🇺🇸 koyeb_was Washington, USA	35ms
🇨🇦 yyz Toronto, Canada	50ms
🇺🇸 ord Chicago, Illinois, USA	36ms
🇺🇸 lax Los Angeles, California, USA	28ms
🇺🇸 sjc San Jose, California, USA	26ms
🇺🇸 railway_us-east4-eqdc4a Virginia, USA	41ms
🇺🇸 railway_us-west2 California, USA	49ms
🇺🇸 koyeb_sfo San Francisco, USA	29ms
🇸🇬 railway_asia-southeast1-eqsg3a Singapore, Singapore	53ms
🇮🇳 bom Mumbai, India	95ms
🇺🇸 dfw Dallas, Texas, USA	30ms
🇯🇵 nrt Tokyo, Japan	28ms
🇦🇺 syd Sydney, Australia	31ms
🇸🇬 sin Singapore, Singapore	294ms
🇸🇬 koyeb_sin Singapore, Singapore	436ms
🇧🇷 gru Sao Paulo, Brazil	252ms
🇿🇦 jnb Johannesburg, South Africa	559ms
🇯🇵 koyeb_tyo Tokyo, Japan	28ms

Stateless function

This is obtained by fetching the endpoint https://quoted.day/api/time which simply returns the current time.
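
To give an idea of what “stateless” means here, a handler of this kind could look roughly like the sketch below. It is an assumption of what such an endpoint might do, not the code from the repository: the function name handleTimeRequest and the JSON shape are made up for illustration.

```typescript
// Minimal sketch of a stateless handler: no storage access, no external calls.
function handleTimeRequest(now: Date = new Date()): Response {
  const body = JSON.stringify({ time: now.toISOString() });
  return new Response(body, {
    headers: { 'content-type': 'application/json' },
  });
}

// In a Worker using the module syntax, it would be wired up roughly like this:
// export default { fetch: () => handleTimeRequest() };
```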

Region	Latency
🇬🇧 lhr London, United Kingdom	38ms
🇩🇪 koyeb_fra Frankfurt, Germany	32ms
🇳🇱 railway_europe-west4-drams3a Amsterdam, Netherlands	36ms
🇫🇷 cdg Paris, France	75ms
🇳🇱 ams Amsterdam, Netherlands	76ms
🇩🇪 fra Frankfurt, Germany	88ms
🇫🇷 koyeb_par Paris, France	73ms
🇸🇪 arn Stockholm, Sweden	97ms
🇺🇸 railway_us-east4-eqdc4a Virginia, USA	36ms
🇺🇸 koyeb_was Washington, USA	62ms
🇺🇸 ewr Secaucus, New Jersey, USA	95ms
🇺🇸 lax Los Angeles, California, USA	39ms
🇺🇸 sjc San Jose, California, USA	25ms
🇺🇸 iad Ashburn, Virginia, USA	92ms
🇺🇸 dfw Dallas, Texas, USA	90ms
🇨🇦 yyz Toronto, Canada	22ms
🇺🇸 ord Chicago, Illinois, USA	108ms
🇮🇳 bom Mumbai, India	99ms
🇸🇬 railway_asia-southeast1-eqsg3a Singapore, Singapore	45ms
🇯🇵 nrt Tokyo, Japan	27ms
🇺🇸 railway_us-west2 California, USA	99ms
🇧🇷 gru Sao Paulo, Brazil	89ms
🇦🇺 syd Sydney, Australia	26ms
🇸🇬 sin Singapore, Singapore	220ms
🇺🇸 koyeb_sfo San Francisco, USA	26ms
🇿🇦 jnb Johannesburg, South Africa	540ms
🇸🇬 koyeb_sin Singapore, Singapore	354ms
🇯🇵 koyeb_tyo Tokyo, Japan	71ms

Hot KV read

This is obtained by fetching a fixed quote from the KV store using the endpoint https://quoted.day/api/quote/123

Region	Latency
🇬🇧 lhr London, United Kingdom	34ms
🇫🇷 cdg Paris, France	39ms
🇳🇱 railway_europe-west4-drams3a Amsterdam, Netherlands	35ms
🇫🇷 koyeb_par Paris, France	37ms
🇸🇪 arn Stockholm, Sweden	34ms
🇳🇱 ams Amsterdam, Netherlands	77ms
🇩🇪 koyeb_fra Frankfurt, Germany	103ms
🇨🇦 yyz Toronto, Canada	25ms
🇺🇸 dfw Dallas, Texas, USA	33ms
🇺🇸 koyeb_was Washington, USA	55ms
🇩🇪 fra Frankfurt, Germany	168ms
🇺🇸 iad Ashburn, Virginia, USA	106ms
🇺🇸 railway_us-west2 California, USA	52ms
🇺🇸 ewr Secaucus, New Jersey, USA	122ms
🇺🇸 koyeb_sfo San Francisco, USA	33ms
🇺🇸 railway_us-east4-eqdc4a Virginia, USA	123ms
🇿🇦 jnb Johannesburg, South Africa	43ms
🇮🇳 bom Mumbai, India	99ms
🇸🇬 railway_asia-southeast1-eqsg3a Singapore, Singapore	88ms
🇺🇸 ord Chicago, Illinois, USA	69ms
🇧🇷 gru Sao Paulo, Brazil	99ms
🇺🇸 sjc San Jose, California, USA	40ms
🇦🇺 syd Sydney, Australia	64ms
🇺🇸 lax Los Angeles, California, USA	91ms
🇸🇬 sin Singapore, Singapore	345ms
🇯🇵 nrt Tokyo, Japan	126ms
🇯🇵 koyeb_tyo Tokyo, Japan	65ms
🇸🇬 koyeb_sin Singapore, Singapore	856ms

Cold KV read

This is obtained by fetching a random quote from the KV store using the endpoint https://quoted.day/api/quote

Note that each call will cache the result for a day at the edge location, resulting in possibly turning cold reads into hot reads as traffic increases.

Region	Latency
🇩🇪 fra Frankfurt, Germany	131ms
🇩🇪 koyeb_fra Frankfurt, Germany	105ms
🇬🇧 lhr London, United Kingdom	110ms
🇳🇱 ams Amsterdam, Netherlands	130ms
🇫🇷 cdg Paris, France	145ms
🇸🇪 arn Stockholm, Sweden	134ms
🇫🇷 koyeb_par Paris, France	127ms
🇳🇱 railway_europe-west4-drams3a Amsterdam, Netherlands	133ms
🇺🇸 ewr Secaucus, New Jersey, USA	197ms
🇺🇸 ord Chicago, Illinois, USA	201ms
🇺🇸 iad Ashburn, Virginia, USA	220ms
🇨🇦 yyz Toronto, Canada	243ms
🇺🇸 koyeb_was Washington, USA	229ms
🇺🇸 dfw Dallas, Texas, USA	287ms
🇺🇸 railway_us-east4-eqdc4a Virginia, USA	270ms
🇸🇬 sin Singapore, Singapore	288ms
🇺🇸 sjc San Jose, California, USA	245ms
🇮🇳 bom Mumbai, India	502ms
🇿🇦 jnb Johannesburg, South Africa	322ms
🇸🇬 railway_asia-southeast1-eqsg3a Singapore, Singapore	323ms
🇺🇸 lax Los Angeles, California, USA	247ms
🇺🇸 koyeb_sfo San Francisco, USA	217ms
🇺🇸 railway_us-west2 California, USA	300ms
🇧🇷 gru Sao Paulo, Brazil	601ms
🇯🇵 nrt Tokyo, Japan	822ms
🇸🇬 koyeb_sin Singapore, Singapore	574ms
🇯🇵 koyeb_tyo Tokyo, Japan	335ms
🇦🇺 syd Sydney, Australia	964ms

KV writes

This is obtained by fetching quoted.day/api/bump-counter which creates a temporary KV pair with an expiration time of 10 minutes. It kind of emulates the concept of initiating a “session”.
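
A write with an expiration can be sketched as follows. The KV put call accepts an expirationTtl option in seconds. Everything else is hypothetical: the KVWriter interface is a minimal stand-in for the real binding, and startSession with its key naming is an illustration, not the code behind the bump-counter endpoint.

```typescript
// Minimal stand-in for the part of the Workers KV binding used here.
interface KVWriter {
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

const SESSION_TTL_SECONDS = 10 * 60; // 10 minutes, as in the benchmark

// Hypothetical helper: store a short-lived pair that expires on its own.
async function startSession(kv: KVWriter, sessionId: string): Promise<void> {
  const value = JSON.stringify({ createdAt: Date.now() });
  await kv.put(`session:${sessionId}`, value, { expirationTtl: SESSION_TTL_SECONDS });
}
```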

Region	Latency
🇫🇷 cdg Paris, France	128ms
🇩🇪 koyeb_fra Frankfurt, Germany	151ms
🇩🇪 fra Frankfurt, Germany	147ms
🇫🇷 koyeb_par Paris, France	194ms
🇳🇱 ams Amsterdam, Netherlands	145ms
🇸🇪 arn Stockholm, Sweden	240ms
🇬🇧 lhr London, United Kingdom	176ms
🇺🇸 dfw Dallas, Texas, USA	212ms
🇺🇸 railway_us-west2 California, USA	238ms
🇺🇸 koyeb_was Washington, USA	305ms
🇺🇸 railway_us-east4-eqdc4a Virginia, USA	295ms
🇺🇸 ewr Secaucus, New Jersey, USA	408ms
🇺🇸 iad Ashburn, Virginia, USA	423ms
🇨🇦 yyz Toronto, Canada	337ms
🇺🇸 ord Chicago, Illinois, USA	359ms
🇸🇬 koyeb_sin Singapore, Singapore	409ms
🇺🇸 lax Los Angeles, California, USA	335ms
🇮🇳 bom Mumbai, India	347ms
🇺🇸 sjc San Jose, California, USA	438ms
🇺🇸 koyeb_sfo San Francisco, USA	247ms
🇸🇬 sin Singapore, Singapore	508ms
🇯🇵 nrt Tokyo, Japan	684ms
🇦🇺 syd Sydney, Australia	713ms
🇯🇵 koyeb_tyo Tokyo, Japan	734ms
🇳🇱 railway_europe-west4-drams3a Amsterdam, Netherlands	1,259ms
🇸🇬 railway_asia-southeast1-eqsg3a Singapore, Singapore	1,139ms
🇿🇦 jnb Johannesburg, South Africa	2,266ms

SSR Page with KV cold reads

Lastly, in this test, we combine reading a random quote (which usually results in a cold KV read) and rendering it server-side in a page.

Region	Latency
🇫🇷 koyeb_par Paris, France	111ms
🇬🇧 lhr London, United Kingdom	108ms
🇳🇱 railway_europe-west4-drams3a Amsterdam, Netherlands	125ms
🇫🇷 cdg Paris, France	133ms
🇩🇪 koyeb_fra Frankfurt, Germany	139ms
🇩🇪 fra Frankfurt, Germany	146ms
🇸🇪 arn Stockholm, Sweden	142ms
🇳🇱 ams Amsterdam, Netherlands	70ms
🇺🇸 railway_us-east4-eqdc4a Virginia, USA	151ms
🇺🇸 koyeb_was Washington, USA	159ms
🇺🇸 ewr Secaucus, New Jersey, USA	201ms
🇺🇸 iad Ashburn, Virginia, USA	209ms
🇺🇸 ord Chicago, Illinois, USA	217ms
🇺🇸 dfw Dallas, Texas, USA	220ms
🇺🇸 sjc San Jose, California, USA	191ms
🇺🇸 railway_us-west2 California, USA	201ms
🇨🇦 yyz Toronto, Canada	255ms
🇺🇸 lax Los Angeles, California, USA	257ms
🇺🇸 koyeb_sfo San Francisco, USA	268ms
🇮🇳 bom Mumbai, India	422ms
🇯🇵 nrt Tokyo, Japan	332ms
🇸🇬 sin Singapore, Singapore	284ms
🇧🇷 gru Sao Paulo, Brazil	327ms
🇸🇬 railway_asia-southeast1-eqsg3a Singapore, Singapore	632ms
🇸🇬 koyeb_sin Singapore, Singapore	677ms
🇿🇦 jnb Johannesburg, South Africa	673ms
🇦🇺 syd Sydney, Australia	385ms
🇯🇵 koyeb_tyo Tokyo, Japan	350ms

Observations

It is interesting to see how you can infer how the KV works just by watching the numbers. It appears the KV store is not actively replicated, but rather KV pairs are copied “on-demand” at remote locations. When cached (by default 1 minute), subsequent reads are fast. The latencies of such “hot” KV pairs are pretty good overall. No complaints here. How long the pair remains cached there can also be configured using the cacheTtl parameter during the KV get request. However, the downside of increasing that value is that this cached copy does not reflect changes / updates triggered from other locations during that time.
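
As a rough sketch, a read with a longer edge cache could look like this. The cacheTtl option (in seconds) is part of the KV get call. The KVReader interface is a minimal stand-in for the real binding, and readQuote with its key naming is a hypothetical helper, not the repository's code.

```typescript
// Minimal stand-in for the part of the Workers KV binding used here.
interface KVReader {
  get(key: string, options?: { cacheTtl?: number }): Promise<string | null>;
}

const ONE_DAY_SECONDS = 24 * 60 * 60;

// Read a quote and ask the edge location to keep its cached copy for a day.
// Trade-off: updates made from other locations may stay invisible that long.
async function readQuote(kv: KVReader, id: string): Promise<string | null> {
  return kv.get(`quote:${id}`, { cacheTtl: ONE_DAY_SECONDS });
}
```

A long TTL fits data that rarely changes, like quotes. For values that are updated often, a shorter TTL is the safer choice.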

Unsurprisingly, cold reads have worse latencies. The other thing you can infer from the numbers is that there seems to be an “origin location”, and cold read latencies increase roughly in proportion to the distance to this location. Therefore, pay attention to “where” you create the KV store, as it impacts all future latencies around the globe. Note that Workers KV might change in the future; this is merely an observation of its state right now.

While read operations are OK, the write operations are rather disappointing right now. I expected them to have great latencies too, writing to the “edge” and letting the propagation take place asynchronously, but it is the opposite. Writes appear to communicate with the “origin” storage. The time it takes to set a value gets higher the further away you are from where you created the KV store. This is kind of bad news, because setting/updating values is a pretty common operation, for example to authenticate users. Dear Cloudflare team, I hope you improve that part in the future.

A word of caution

If you develop your webapp, publish it and take a look at it, you will probably not even notice the bad latencies. You will face the optimal latencies with the origin KV store being near you. However, someone at the other end of the planet will have an uglier experience. If that person has a handful of cache misses or writes, the response time might quickly climb into a few seconds before the response arrives. That is not how I would expect a “distributed” KV store to behave. Let us be clear, right now this behaves more like a centralized KV store with on-demand cached copies at the edge.

Quite ironically, it basically feels more like a traditional single-location database right now (+caches). While the latency of a single cache miss or a single write is not dramatic, it can quickly pile up with multiple calls, and write-heavy webapps especially risk increased “sluggishness” depending on the user's location. Here as well, being “minimalistic” regarding KV calls should be taken to heart when designing a webapp using Workers.

Lastly, there was one more setting available in the Worker: “Default Placement” vs “Smart Placement”. I tried both but did not see noticeable changes in the latencies. I think it’s due to the fact that there is a single KV store call and that it takes time and traffic to gather telemetry and adjust the placement of workers. It might be great, but for this experiment, it had no effect at all.

Single-Page-Applications vs Server-Side-Rendering

Here as well, one is not universally better or worse than the other, and the answer to which one to use is “it depends”.

Besides strong differences regarding frameworks and overall architecture, there are also fundamental practical differences for the end user. It’s also fascinating to see history repeating itself: the internet first started with server-rendered pages, then single-page applications with data fetching took over, and now SSR is seeing a resurgence, just like in the past, only with new tech stacks.

SSR is actually the easiest one to explain: you fetch all the required data server side, put everything in a template and return the resulting page to the end user. It takes a bit of time and processing power server-side, is not cacheable, but the client gets a “finished” page.

The SPA does the opposite. Although the HTML/CSS/JS is static and cached (hence quickly fetched), the resources are typically much larger due to all the client-side JavaScript libs needed. Then starts the heavy lifting, where data is fetched and the page rendered, typically while showing a loading spinner. As a result, the total time to render the page is longer.
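
In its simplest form, the SSR idea boils down to “data in, finished HTML out”. The following is a framework-agnostic sketch to illustrate that, not Astro code and not the code of this experiment. The Quote type and the function names are made up.

```typescript
interface Quote {
  text: string;
  author: string;
}

// Escape characters that would otherwise be interpreted as HTML.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Put the already fetched data into a template and return the whole page.
function renderQuotePage(quote: Quote): string {
  return [
    '<!doctype html>',
    '<html><body>',
    `<blockquote>${escapeHtml(quote.text)}</blockquote>`,
    `<p>${escapeHtml(quote.author)}</p>`,
    '</body></html>',
  ].join('\n');
}
```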

However, interacting with the SPA is typically smoother afterwards, because interactions just exchange data with the server and make local changes to the page. In contrast, SSR means navigating and loading a new page. Hence, the choice whether SPA or SSR is more suited depends on how “interactive” the page/app should be.

As a rule of thumb, if it’s more like a static “web page”, go for SSR, if it’s more like an interactive “web app”, go for SPA.

Lastly, the nice thing about Astro, picked here as an illustrative example, is that the whole spectrum is possible: static pages, SPA and SSR.

Sources

The source code of this experiment is here: https://github.com/dagnelies/quoted-day

If you have a GitHub and a Cloudflare account, you can also fork & deploy by clicking here:

If the button doesn’t work, here it is as link instead: https://deploy.workers.cloudflare.com/?url=https://github.com/dagnelies/quoted-day

It will fork the GitHub repository and deploy it on an internal URL so that you can preview it. Afterwards, you can edit the code and it will auto-deploy it, etc.

Note that the example references a KV store that is mine, so you will have to create your own KV store and swap the QUOTES KV id in the wrangler.json file with yours. You will also have to initially fill it with quotes if you want to reproduce the example. Luckily, there are scripts in the package.json to do just that.

Everything beyond this point would deserve a tutorial on its own. This was merely the result of an experiment, how the latencies hold up and some insights on the platform. Enjoy!

Passkeys / WebAuthn Library v2.0 is there! 🎉

Arnaud Dagnelies — Tue, 13 Aug 2024 15:00:07 +0000

Hello folks,

I'm pleased to announce the release of the v2.0 of my WebAuthn library!

This library greatly simplifies the usage of passkeys by invoking the WebAuthn protocol more conveniently. It is open source, opinionated, dependency-free and minimalistic (9kb only).

👀 Demos

These demos are plain HTML/JS, not minimized. Just open the sources in your browser if you are curious.

📦 Installation

Modules (recommended)

npm install @passwordless-id/webauthn

The base package contains both client and server side modules. You can import the client submodule or the server depending on your needs.

import {client} from '@passwordless-id/webauthn'
import {server} from '@passwordless-id/webauthn'

Note: the brackets in the import are important!

Alternatives

For browsers, it can be imported using a CDN link in the page, or even inside the script itself.

<script type="module">
  import {client} from 'https://cdn.jsdelivr.net/npm/@passwordless-id/webauthn@2.0.0/dist/webauthn.min.js'
</script>

Lastly, a CommonJS variant is also available for old Node stacks, to be imported using require('@passwordless-id/webauthn'). Its usage is discouraged though, in favor of the default ES modules.

Note that at least NodeJS 19+ is necessary. (The reason is that previous Node versions had no WebCrypto being globally available, making it impossible to have a "universal build")

🚀 Getting started

There are multiple ways to use and invoke the WebAuthn protocol.
What follows is just an example of the most straightforward use case.

Registration

import {client} from '@passwordless-id/webauthn'
await client.register({
  challenge: 'a random string generated by the server',
  user: 'John Doe'
})

By default, this registers a passkey on any authenticator (local or roaming) with preferred user verification. For further options, see → Registration docs
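
The challenge in the example above is a placeholder. As a rough idea of how a server could produce one, here is a sketch based on the globally available WebCrypto. The helper names are hypothetical, and both the base64url encoding and the length of 32 bytes are assumptions made for this illustration.

```typescript
// Encode bytes as base64url (URL-safe alphabet, no padding).
function toBase64Url(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// Hypothetical server-side helper: a fresh, unpredictable challenge per attempt.
function generateChallenge(byteLength = 32): string {
  const bytes = new Uint8Array(byteLength);
  globalThis.crypto.getRandomValues(bytes); // needs WebCrypto as a global
  return toBase64Url(bytes);
}
```

The server has to remember the challenge it issued (for example in the session), since the same value is needed again as the expected challenge during verification.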

Authentication

import {client} from '@passwordless-id/webauthn'
await client.authenticate({
  challenge: 'a random string generated by the server'
})

By default, this triggers the native passkey selection dialog, for any authenticator (local or roaming) and with preferred user verification. For further options, see → Authentication docs

Verification

import {server} from '@passwordless-id/webauthn'
await server.verifyRegistration(registration, expected)
await server.verifyAuthentication(authentication, expected)

Look at the docs for registration and authentication for the corresponding verification examples.
Or simply interact with real-life examples in the Testing Playground.

⁉️ Why a "Version 2"?

Nobody likes breaking changes, so what's the reason for it? The "Version 2" is not only a complete overhaul of the first version, it also differs from the previous one, mainly regarding its default behavior and "intermediate payloads". That said, it remains similar, striving for simplicity and ease-of-use.

A bit of ❤️ for security keys

Previously, this lib defaulted to using the platform as authenticator if possible. The user experience was improved that way, going straight to user verification instead of intermediate popup(s) to select the authenticator. It was a smooth experience.

This behavior was born in a time when credentials were always device-bound. The terms "synced credentials" and "passkeys" did not even exist when the initial version of this lib was made. Nowadays, most platforms / authenticators sync credentials in the cloud. While this is certainly convenient, the security and privacy guarantees are not as strong as with device-bound credentials. That is why security keys now deserve some love, since they are the only ones providing such strong guarantees.

The new behavior is to let the user choose the authenticator by default. It's also the native protocol's default. We want to give the choice to use security keys by default, since they are more secure.

In order to better support security keys, some defaults also changed: user verification is now preferred, and discoverable credentials are now preferred too. Both should allow a broader set of security keys to be usable by default.

Reflects latest protocol changes

The protocol has evolved over time, and still does. For example, autocomplete with "conditional mediation" was added. It's an addition I'm not a fan of, but it certainly has its merits. It basically lets you trigger authentication using the autocomplete feature of a username input field. This is especially useful because there is no way to know if a credential has already been registered for the user or not. However, be wary that this feature is not universally supported across platforms / browsers / authenticators.

Another recent change is the usage of hints ("security-key", "client-device", "hybrid"), which should replace the authenticatorAttachment and transports properties in the long term. They were kind of wacky anyway. That said, only Chrome supports hints for now, but it's handled in a backwards-compatible way by this library.
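
To illustrate what "backwards-compatible" can mean here, the sketch below sends the hints along with a legacy authenticatorAttachment derived from the most preferred hint. This is my reading of the compatibility guidance in the spec, written as a standalone illustration. It is not necessarily how this library does it, and the function names are hypothetical.

```typescript
type Hint = 'security-key' | 'client-device' | 'hybrid';
type Attachment = 'platform' | 'cross-platform';

// Legacy equivalent of the first (most preferred) hint,
// for browsers that ignore the `hints` property.
function toAuthenticatorAttachment(hints: Hint[]): Attachment | undefined {
  const first = hints[0];
  if (first === undefined) return undefined; // no preference: let the user choose
  return first === 'client-device' ? 'platform' : 'cross-platform';
}

// Build registration options carrying both the new and the legacy property.
function withHints(hints: Hint[]) {
  return {
    hints,
    authenticatorSelection: {
      authenticatorAttachment: toAuthenticatorAttachment(hints),
    },
  };
}
```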

Payloads compatible with other server-side libs

This is another crucial, large change. The response format has been changed completely to be compatible with the output of the PublicKeyCredential.toJSON() method, an official part of the spec that only Firefox implements.

That way, it is possible to use the browser-side of this library and use almost any other server-side library for your favorite platform. Using this intermediate format increases cross-library compatibility in the long term.

🙏 Last words

The original project https://passwordless.id is currently in limbo between proof-of-concept and production-ready. It would simply take more time than I can afford. However, I wanted to wrap up the changes in this library, since the scope is way smaller, and it seems more used by a wider range of people. It also follows the same goal: getting rid of annoying passwords while providing more security. Although I wanted to achieve this goal on a grander scale, this library should at least help others to use passkeys more easily.

...but it's kind of crazy that I do this. Actually, I'm currently on holiday and I should be playing with my kids outside rather than writing this article. Well, I'm still kind of an old nerd too. I just wanted to wrap it up. Consider this lib stable, use it to your heart's content and if you want to help the open-source ecosystem keep its balance, consider sponsoring this project too.

Have a nice week,
Arnaud

What do you think of the look and feel?

Arnaud Dagnelies — Fri, 05 Jul 2024 08:09:46 +0000

Hi there,

I'm about to release the "version 2" of a passkeys lib and I'm currently in the middle of updating the docs too.

What do you think of the current look and feel?

https://webauthn-ciy.pages.dev/

The demos are not yet updated and the "content" is not yet finished, but early feedback is great too. I wonder too if I should put the demos in a separate repository.

Open-Source Exploitation

Arnaud Dagnelies — Mon, 10 Jun 2024 17:57:45 +0000

Hi folks,

This is not my title, but the title of a presentation I saw a few days ago. I wanted to share it with you as I think it is important for the open source community as a whole.

For my part, I agree completely with what he said. What about you?

Passkeys F.A.Q.

Arnaud Dagnelies — Sun, 26 May 2024 19:34:03 +0000

The WebAuthn protocol is more than 200 pages long, it's complex and gets constantly tweaked. Moreover, in reality, browsers and authenticators have their own quirks and deviate from the official RFC. As such, all information on the web should be taken with a grain of salt.

Also, there is some confusion regarding where passkeys are stored because the protocol evolved quite a bit in the past few years. In the beginning, "public key credentials" were hardware-bound. Then, major vendors pushed their agenda with "passkeys" synced with the user account in the cloud. Then, even password managers joined in with synced accounts shared with the whole family for example.

How the protocol works, and its security implications, became fuzzier and more nuanced.

What is a passkey?

Depending on who you ask, the answer may vary. According to the W3C specifications, it's a discoverable public key credential.

If you ask me, that's a pretty dumb definition. Calling any public key credential a passkey would have been more straightforward.

What is an authenticator?

The authenticator is the hardware or software that issues public key credentials and signs the authentication payload.

Hardware authenticators are typically security keys or the device itself using a dedicated chip. Software authenticators are password managers, either built into the platform or as a dedicated app.

Is the passkey hardware-bound or synced in the cloud?

It depends. It can be either and it's up to the authenticator to decide.

In the past, when security keys pioneered the field, hardware-bound keys were the norm. However, now that the big three (Apple, Google, Microsoft) have built it directly into their platforms, software-bound keys, synced with the platform's user account in the cloud, have become the norm. These are sometimes also dubbed "multi-device" credentials.

Can I decide if the created credential should be hardware-bound or synced?

Sadly, that is something only the authenticator can decide. You cannot influence whether the passkey should be synced or not, nor can you filter the authenticators that can be used.

Concerns have been raised many times in the RFC, see issue #1714, issue #1739 and issue #1688 among others (and voice your opinion!).

Are passkeys a form of 2FA?

Not by default. Passkeys are a single step 2FA only if:

The credential is hardware-bound, not synced. Then this first factor is "something you possess".
The flag userVerification is required. Then this second factor is "something you are" (biometrics) or "something you know" (PIN code).

While requiring user verification would be ideal, this also restricts the hardware authenticators that can be used. Not all USB security keys have a fingerprint sensor or PIN.
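
Both conditions can be checked server-side from the flags byte of the authenticator data, which follows the 32-byte RP ID hash. The sketch below reads the UV ("user verified") and BE ("backup eligible") bits. The function name is hypothetical, and it assumes the authenticator data has already been decoded to bytes and its signature verified elsewhere.

```typescript
// Bit masks of the flags byte in the WebAuthn authenticator data.
const FLAG_USER_VERIFIED = 0x04; // UV: the user was verified (PIN, biometrics)
const FLAG_BACKUP_ELIGIBLE = 0x08; // BE: the credential may be synced

const FLAGS_OFFSET = 32; // the flags byte follows the 32-byte RP ID hash

// True only if the credential is not eligible for syncing AND the user was verified.
function isSingleStepTwoFactor(authenticatorData: Uint8Array): boolean {
  if (authenticatorData.length <= FLAGS_OFFSET) {
    throw new Error('authenticator data too short');
  }
  const flags = authenticatorData[FLAGS_OFFSET];
  const hardwareBound = (flags & FLAG_BACKUP_ELIGIBLE) === 0;
  const userVerified = (flags & FLAG_USER_VERIFIED) !== 0;
  return hardwareBound && userVerified;
}
```

Keep in mind that these flags are reported by the authenticator itself, so this is only as trustworthy as the authenticator.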

Are hardware-bound credentials more secure than synced ones?

Yes. When the credential is hardware-bound, the security guarantees are straightforward. You must possess the device. Extremely simple and effective.

When using synced "multi-device" passkeys, the "cloud" has the key, your devices have the key, and the key is in-transit over the wire. While vendors go to great lengths to secure every aspect, it is still exposed to more risk. All security guarantees are hereby delegated to the software authenticator, whether it's built into the platform or a password manager. At best, these passkeys are as safe as the main account itself. If the account is hacked, whether it's by a stolen password, temporary access to your device or a lax recovery procedure, all the passkeys would come along with the hacked account. While it offers convenience, the security guarantees are not as strong as with hardware-bound authenticators.

The privacy concerns are similar. It is a matter of trust in the vendor.

How to deal with recovery when using hardware-bound credentials?

A device can be lost, broken, or stolen. You must deal with it. The most straightforward way is to offer the user a way to register multiple passkeys, so that losing one device does not imply locking oneself out.

Another alternative is to provide a recovery procedure via SMS, TOTP or some other trusted means. Relying solely on a password as recovery is discouraged, since the recovery per password then becomes the "weakest link" of the authentication system.

Discoverable vs non-discoverable?

There are two ways to trigger authentication: with or without a list of allowed credential IDs for the user.

If no list is provided, which is the default, an OS native popup will appear to let the user pick a passkey among the discoverable credentials registered for the website. However, if a credential is not discoverable, it will not be listed.

Another way is to first prompt the user for their username, then fetch the list of allowed credential IDs for this user from the server, and finally call the authentication with allowCredentials: [...]. This usually avoids a native popup and goes straight to user verification.
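
Both flows can be expressed with the same options, differing only in the allowCredentials list. The sketch below builds the JSON form of the request options, where IDs are base64url strings. The function name is hypothetical, and the lookup of credential IDs by username is left out on purpose.

```typescript
interface CredentialDescriptor {
  type: 'public-key';
  id: string; // base64url-encoded credential ID
}

// Empty list: the user picks among discoverable credentials.
// Non-empty list: only the listed credentials can be used.
function buildAuthenticationOptions(challenge: string, credentialIds: string[] = []) {
  const allowCredentials: CredentialDescriptor[] = credentialIds.map((id) => ({
    type: 'public-key' as const,
    id,
  }));
  return { challenge, allowCredentials, userVerification: 'preferred' as const };
}
```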

There is also another indirect consequence for "security keys" (USB sticks like a Yubikey). Discoverable credentials need the ability to be listed, and as such require some storage on the security key, also named a "slot", and these slots are typically fairly limited. On the other hand, non-discoverable credentials do not need such storage, so unlimited non-discoverable keys can be used. There is an interesting article about it here.

Can I know if a passkey is already registered?

No, the underlying WebAuthn protocol does not support it.

A request to add an exists() method to guide user experience has been brought up by me, but was ignored so far. See issue #1749 (and voice your opinion!).

As an alternative to the problem of not being able to detect the existence of passkeys, major vendors pushed for an alternative called "conditional UI" which in turn pushes discoverable synced credentials.

What is conditional UI and mediation?

This mechanism leverages the browser's input field autocomplete feature to provide public key credentials in the list. Instead of invoking the WebAuthn authentication on a button click directly, it will be called when loading the page with "conditional mediation". That way, the credential selection and user verification will be triggered when the user selects an entry in the input field autocomplete.

Note that the input field must have autocomplete="username webauthn" to work.
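
With the native browser API, such a call could be sketched as follows. The function buildConditionalRequest is a hypothetical helper that only builds the request, so the part that actually talks to the browser is shown as a comment.

```typescript
// Request that stays pending until the user picks a passkey
// from the autocomplete list of the input field.
function buildConditionalRequest(challenge: Uint8Array) {
  return {
    mediation: 'conditional' as const,
    publicKey: {
      challenge,
      allowCredentials: [], // empty: only discoverable credentials can be offered
      userVerification: 'preferred' as const,
    },
  };
}

// In a browser, when loading the page (feature-detect first, support varies):
// if (await PublicKeyCredential.isConditionalMediationAvailable?.()) {
//   const credential = await navigator.credentials.get(buildConditionalRequest(challenge));
// }
```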

What is attestation?

The attestation is a proof of the authenticator model.

Note that several platforms and password managers do not provide this information. Moreover, some browsers allow replacing it with a generic attestation to increase privacy.

Do I need attestation?

Unless you have stringent security requirements where you want only specific hardware devices to be allowed, you won't need it. Furthermore, the UX is deteriorated because the user first creates the credential client-side, which is then rejected server-side.
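
For the cases where it is needed, the server-side rejection typically relies on the AAGUID, the identifier of the authenticator model found in the attested credential data. The sketch below only extracts and compares it. A real check would also have to verify the attestation statement itself, otherwise the AAGUID proves nothing. The function names and the allow list are hypothetical.

```typescript
const FLAG_ATTESTED_DATA = 0x40; // AT: attested credential data is present
const AAGUID_OFFSET = 37; // 32 (RP ID hash) + 1 (flags) + 4 (sign counter)
const AAGUID_LENGTH = 16;

// Extract the AAGUID as a lowercase hex string, or null if absent.
function extractAaguid(authenticatorData: Uint8Array): string | null {
  const flags = authenticatorData[32];
  if (flags === undefined || (flags & FLAG_ATTESTED_DATA) === 0) return null;
  const end = AAGUID_OFFSET + AAGUID_LENGTH;
  if (authenticatorData.length < end) return null;
  let hex = '';
  for (let i = AAGUID_OFFSET; i < end; i++) {
    hex += authenticatorData[i].toString(16).padStart(2, '0');
  }
  return hex;
}

// Accept the registration only for authenticator models on the allow list.
function isAllowedAuthenticator(authenticatorData: Uint8Array, allowedAaguids: string[]): boolean {
  const aaguid = extractAaguid(authenticatorData);
  return aaguid !== null && allowedAaguids.includes(aaguid);
}
```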

There was a feature request sent to the RFC to allow/exclude authenticators in the registration call, but it never landed in the specs.

Usernameless authentication?

While it is in theory possible, it faces a very practical issue: how do you identify the credential ID to be used? Browsers do not allow having a unique identifier for the device, since it would be a privacy issue. Also, things like local storage or cookies could be cleared at any moment. But if you have a way to identify the user, one way or another, then you can also deduce the credential ID and trigger the authentication flow directly.

What about the security aspects?

The security aspects are vastly different depending on:

Synced or hardware-bound
User verification or not
Discoverable or not

A hardware-bound key is a "factor", since you have to possess the device. The other factor would be "user verification", since it is something that you know (device PIN or password) or are (biometrics like fingerprint).

Many implementations favor synced credentials with optional user verification though, for the sake of convenience, combined with discoverable credentials. This is even the default in the WebAuthn protocol and what many guides recommend.

In that case, the security guarantee becomes: "the user has access to the software authenticator account". It's a delegated guarantee. Obviously, if the software authenticator (platform account or password manager) is compromised, all passkeys leak, since they are synced.

What about privacy aspects?

Well, if the passkeys are synced, it's like handing over the keys to your buddy, the software authenticator, in good faith. That's all. If the software authenticator has bad intentions, gets hacked, or the NSA/police knock on their door, your keys may be handed over.

Note that if a password manager has an "account recovery" or "sharing" feature, it also means it is able to decrypt your (hopefully encrypted) keys / passwords. Conversely, password managers without a recovery feature usually encrypt your data with your main password. This is the more secure/private option, since that way, even they cannot decrypt your data.

My Passkeys lib, now with authenticator icons

Arnaud Dagnelies — Mon, 26 Feb 2024 21:14:38 +0000

Hello folks!

Since I have little time to take care of Passwordless.ID lately, I wanted to at least publish a small update. So here we go: the WebAuthn library (to enable passwordless login using passkeys) now also delivers more information about the "authenticator", in particular the icon and whether it is a multi-device or device-bound credential. (The latter is actually just interpreting the existing flags.)

👇 Check out the simple demo if you want.

👇 Or the list of authenticators.

👇 Or the playground to experiment with the various options.

⚠️ As a side note, please do not forget that the challenge should be randomly generated server side! I have found repositories on GitHub using hardcoded challenges, which is of course a red flag regarding security since it opens the way to replay attacks!
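
To illustrate the point about challenges, here is a minimal server-side sketch in Python. It is not part of the library: the function names and the in-memory store are hypothetical, and a real server would keep pending challenges in its session or database.

```python
import secrets

# Hypothetical in-memory store of challenges that were issued but not yet used.
_pending_challenges = set()

def new_challenge() -> str:
    """Create a fresh random challenge and remember it until it is used."""
    challenge = secrets.token_urlsafe(32)  # 32 random bytes, base64url-encoded
    _pending_challenges.add(challenge)
    return challenge

def consume_challenge(challenge: str) -> bool:
    """Accept a challenge only once; replayed or unknown values are rejected."""
    if challenge in _pending_challenges:
        _pending_challenges.remove(challenge)
        return True
    return False
```

Because each challenge is random and accepted only once, a captured response cannot simply be replayed later.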

Read here or here if you want a brief introduction of how the WebAuthn protocol behind Passkeys works.

Starting from version 1.4.0+, the parsed registration payload now looks as follows, with extra properties synced, icon_light and icon_dark.

{
  "username": "Arnaud",
  "credential": {...},
  "client": {...},
  "authenticator": {
    "rpIdHash": "T7IIVvJKaufa_CeBCQrIR3rm4r0HJmAjbMYUxvt8LqA=",
    "flags": {...},
    "counter": 0,
    "synced": false,
    "aaguid": "08987058-cadc-4b81-b6e1-30de50dcbe96",
    "name": "Windows Hello",
    "icon_light": "https://webauthn.passwordless.id/authenticators/08987058-cadc-4b81-b6e1-30de50dcbe96-light.png",
    "icon_dark": "https://webauthn.passwordless.id/authenticators/08987058-cadc-4b81-b6e1-30de50dcbe96-dark.png"
  },
  "attestation": "o2NmbXRjdHBtZ2F0dFN0bXSmY2FsZzn__mNzaWdZAQB25KbRrQPjtlx0qZ2Tsvh2YHaPTPTUJznShhr5XnP3zBmVv..."
}

The icons are deliberately provided as links to keep the library compact. Embedding them would bloat it way beyond a megabyte.

These authenticator icons were made thanks to this repository which contains the icons as data urls, sometimes in PNG, sometimes in SVG, sometimes square, sometimes rectangle, some tiny, some really large, etc. So some extra scripting was done to homogenize them as 64x64 PNGs and host them so that each "aaguid" can be shown through a direct link.

Thanks for reading. If you like it, leaving a comment or a star is appreciated. Getting some feedback is always nice.

Have a nice day!

Why I prefer Maven over Gradle

Arnaud Dagnelies — Tue, 20 Feb 2024 07:58:21 +0000

In the Java world, one of the first questions developers encounter is "should I use Gradle or Maven as build tool?". It's a fundamental decision which will stick with you over time. And when googling it, Gradle's biased comparison even pops up as the top search result (at least for me).

At first sight, Gradle looks cool:

Their website looks much nicer and more polished than Maven's
The syntax is much more compact than Maven's verbose XML
Gradle is "newer" while Maven is "older"
Gradle is much faster (according to them)

No wonder people pick it when faced with uncertainty and just wanna get started. So now, let me tell you what's wrong with Gradle IMHO and why Maven is still the better option, even so many years later.

Configuration vs Scripting

Basically, the pom.xml that you define in Maven is a "configuration". You define the name, the version, the list of dependencies, etc. Since it follows a specific schema, with a set of properties to define, you can also look at it visually through a UI for example. It's a declarative definition of your library/app.

In contrast, a Gradle build script is exactly that: a script. It uses the Groovy language, or more recently also Kotlin, to let you write anything you want. Let it sink in: you use a programming language to define what the build should do. You can import other scripts, send an HTTP request to check the current weather and insert a funny UI generated picture in your build artifact.

While they have many aspects in common, their nature ("configuration" vs "script") is what differentiates them fundamentally.

You may think: "Isn't it great if I can do anything with that build script?! It's ultimate freedom!". That is right, but this boon is also a curse.

When I see a Maven project, with a pom.xml, I know what it does and where to find what. It's always the same. Directories, commands to run, changing the version, whatever, it's the same for all Maven projects.

When I see a Gradle project, I have no idea what the build script does. If you don't have a clear documentation ready, you'll have to dive into the build script to actually discover and try to understand what it exactly does.

The price of freedom

It's not rare that you need something specific in your build. In both Maven and Gradle, it's possible to do so, but here also their approach is opposite.

In Gradle, it's straightforward. Since it's scripting, just write whatever you want, you can do anything very easily. Your own build stages, calling functions, using variables, importing some other scripts, whatever. It's easy.

In Maven, it's the opposite. Adding something custom is more difficult. You will have to use a plugin to enable the specific functionality, or even write a plugin yourself if really necessary. While writing a plugin is definitely more work, this kind of also enforces reusability though.

The takeaway here is the same as before. While Maven builds tend to always follow the same build stages and conventions, Gradle builds tend to become more and more complex and customized over time, because it's so easy to "just add a few lines" to the build script. Look at it after a few years and the Maven pom.xml is likely almost as readable as the first days while the Gradle build.gradle script became rocket science.

As an exercise, I picked a random build.gradle file from another team at work to look at it. It had over a thousand lines, and the few dependencies it had were externalized in another file and combined in a fancy way.

In contrast, pick any Maven project, and the list of dependencies will always be in the same place, in the <dependencies>...</dependencies> tag.

History repeating itself. As a side note, it's interesting to see that in the very early days of Java, before Maven was born, build scripts were the norm. In the beginning they were plain shell scripts invoking javac ... to compile the source code, package it, etc. Then came "ant" to do the same in a bit more structured way, but its builds still tended to become customized and complex over time. Then one day came the idea to use a more declarative approach, by defining a project and its dependencies while letting the tool take care of how it is built. Maven was born. Then, some day, Gradle was born, because "I want to customize stuff".

Gradle lies in your face

Now, this is a little grudge of mine against Gradle's marketing habits. When going on their website, they will feature a "Gradle vs Maven" comparison claiming that Gradle is "oh so much faster" than Maven and the following picture.

Now, let's take a closer look...

First, what's shocking is that a "Clean build with tests" is so much faster than the original build! It's almost instant! Including tests! ...let me get this straight: this is not "clean" at all. It's just doing nothing. I find Maven much more sensible in that case, it actually rebuilds everything from scratch. To go a little bit further, a "build" in Maven will just check for changes and compile changed files, which would result in a similar figure, while a "clean build" will remove the whole directory and re-build everything. I find this should be the expected behavior unlike Gradle's "clean build" not cleaning anything. After all, the aim of a clean build is usually to fix issues due to some undesirable thing lying around in the build directory, for whatever reason.

Then, let's look at the normal case: is Gradle really twice as fast? Well, here is another question for you: who compiles the source code? ...got an idea? Well, it's the javac compiler from the JDK, it's not the build tool! So why would Gradle be twice as fast?! Here is the trick: Gradle runs the tests in parallel while Maven does it sequentially. That is the reason! Gradle ain't faster or anything, it just runs the tests in parallel. I dislike this default. It's just a question of time until you get tests having side-effects and race conditions. Then you'll obtain "Heisenberg tests" succeeding sometimes and failing sometimes, depending on how their executions overlap. You'll wonder why and waste lots of time investigating the issue. Moreover, the build usually runs in a background job after commits anyway.

Now, while I dislike Gradle's defaults, what I'm really annoyed about is how they distort the truth. They should say "we run tests in parallel by default and our 'clean' does nothing instead!". That should have been the correct way to put it instead of using their misleading statements insinuating that they compile faster.

Gradle is not simple

For Maven, the scope of dependencies is relatively straightforward:

Compile (the "usual" dependency)
Test (for tests)
Provided (provided at runtime by JDK or a container)
Runtime (quite rare. For drivers or alike available at runtime but not for compiling)

It's enough and I never needed anything else.

Gradle on the other hand has lots of scopes:

api
implementation
compileOnly
compileOnlyApi
runtimeOnly
testImplementation
testCompileOnly
testRuntimeOnly
...a few more deprecated scopes
...a few more classpath scopes
...you can also extend and combine scopes

Well, you basically get it. Gradle is "super-customizable", so much that you often wonder what it exactly does or that you make a mistake without realizing. Gradle sells it as "Maven has few, built-in dependency scopes, which forces awkward module architectures" but IMHO it's Gradle which is confusing and overcomplicated here while Maven has exactly what's actually required.

That is just the tip of the iceberg. But basically Gradle is super-customizable while Maven favors conventions. No wonder Gradle is also a company that thrives on support and training. If it was simple, such things would not sell.

Gradle needs Maven, but not the other way round

Every single library in the Maven Central Repository must have a pom.xml. It's the declarative definition of the library containing name, version, license, etc, and most importantly the list of its dependencies. Without a Maven pom.xml there would be no Central Repository nor dependency management possible.

Whether you use Gradle or Maven, both read the pom.xml Maven definition to build the dependency tree. It's at the core of the dependency management system to pull all transitive dependencies and resolve version conflicts.

In other words, Maven can live without Gradle, but Gradle still needs Maven to exist. Maven just applies a standardized build based on the pom.xml while Gradle builds it in some way and generates a pom.xml as a build step if you want to actually publish your library.

Maven isn't perfect either

Now, I bashed Gradle a lot, but Maven isn't perfect either. It has issues too. Their website sucks IMHO, it could welcome YAML as a more compact alternative format, some plugins should be built-in and the format itself could be tweaked here and there. But overall I find it OK considering it's a format that has lived for more than 20 years.

The other drawback is a lack of flexibility. It's indeed rigid in how it expects your project to be and may become problematic if you need, for example, to mix multiple different techs: building a Node project, running a Python script, etc. as part of the build procedure to place some extra stuff inside the produced artifact. But for that IMHO, it's better to use CI scripts, running as GitHub Actions or GitLab pipelines, to build a "mixed bundle". Let each tech stack build its own artifacts and combine them later through scripting. I favor that approach over pushing the build script customizations too far.

Take it with a grain of salt

While I bashed Gradle and praised Maven, it should be taken with a grain of salt. At the end of the day, they are just tools and either can be used wisely or like a fool.

With Maven too, you can produce "monster pom.xml files" by using tons of plugins and super-complex configurations overriding all defaults. Likewise, Gradle is not necessarily a monster. Use it wisely, keep your build script clean, refrain from adding custom build steps and you will do just fine. It's not bad per se.

It's just that by default, in the hands of average developers, Maven's pom.xml will tend to remain understandable (because it takes effort to escape out of the conventions) while Gradle's build.gradle will tend to become more complex and customized over time (because it's so easy to do so). All the small shortcuts now and little extra steps that stray away from the build conventions tend to become liabilities in the long term.

As said previously, Gradle's great flexibility and customizability of the build is both a boon and a curse. Although I prefer to build "generic" projects where I can, because it's by far simpler to maintain in the long run, using Gradle definitely has its place when you need more specific stuff that requires customization.

TL;DR: as a rule of thumb, Maven's pom.xml tends to remain fairly generic with time, while Gradle's build.gradle leans towards being highly customized and therefore complex. This is due to their "nature": while Maven is based on a rigid project "definition", Gradle is a free-form build "script".

How security and privacy impacts the database design

Arnaud Dagnelies — Tue, 05 Dec 2023 08:07:06 +0000

What I like about the banner picture above is that it is a very useful analogy to our problem at hand. A database is nothing else than a glorified electronic version of it. Lots of small storage boxes containing some data, and identified by a label or number, usually called "primary key".

In security, the main concern is protecting against breaches altogether. However, it is also a good idea to think about mitigating impact in case of breaches. Assuming we deal with some kind of sensitive user data, let's have some thoughts on what should be on the labels and the content of these wooden boxes.

What should be the user "identifier"?

Should it be the e-mail? Or a username? ...

TL;DR: No, it should be an anonymous ID.

How to identify users is something to be decided at the very beginning. It's crucial information that spreads everywhere: in the database, in links, in software logic... Soon, everything will be tied to it and it will be too late to change. Doing so will be extremely difficult in the best case, or impossible in the worst case.

But why should it be an anonymous ID? ...It's not only about data breaches.

That ID is probably going to show up everywhere. In URLs, in logs, in databases, sent to third-party services... So, in the grand scheme of things, privacy-wise, it's better to be anonymized.

Security-wise, using a username/e-mail as the primary ID is not a vulnerability by itself. However, it slightly increases the "attack surface". For example, imagine you have a REST API with some endpoints accidentally not properly controlling access. If they accept a username/e-mail as parameter, it's trivial to invoke these endpoints with another user's username/e-mail as input, since these are fairly easy to find out. On the other hand, using anonymized IDs would already make exploiting such a vulnerability more difficult, since you don't know others' IDs in the first place. This applies not only to REST APIs, but to all kinds of hacking, whether it's digging in a stolen file or eavesdropping on traffic. It does not make your system invulnerable, but it's an additional layer of safety.

Imagine again the wooden boxes above, just don't place names on them for everyone to see but use some other identifier instead. For example, use the SHA256 of the username/e-mail with a salt. That's simple, deterministic and anonymous.
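
A minimal sketch of such an identifier in Python could look as follows. The salt value, the function name and the normalization step (trimming and lowercasing) are assumptions made for illustration.

```python
import hashlib

# Assumption: a secret, application-wide salt loaded from configuration.
SALT = b"replace-with-a-long-random-secret"

def anonymous_id(identifier: str, salt: bytes = SALT) -> str:
    """Derive a deterministic, anonymous ID from a username or e-mail."""
    # Assumption: identifiers are treated as case-insensitive.
    normalized = identifier.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()
```

The same input always yields the same ID, so it can serve as a primary key, while someone who sees only the ID and does not know the salt cannot easily map it back to a username or e-mail.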

What about the stored data?

Most of the time, plain data is read and written directly to the database. The connection is typically protected by credentials to restrict access. This is all good and fine but the main access credentials become critical. If the access credentials are leaked or stolen, all the data can be extracted. Plain and simple.

Depending on the sensitivity of the data, like for example personal or financial data, another layer of protection is required: encrypting the data. Instead of placing the original text sheet in the wooden box, you encrypt it beforehand. Reading and writing encrypted data in the database makes it unreadable to others. Even database administrators would not be able to pry into the stored data, nor any other software not in possession of the encryption key. Likewise, this also increases privacy.

Note that although most databases also feature some encryption mechanisms, these typically apply to the raw data persisted on disk, to make the database storage files unreadable when stolen. The two should not be mixed up and, ideally, both should be used.

TL;DR: always encrypt all personal, financial or otherwise sensitive information when storing it in a database

Hash what you can!

Not only passwords but also recovery codes and other nonces.

As an anecdote, it reminds me of a story about a hacker who found an SQL injection vulnerability, enabling him to pry into the database. Of course, the user passwords were hashed. Even the sensitive data was encrypted. On the other hand, the recovery codes were stored "as is". The hacker now just had to start a recovery procedure for an arbitrary user account, and then look up the corresponding plain text recovery code. Bam! The hacker could now simply complete the recovery procedure and reset the password for any user account. That escalated quickly.

So if you don't really need the plain text data, don't even store it, just a hash of it.
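
As a sketch of that idea for recovery codes, in Python and with hypothetical function names: the plain code is sent to the user once, and only its hash is stored.

```python
import hashlib
import hmac
import secrets

def _hash(code: str) -> str:
    # Plain SHA-256 is adequate here because the code is long and random;
    # user-chosen passwords need a dedicated slow password hash instead.
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

def new_recovery_code():
    """Return (code, code_hash): send the code to the user, store only the hash."""
    code = secrets.token_urlsafe(24)
    return code, _hash(code)

def check_recovery_code(submitted: str, stored_hash: str) -> bool:
    """Hash the submitted code and compare in constant time with the stored hash."""
    return hmac.compare_digest(_hash(submitted), stored_hash)
```

With this in place, someone who can read the database sees only hashes, and submitting a stored hash as the recovery code does not work.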

TL;DR: hash not only passwords but also recovery codes, nonces, challenges and such.

Pairwise identifiers

User accounts are usually identified by some sort of ID. It might be a hash, a UUID, a username, an e-mail, whatever. When some third-party interacts with your service, it'll likely use this ID to identify the user or resource.

This also means that this ID is "universal" and that everyone out there will use this ID as an identifier. A user can be tracked that way, and attack attempts can be carried out that way using the known ID from another party.

A concept to push privacy and security further is called "pairwise identifiers". Each "consumer" of your service will be provided with different user identifiers.

Third-party service XYZ: can you please send me data related to user ABC123 ?

Provider: sure ...let me check ...for you XYZ, the account ABC123 is mapped to "Bob" ...here you go.

Of course, "Bob" is not the username, but should be the internal primary id.

It's about providing a mapping so that each third-party sees different IDs, so that they cannot correlate users among themselves. Also, it prevents leaked IDs from being used by others. As usual, this benefits both privacy and security.

TL;DR: it's best to provide third-party services with anonymized IDs, unique for each third party. Both for privacy and security.
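
Such a mapping can be sketched in a few lines of Python. The class and the in-memory dictionaries are hypothetical; a real service would persist the mapping.

```python
import secrets

class PairwiseIds:
    """Hypothetical in-memory mapping between internal and per-consumer IDs."""

    def __init__(self):
        self._to_external = {}  # (consumer, internal_id) -> external_id
        self._to_internal = {}  # (consumer, external_id) -> internal_id

    def external_id(self, consumer: str, internal_id: str) -> str:
        """Return the ID this consumer sees for the user, creating it on first use."""
        key = (consumer, internal_id)
        if key not in self._to_external:
            ext = secrets.token_urlsafe(16)
            self._to_external[key] = ext
            self._to_internal[(consumer, ext)] = internal_id
        return self._to_external[key]

    def internal_id(self, consumer: str, external_id: str):
        """Resolve an ID presented by a consumer; None if it was not issued to it."""
        return self._to_internal.get((consumer, external_id))
```

Two consumers asking about the same user receive unrelated identifiers, and an identifier leaked from one consumer resolves to nothing when presented by another.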

Good ol' cookies

This last advice is not directly related to the database but rather to how to maintain browser sessions. There are two camps: "use a good ol' session ID cookie" vs "use a JSON Web Token" (In an Authorization header or as cookie). The latter sounds more modern and trendy but it has drawbacks regarding security.

Let's review the good old way of handling sessions first. It's fairly straightforward: set an "HttpOnly" cookie with some random session identifier when the user is authenticated. Voilà, done. The browser will automatically send the cookie on each subsequent request and it cannot be read nor written by scripts. Server-side, you can retrieve the session data based on that ID. Then, when the user signs out, you can remove the cookie and clear your session data. Same if the user is inactive for too long. Simple, effective and there is not much that can go wrong.
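
Here is a rough sketch of that flow using only Python's standard library. The cookie name "session", the function names and the in-memory session store are assumptions; a real server would keep sessions in a database or cache and also expire inactive ones.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical in-memory session store.
SESSIONS = {}

def start_session(user_id: str) -> str:
    """Create a session and return the value for the Set-Cookie header."""
    session_id = secrets.token_urlsafe(32)
    SESSIONS[session_id] = {"user_id": user_id}
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["httponly"] = True  # not accessible to scripts
    cookie["session"]["secure"] = True    # only sent over HTTPS
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()

def _session_id(cookie_header: str):
    """Extract the session ID from a request's Cookie header, if present."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session")
    return morsel.value if morsel is not None else None

def lookup_session(cookie_header: str):
    """Return the session data for a request, or None if unknown."""
    return SESSIONS.get(_session_id(cookie_header))

def end_session(cookie_header: str) -> None:
    """Sign out: forget the session server-side."""
    SESSIONS.pop(_session_id(cookie_header), None)
```

Since the session lives server-side, signing out takes effect immediately: once the entry is removed, the old cookie value is just a meaningless random string.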

Regarding JWT usage, there are two dangers with it. First, if the token which is typically stored browser side is stolen, the attacker can impersonate the user, even long after the user signed out... Except if you keep a database of revoked tokens, which not only loses the main benefit of JWT being "stateless" but also adds undesired complexity.

The second danger is the signing key being stolen, whether it's because of an accident, a vulnerability exploit or a malicious insider. Although this is unlikely because the key should be well protected, the potential consequences of a breach are catastrophic. Basically, your service would be doomed overnight because attackers would be able to impersonate any user at will, bypassing authentication altogether. This is a worst-case scenario that would not be possible for sessions identified by random IDs.
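
To see why the key is so critical, consider this simplified HS256-style sketch in Python. It is not a complete JWT implementation (no expiry or claim validation) and the function names are hypothetical. The point is that verification depends on nothing but the key.

```python
import base64
import hashlib
import hmac
import json

def _b64(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def sign_token(claims: dict, key: bytes) -> str:
    """Build header.payload.signature using HMAC-SHA256."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(claims).encode())
    signature = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(signature)}"

def verify_token(token: str, key: bytes) -> bool:
    """Recompute the signature; no server-side session state is consulted."""
    header, payload, signature = token.split(".")
    expected = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(_b64(expected), signature)
```

Any token signed with the right key verifies, whatever user it names, which is exactly why a leaked key lets its holder impersonate anyone. A random session ID has no such master secret.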

TL;DR: prefer random identifiers in an HttpOnly cookie over using JWTs for user sessions

Doing it "later" is no good

It might be tempting to delay it, since it introduces complexity to the codebase. However, it's no good. The more you delay it, the more effort and time will be consumed in a later stage refactoring anyway. But even more importantly, while some changes can be delayed, others would break the API and data compatibility for all "consuming" software!

In particular, everything regarding hashed user IDs and pairwise identifiers will be breaking all software and integrations relying on these IDs. Changing these is a "hard cut" that should be done ASAP early on, ideally during the conception phase.

Other changes have a less critical impact. For example, hashing of short-lived recovery codes, nonces, challenges, etc. can be done in a version update. This will invalidate existing codes, nonces, and challenges and cause a small disturbance, but everything will work fine again afterwards.

The only change which is delayable with less concern is the data encryption. This is an internal database change, opaque to the API which remains unchanged and the consumers as well. It can be done in one fell swoop by encrypting all the data in the database at once using a background process.

TL;DR: don't delay it, do it ASAP since these are breaking changes

Side benefits

Following these recommendations will increase the security and privacy of your system, but that's not all. Doing this, although it sounds complex, has its benefits too regarding code structure. It forces the software to access user data through a streamlined access point instead of fetching it directly from the database. When all calls for user data go through this access point, it's also easier to monitor, control access and properly handle the data in this one place.

At least, that's what we experienced after our "endless refactoring" at Passwordless.ID. The refactoring is large, and the "version 2" breaks compatibility, requiring us to deploy a separate version and clear all user accounts. It is a very hard cut. However, we are pleased with the result. The system is now more secure, with better privacy, and better structured than ever before. Something I am proud of!

TL;DR: it will even make your codebase structure cleaner!

Thanks for reading! If you liked it, leave a comment!

Endless refactoring ...when things keep piling up

Arnaud Dagnelies — Sat, 18 Nov 2023 09:26:22 +0000

Refactoring is a necessary thing. The reasons are numerous, but most of them have one thing in common: they are underestimated. As the old adage says: "the devil is in the details". And typically these "details" surface during refactoring itself, leading to even more stuff to refactor, sometimes creeping up to dangerous levels.

This is a tale of such refactoring for Passwordless.ID, or rather how a large refactoring was aborted midway in favour of an intermediate solution.

Well, it turns out the refactoring was closer to a whole rewrite. But before delving into the details, let's check the "why" it was done, since the goals are what drive such refactoring.

Why?

The big refactoring started mainly as a follow-up to UI, API, DB: pushing "three-tier architecture" too far?.

Before, front-end and back-end were in two separate repositories, with their own lifecycle. However, from a workflow perspective, having both the API and UI in the same repository is more convenient. Changes always go hand-in-hand, you have a single URL for previews, no CORS necessary, a single pipeline and deployment, etc. The code is almost the same, it's just more convenient to work with a single repository.

The second goal was improving user experience, and specifically reducing "time-to-rendering" as much as possible. Until now, this was slowed down by:

130kb assets loading due to frontend framework (not that much, but way larger than it could be)
"Waterfall loading" (load initial assets, run script, fetch dynamic content, load other stuff, render it all ...duh, feels sluggish)
Cross-origin requests (adds one more "preflight" HTTP round-trip)

The combination of these things resulted in rendering times above a second when not cached, sometimes even two if the CDN caches also missed. In other words, it was slow despite being a super simple page.

Lastly, some structural changes to improve security and privacy would require changes on the data level. This includes changing the primary ID from the username to a hash and another round of encryption for the stored data. This would break the compatibility with existing data. Please note though that the privacy and security of the existing version is by no means compromised. This is about being paranoid and security/privacy in-depth by adding one more protection layer.

How?

There were three main areas related to the refactoring.

Combining both codebases with same domain previews
Making the UI more lightweight
Adapting the data layer

These are just a few sentences but represent lots of work. In particular, the UI refactoring. Making this more lightweight is easier said than done. Investigating alternatives and shifting away from the larger Vue framework turned out to be quite the burden.

The mistake made

We proceeded as follows

Merge both codebases
Refactor the backend code while we are at it 🙄
Investigate frontend framework alternatives 😅
Try out various frontend techs😬
Start porting UI🥵

What started as a refactoring was turning more and more in the direction of a full rewrite.

The refactoring of the backend code was a bit time-consuming but worth it IMHO. It's slightly more lightweight and forcing a proper file structure is a good thing. It makes the code more organized and neater. You can directly pinpoint API endpoints to source code files and the related middleware without even looking at the code.

On the other hand, exploring the vast space of technologies available to replace the front-end was a bottomless pit. Even the first steps of refactoring the UI became a huge time sink. It's not only comparing the few major frameworks, it's even broader, related to the methodology: ranging from SPA (single-page applications), to SSR (server-side rendering), to SSG (static site generation), to bundling tools for plain HTML/TS/JS/CSS. Each of these methodologies has its own ecosystem and tools, with many frameworks to compare. Last but not least, many frameworks are hybrids and can be "configured" differently to span a range of SPA/SSR/SSG.

In retrospect, it was a big mistake. It slowly but surely evolved into a complete rewrite. The UI refactoring is a huge endeavour that should have been left untouched and made separately. Not because it should not be done, but because it should not have been done now.

How it should have proceeded

The whole order in which we started the "big refactoring" was wrong. Of course, it wasn't originally expected to be that time-consuming either, it was just "let's refactor that". In hindsight, we should have paid much more attention to the priorities and planned the refactoring accordingly. It should have been done in smaller steps, one after another, and in an order that makes sense according to the priorities.

In particular, the UI-related one should have been done last. Of course, from a marketing and business point of view, one might disagree. The holy UI/UX would take priority, with the slogan "make it nice first, with a slicker UI which loads super fast". Something that "looks awesome" ...but that would hide and delay upcoming breaking changes, which would just frustrate early adopters later on. I'm against these kinds of short-term wins at the price of long-term liabilities. The sooner the breaking change is done, the better. That is also why any kind of promotion is on hold until this time-consuming refactoring is completed.

What should have been prioritized is merging both codebases into one for single domain deployment, and refactoring the data schema to use hashes as IDs and add one more encryption layer. The reason to handle them first is that they are breaking changes, which make the new version incompatible with the original first version.

Merge both codebases
Ensure single domain dev previews work fine
Apply schema changes for hashed IDs
Add extra encryption round

Once this is done, the new version can be published. And afterwards, the UI and back-end related "improvements" can take place. Those which take more time but keep compatibility.

To abort or to persevere?

We're now stuck "in the middle" of the rewrite including the new UI, so it's kind of annoying to drop the work in progress halfway. On the other hand, it looks like the road ahead is still quite long. So, although it's annoying to leave the work half done, pausing it is better for Passwordless.ID as a whole. The sooner the "v2" is released, the better.

Also, what we did not investigate was how to make the existing UI lighter. We aren't talking about excessive amounts here, it's ~130kb gzipped JS/CSS over the wire. But for a simple sign in/up page, it's still unnecessarily large. After some experimenting, it turns out that ~100kb can be spared, just by dropping the bootstrap-vue-next lib in favour of using plain Bootstrap classes directly and a few workarounds for special components. That the lib came with that much "baggage" and could not be properly "tree-shaken" came as a surprise. Dropping it, the page goes from ~130kb to ~30kb, already much nicer.

The full UI rewrite will still take place later on, to be even more efficient and internationalized. But for now, we will make the bare minimum UI changes and focus on the breaking change first, which will hopefully be completed sooner.

Stay tuned!

UI, API, DB: pushing "three-tier architecture" too far? 🤔

Arnaud Dagnelies — Sun, 17 Sep 2023 08:11:51 +0000

What may look ideal in theory, may turn out cumbersome in practice.

-- Myself

During inception, the Passwordless.ID "app" was built in the purest form of a three-tier architecture.

The UI - A Vue app "compiled" into a single-page-application
The API - An API built with Cloudflare Workers
The DB - A distributed DB as a service

In particular, each "tier" was completely independent, built with its own tech stack and deployed on a dedicated subdomain.

The UI: https://ui.passwordless.id
The API: https://api.passwordless.id
The DB: internal network

This complete separation of "tiers" may look ideal. It seems to be full of advantages.

There is a clear separation of concerns
Each tier can be updated independently
One could make changes to the UI without affecting the API and vice-versa.
They can scale independently
The UI is fully browser cached
Each "tier" (UI, API, DB) could theoretically be swapped out with another tech in the long term...

This might seem like an exemplary separation of concerns, something ideal to strive for. It's also what we strived for during inception. Sounds great right? Well, it turns out it's not that great ...it's actually pretty bad.

Decoupling back-end and front-end sucks

Being able to update and make changes to UI and API independently sounds great, but in practice, it turns out it's very rarely needed.

The vast majority of the time, when you work on something, whether it is a new feature, a change or a bug, you typically update mostly the UI and API hand in hand. You change files in both, test both together, and deploy both.

In the daily routine, having two repositories, toolchains and domains to deploy to turns out to be counter-productive. It'll lead to two commits, two builds, a "joint deploy", etc. It's not that big of a deal either, it's just annoying.

Cross-origin requests suck

Because the UI and API are on two distinct subdomains, requests between the UI and the API are "cross-origin requests": different subdomains count as different origins too. Configuring the API to allow such "cross-origin requests" is no big deal when you know how it works, but some developers may find it cumbersome.

The subtle disadvantage is that it introduces one more round-trip: the preflight request. These requests are automatically sent by the browser to check if the requests from UI to API are allowed, before sending the actual requests. While not dramatic, it makes the UI slightly less responsive since it doubles the first request's latency.

Lastly, it also has an impact on session handling and security aspects due to its cross-origin nature. However, that's a whole other topic itself.
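To make the preflight round-trip a bit more concrete, here is a minimal sketch of how an API could answer it. It assumes the standard `Request`/`Response` objects that Cloudflare Workers provide; the `ALLOWED_ORIGINS` list, the function names and the header values are illustrative choices, not the actual Passwordless.ID code.

```javascript
// Origins allowed to call the API (illustrative value).
const ALLOWED_ORIGINS = ["https://ui.passwordless.id"];

// Build the CORS headers for a given origin, or null if it is not allowed.
function corsHeaders(origin) {
  if (!ALLOWED_ORIGINS.includes(origin)) return null;
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    // Lets the browser cache the preflight answer (value in seconds).
    "Access-Control-Max-Age": "7200",
    "Vary": "Origin",
  };
}

// Answer preflight requests directly, add CORS headers to the others.
// In this sketch, requests without an allowed Origin header are rejected.
function handleRequest(request) {
  const headers = corsHeaders(request.headers.get("Origin"));
  if (!headers) return new Response("Origin not allowed", { status: 403 });
  if (request.method === "OPTIONS") {
    return new Response(null, { status: 204, headers });
  }
  return new Response(JSON.stringify({ ok: true }), {
    headers: { ...headers, "Content-Type": "application/json" },
  });
}
```

The `Access-Control-Max-Age` header is the interesting bit here: it allows the browser to reuse the preflight answer for a while, so only the first call pays for the extra round-trip.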

Latency sucks

You probably know it, when you develop locally, everything is snappy. The page loads instantly and you are happy ...and once it's deployed, you notice that the experience on your phone in "real life" is not that great, especially before all the browser caching kicks in.

The slowness comes from various things:

the loading of assets from the SPA
the UI pre-flight requests to the API
the UI actual requests to the API
the API calls to the distributed DB
the "rendering" of the page content

Each of these things makes the UI sluggish and is a consequence of the distinct tiers being fully separated. To be more precise, it is because network calls are involved between all parts, each one increasing the latency and sluggishness by a notch.

Moreover, you typically don't notice this "sneaky" behaviour during the initial phase of development. When testing locally, everything appears lightning fast since everything occurs locally without network latency. Often, the sluggishness introduced by the network calls is only discovered when the first prototype is deployed and going "live".

That is why an application which combines everything locally, or in the same subnet, is usually much more responsive than having distinct "tiers" like UI / API / DB each separated by a network, which is common in a SaaS world.

Distinct subdomains suck

Whether you want to show your users a feature preview, provide a developer sandbox, make A/B testing or reproduce some bug in real-life conditions, "staging" environments are always useful.

If both the UI and API are packaged in the same app, deploying it at a single domain like https://prod.passwordless.id would be straightforward. Then, you could also work on a feature branch and deploy it to https://new-feature.passwordless.id to test it out in a live environment.

However, this becomes much more complex if you have it split. It would become something like this:

It also requires some plumbing, so that the new feature UI talks with the corresponding API URLs, including adjusting CORS properly. This is extra work and is error-prone.

If the UI and API were bundled together in the same (sub)domain, that would not be a problem since relative URLs could simply be used and CORS would not be involved either.
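As a small illustration of the relative URL argument, the sketch below resolves the same API path against two different deployments. The `/api/session` path and the `apiUrl` helper are made up for the example; in a browser, the second argument would simply be `location.origin`.

```javascript
// Resolve an API path against the origin the page was served from.
function apiUrl(path, pageOrigin) {
  return new URL(path, pageOrigin).href;
}

// The same code works unchanged on production and on a preview deployment.
console.log(apiUrl("/api/session", "https://prod.passwordless.id"));
console.log(apiUrl("/api/session", "https://new-feature.passwordless.id"));
```

No configuration is needed to point a preview UI to its matching API, and since both share one origin, no CORS setup is needed either.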

SPAs (sometimes) suck

SPAs like Vue, React or Angular are not bad. You have plenty of libraries with all kinds of widgets and fancy stuff. You can just "magically" quickly generate whole apps with some initializer. But it has a cost too: the learning curve, the complex toolchain, the clunky dependencies ...and the initial page load time due to larger size and rendering delays.

It's a tradeoff. While SPAs typically have a longer initial loading time, they offer complex widgets and increased interactivity in return. It offers ways to structure complex web applications in a modular way to keep their complexity under control. All these things are great ...if you need them. Otherwise, when you just need a few basic pages, it would likely turn into useless overhead.

In the end, whether SPAs "make sense" totally depends on the app. The more complex and user interaction-heavy the app is, the better suited it will be for SPAs. However, in the case of Passwordless.ID, which has a relatively simple UI, it was counter-productive.

Doing it as a Vue SPA was great to get started quickly, but by now, it hinders me more than it benefits me. The UI library used was bug-ridden, the various toolchains between UI, API and deployment platform do not always play well together, and the resulting bundle is 400kb big and would cost time and effort to reduce. The bad resulting latency is the nail in the coffin. Good ol' HTML ain't that bad after all.

Back to the basics

Lately, there is a renaissance of good ol' server-side templating. The basics are making a comeback under two umbrella names: SSR (server-side rendering) and SSG (static site generation). SSG means templating at build time, like "generating" pages in various languages, while SSR means templating with dynamic data at request time, like showing the result of a database query. Both have their roles and are complementary.

It's just going back to the forgotten roots of the web, noticing that after all, it's quite handy to produce HTML with the right data inside directly. It's simple, fast and "slim". This is a contrast to the SPAs which typically inflate, requiring larger JS assets, fetch the data in a second step and add rendering delays.
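To illustrate the difference, here is a minimal sketch where a single template function is used both ways: once at build time (SSG) and once per request (SSR). All names, page titles and data below are made up for the example and not tied to any particular framework.

```javascript
// Escape text before placing it into HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// One template, used by both approaches.
function renderPage({ lang, title, items }) {
  const list = items.map((item) => `<li>${escapeHtml(item)}</li>`).join("");
  return `<!doctype html><html lang="${escapeHtml(lang)}"><body>` +
    `<h1>${escapeHtml(title)}</h1><ul>${list}</ul></body></html>`;
}

// SSG: runs once at build time, e.g. one page per language.
function buildStaticPages(translations) {
  const pages = {};
  for (const [lang, title] of Object.entries(translations)) {
    pages[`${lang}/index.html`] = renderPage({ lang, title, items: [] });
  }
  return pages;
}

// SSR: runs on every request, with data fetched for that request.
function renderOnRequest(queryResult) {
  return renderPage({ lang: "en", title: "Your devices", items: queryResult });
}
```

In both cases the browser receives ready-to-display HTML with the data already inside, without a second fetch or a client-side rendering step.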

Sadly, the ecosystem is very fragmented in this area.

Porting software sucks

It would be foolish to leave the "architecture" decision purely to theoretical arguments. Moreover, it usually involves switching or at least adapting the technology stack. As such, its ecosystem plays a crucial role. If you go against the "intended usage" of your tech stack, you may fight an uphill battle not worth it.

In particular, at the time of Passwordless.ID's inception in mid-2022, Cloudflare Pages Functions simply did not exist yet. As such, packaging both together (with Cloudflare) was not even possible at that point. Pages Functions appeared later, by the end of 2022. It certainly would have been possible to use a more traditional technology stack at that time. However, the deployment and scalability comfort from a Cloudflare Pages/Workers combo was what pulled us over. It was a pragmatic choice rather than the ideal tech stack.

The point is that it would now be possible to combine the back-end and front-end in a single codebase. Is it worth porting the existing codebase? Tough question. I'd say "probably" but it's a substantial effort. The main issue is that in our case, the whole ecosystem around Cloudflare Pages Functions is very young. It lacks tooling, libraries, Q&A, documentation and so on. It is "bleeding edge" right now.

Let's make it suck less

Do you know what also sucks? Authentication. It sucks for users (because they create oh so many accounts), it sucks for developers (because it's so complex) and it sucks for security (because passwords are vulnerable).

So at least, let's try to make authentication suck less and use Passwordless.ID. Think of it as a free universal account, a free public service, that makes your life as a developer easy, is comfortable for the users and is more secure.

In the meantime, we'll start porting the "tiered" app to a "bundled" app, making it even better, swifter and more lightweight. Thanks for reading and stay tuned!

Randomly Generated Avatars

Arnaud Dagnelies — Wed, 19 Jul 2023 08:10:24 +0000

As a follow-up from an earlier article regarding the update to randomly generated default avatars for Passwordless.ID, I wanted to post a "how to". This is a beginner tutorial since making such avatars is actually really simple.

TL;DR; Here is the full demo.

The image format

The first thing you should think about is the image format, usually, one of:

Jpeg: great for real user photos due to the high compression ratio. However, this compression also produces some "blur" on lines and sharp edges. As such it is not ideal for the avatars we are going to make.
PNG: these have lossless compression. In other words, every pixel remains exactly the same as it was originally drawn. Edges and lines remain "sharp".
SVG: these are scalable vector graphics. Unlike a "raster of pixels", it is a declarative format describing shapes and paths.

Of course, you could also save it as a "100% quality" Jpeg to avoid any quality loss, but then it is larger than PNGs. Jpeg compression is amazing though for common photos.

In our case, we picked SVG for the upcoming avatar pictures. In the past, SVG was kind of avoided because it was not always well supported on all software platforms. This is however largely a thing of the past.

SVG offers several benefits: the first is being scalable. Due to its vector nature, it is perfectly sharp at any scale, even if you zoom in on a 4K display. Other "raster" formats like Jpeg or PNG become "pixelated" when zooming in. The other is being more compact. While the byte size of Jpeg/PNG grows with picture size, SVG grows proportional to the shape's complexity. For relatively simple stuff like the avatars here, they are super compact.

The SVG "template"

SVG is an XML format that describes the shapes. As such, what will be generated is a big XML string. To be more exact, we will fill the template below with the appropriate values.

  <svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
    <defs>
      <linearGradient id="gradient" x1="${startX}" y1="${startY}" x2="${endX}" y2="${endY}">
        <stop offset="10%" stop-color="hsl(${startHue}, 100%, 50%)" />
        <stop offset="90%" stop-color="hsl(${endHue}, 100%, 50%)" />
      </linearGradient>
    </defs>
    <rect x="0" y="0" width="100" height="100" fill="url(#gradient)" />
    <text x="50" y="55" text-anchor="middle" dominant-baseline="middle" font-size="75" font-family="Times New Roman" fill="#ffffff">${char}</text>
  </svg>

Once this template is filled with meaningful values, you will obtain an avatar SVG image that can be stored as a plain normal ".svg" file.

Alternatively, you can also deliver it as a "data URL" since it is quite compact. This simply means encoding the resource directly in the URL instead of using a "plain URL" to fetch it. It is composed of two parts: the mime-type (image/svg+xml in this case) and the (base64 encoded) data.

This can be used like any other URL in the src attribute of an image as follows: <img src="data:image/svg+xml;base64,{{the-base64-encoded-svg}}">

Voilà, you got your image!
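Filling the template and turning it into a data URL could look roughly like the sketch below. The function names `avatarSvg` and `toDataUrl` are chosen for this example, and the escaping of the character and the UTF-8 handling are small additions so that initials like "&" or non-ASCII letters do not break the output.

```javascript
// Fill the SVG template with the given values.
function avatarSvg({ startX, startY, endX, endY, startHue, endHue, char }) {
  // Escape characters that would otherwise break the XML.
  const safeChar = String(char).replace(/&/g, "&amp;").replace(/</g, "&lt;");
  return `<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <defs>
    <linearGradient id="gradient" x1="${startX}" y1="${startY}" x2="${endX}" y2="${endY}">
      <stop offset="10%" stop-color="hsl(${startHue}, 100%, 50%)" />
      <stop offset="90%" stop-color="hsl(${endHue}, 100%, 50%)" />
    </linearGradient>
  </defs>
  <rect x="0" y="0" width="100" height="100" fill="url(#gradient)" />
  <text x="50" y="55" text-anchor="middle" dominant-baseline="middle" font-size="75" font-family="Times New Roman" fill="#ffffff">${safeChar}</text>
</svg>`;
}

// Encode the SVG as a base64 data URL (UTF-8 safe).
function toDataUrl(svg) {
  const bytes = new TextEncoder().encode(svg);
  const binary = Array.from(bytes, (b) => String.fromCharCode(b)).join("");
  return "data:image/svg+xml;base64," + btoa(binary);
}
```

The result of `toDataUrl(avatarSvg({...}))` can then be assigned directly to the `src` of an image element.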

Getting some random values

The missing step is now filling this SVG template with some random values. Alternatively, if you want something more deterministic, you could also use the hash value of the name for example.

As you saw in the SVG template, instead of using RGB colors, HSL colors were used. This stands for Hue-Saturation-Lightness. This makes it easy to generate bright colors from all rainbow colors, with maximal color saturation and average "lightness".

  // Gradient colors
  const startHue = Math.round(Math.random() * 360);
  const endHue   = Math.round(Math.random() * 360);

  // Gradient direction (angle in radians)
  const angle = Math.random() * 2 * Math.PI;

  // Calculate the start and end points of the gradient
  const startX = (Math.cos(angle) + 1) / 2;
  const startY = (Math.sin(angle) + 1) / 2;
  const endX = 1 - startX;
  const endY = 1 - startY;

  // The character to appear on the avatar
  // (`name` is the user's name, assumed to be defined beforehand)
  const char = name.charAt(0).toUpperCase();

For the gradient direction, it's a bit more tricky since an angle cannot be provided directly. There are some "transforms" available, but to ensure the widest compatibility with SVG renderers, sticking to the basics seems a safe bet. As such, the angle is converted to starting and ending coordinates for the gradient.
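For the deterministic variant mentioned earlier, one possible approach is to hash the name and use the hash as the seed of a small pseudo-random generator. The sketch below uses FNV-1a and mulberry32, two well-known tiny algorithms picked here for brevity; they are an assumption for this example, not what the demo uses.

```javascript
// FNV-1a: a small 32-bit string hash (applied to UTF-16 code units here).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// mulberry32: a tiny seeded generator returning values in [0, 1).
function mulberry32(seed) {
  let state = seed | 0;
  return function () {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same computations as above, but the same name always gives the same avatar.
function avatarParams(name) {
  const random = mulberry32(fnv1a(name));
  const startHue = Math.round(random() * 360);
  const endHue = Math.round(random() * 360);
  const angle = random() * 2 * Math.PI;
  const startX = (Math.cos(angle) + 1) / 2;
  const startY = (Math.sin(angle) + 1) / 2;
  return {
    startHue, endHue, startX, startY,
    endX: 1 - startX, endY: 1 - startY,
    char: name.charAt(0).toUpperCase(),
  };
}
```

Since the start and end points are mirrored around the center (0.5, 0.5), the gradient always passes through the middle of the picture, whatever the angle.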

Thank you

The resulting full source code can be seen in the example provided at the beginning. :)

A Pen by Arnaud Dagnelies (codepen.io)

Extracting addresses from OpenStreetMap

Arnaud Dagnelies — Fri, 19 May 2023 08:19:52 +0000

Why?

Because there is no worldwide quality source for addresses!

Really, that's no joke. There are many commercial providers for "industrialized countries" of variable quality/pricing but worldwide coverage is lacking, the data formats are diverse and the license terms provider specific.

There are also some open source projects related to addresses though, each with their gotchas. Two of these projects are mentioned in the last section "Honorable Mentions" at the end of this article, along with their drawbacks.

This project was born in order to provide quality addresses with worldwide coverage under an open license, by directly extracting addresses from the raw data dumps of OpenStreetMap.

Birth of OpenStreetData.org

What does it look like? Here is a screenshot, but if you prefer, check out the website directly.

It is divided into two parts: extracts and addresses. Another "points of interest" was planned, but not further developed due to lack of time.

Country dumps

Country extracts are provided in two formats:

PBF, the native OpenStreetMap binary format. This format is very compact and many tools can handle it efficiently. Nevertheless, it is not always very practical to handle due to its low-level nature. It's basically a huge list of points with IDs, lines that reference these IDs and relations that reference the lines.
GeoJSON sequences. It's a text file where each line is a "feature": a JSON object with arbitrary properties and a geometry with coordinates. Although the file size is typically larger and the processing sometimes slower, it offers other benefits. The JSON format is universal, the line-based sequence makes it straightforward to filter it with grep-like tools and the geometry can be parsed directly without having to go through the whole file.
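To show why the line-based format is handy, here is a minimal sketch that filters features out of such a text. The sample lines are made-up data, and `filterFeatures` is a name chosen for this example. For real country files, you would read the file line by line as a stream rather than loading it as a single string.

```javascript
// Parse line-delimited GeoJSON and keep the features matching a predicate.
function filterFeatures(text, predicate) {
  const kept = [];
  for (const rawLine of text.split("\n")) {
    // Some writers prefix each record with the ASCII "record separator".
    const line = rawLine.replace(/^\x1e/, "").trim();
    if (!line) continue;
    const feature = JSON.parse(line);
    if (predicate(feature)) kept.push(feature);
  }
  return kept;
}

// Tiny inline sample (made-up data, for illustration only).
const sample = [
  '{"type":"Feature","properties":{"addr:housenumber":"17"},"geometry":{"type":"Point","coordinates":[16.37,48.2]}}',
  '{"type":"Feature","properties":{"highway":"residential","name":"Example Street"},"geometry":{"type":"LineString","coordinates":[[16.37,48.2],[16.38,48.21]]}}',
].join("\n");

// Keep only the features carrying a house number.
const houses = filterFeatures(sample, (f) => "addr:housenumber" in f.properties);
console.log(houses.length); // 1
```

Each line can be handled on its own, so memory usage stays flat no matter how large the file is.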

Note though that both formats are not 100% equivalent. During the conversion process, some choices were required to be made. In particular, in the original PBF a "closed line" (where the last point is the same as the first) could be interpreted as a line or as an area either way. There is no clear-cut indication whether it's a "line" happening to turn in a circle, like a roundabout, or a polygonal "area", like a building outline. This led to "closed lines" being interpreted as lines or polygons based on a lot of hand-picked feature properties. For example, if building=... was part of the properties, it was considered a polygon, except if an area=false tag was also present, and so on.
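Such a decision rule could be sketched as follows. The `AREA_KEYS` list is a tiny illustrative subset, not the hand-picked list actually used, and accepting `area=no` next to `area=false` is an assumption since both spellings may occur in the data.

```javascript
// Illustrative subset of tag keys that suggest an area (not the real list).
const AREA_KEYS = ["building", "landuse", "leisure", "natural"];

// Decide whether a way should become a polygon (true) or stay a line (false).
function isArea(coordinates, tags) {
  const first = coordinates[0];
  const last = coordinates[coordinates.length - 1];
  const closed = coordinates.length >= 4 &&
    first[0] === last[0] && first[1] === last[1];
  if (!closed) return false;
  // An explicit area tag wins over the heuristics.
  if (tags.area === "false" || tags.area === "no") return false;
  if (tags.area === "yes") return true;
  // Otherwise, guess from the other properties.
  return AREA_KEYS.some((key) => key in tags);
}
```

A closed way tagged `building=yes` would become a polygon, while a roundabout tagged `highway=...` would stay a line.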

Administrative areas

Despite not being shown on the site, extracting precise boundaries of a country's provinces, regions, counties, cities, suburbs and so on was the first crucial step. How a country is subdivided into smaller areas varies greatly from country to country and is abstracted under the name "administrative areas" of various levels.

This step is crucial because of the way addresses are extracted. Streets and houses were extracted using "spatial joins" with the administrative areas. Their coordinates were used to determine which administrative areas (city, county, province...) they belong to, as well as the postal code ...if postal code areas are defined in the country.
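The core of such a "spatial join" is a point-in-polygon test. Below is a minimal sketch using the classic ray casting technique on simple rings. The `joinPoint` helper and the area objects are made up for this example; a real pipeline also has to deal with multipolygons, holes, and a spatial index to avoid testing every area.

```javascript
// Ray casting: count how many polygon edges a horizontal ray from the point crosses.
function pointInRing([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses = (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Collect the names of all areas containing the point, keyed by level.
function joinPoint(point, areas) {
  const result = {};
  for (const area of areas) {
    if (pointInRing(point, area.ring)) result[area.level] = area.name;
  }
  return result;
}

// Made-up areas: a "province" square containing a smaller "city" square.
const areas = [
  { level: "province", name: "P", ring: [[0, 0], [10, 0], [10, 10], [0, 10]] },
  { level: "city", name: "C", ring: [[2, 2], [4, 2], [4, 4], [2, 4]] },
];
console.log(joinPoint([3, 3], areas)); // { province: 'P', city: 'C' }
console.log(joinPoint([5, 5], areas)); // { province: 'P' }
```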

Currently, the main reason for missing (or wrong) addresses in some countries is improper mapping of the administrative areas.

Streets

Here is an example of the "streets" for Austria:

suburb	country	state	province	city	postal_code	street_name
Abtsdorf	AT	Oberösterreich	Bezirk Vöcklabruck	Attersee am Attersee	4864	Abtsdorf
Abtsdorf	AT	Oberösterreich	Bezirk Vöcklabruck	Attersee am Attersee	4864	Altenberg
Abtsdorf	AT	Oberösterreich	Bezirk Vöcklabruck	Attersee am Attersee	4864	Attergauer Landesstraße
Abtsdorf	AT	Oberösterreich	Bezirk Vöcklabruck	Attersee am Attersee	4864	Attersee
Abtsdorf	AT	Oberösterreich	Bezirk Vöcklabruck	Attersee am Attersee	4864	Atterseestraße
...	...	...	...	...	...	...
	AT	Vorarlberg	Bezirk Feldkirch	Marktgemeinde Rankweil	6830	Wüstenrotgasse
	AT	Vorarlberg	Bezirk Feldkirch	Marktgemeinde Rankweil	6830	Zehentstraße
	AT	Vorarlberg	Bezirk Feldkirch	Marktgemeinde Rankweil	6830	Zieglerweg
	AT	Vorarlberg	Bezirk Feldkirch	Marktgemeinde Rankweil	6830	Zunftgasse
	AT	Vorarlberg	Bezirk Feldkirch	Marktgemeinde Rankweil	6830	Übersaxner Straße

168769 rows × 7 columns

It extracted all streets having a name from the raw data and determined the administrative areas and postal code they belong to according to their centroid. As such, it is a slightly simplified street list. If a street crosses multiple cities or postal codes, for example, it will solely be listed in the "main one" (according to its center). For more precise addresses, see below.

Note that "suburb" may be empty depending on the size of the city. This is normal since not all cities are further divided into suburbs.

Houses

Houses is a dataset listing each house (anything with a house number) individually, including its coordinates and the administrative areas it lies within.

Addresses

In this case, the houses are "merged" into streets with house numbers. Unlike the "streets" approach, it results in a more fine-grained dataset.

it includes only streets with at least a single house (number)
it differentiates between street sections with house number ranges belonging to different administrative areas or postal codes
it differentiates between different sides of the street (with odd/even house numbers) belonging to different administrative areas or postal codes
it has boundaries

Here is an example of such an address file for Austria.

	postal_code	city	street	x_min	x_max	y_min	y_max	house_min	house_max	house_odd	house_even
0	1010	Vienna	Weihburggasse	16.375769	16.375769	48.205242	48.205242	26	26	True	True
1	1010	Wien	Abraham-a-Sancta-Clara-Gasse	16.362970	16.363213	48.209789	48.209910	1	2	True	True
2	1010	Wien	Akademiestraße	16.370855	16.372425	48.200877	48.203575	1	13	True	True
3	1010	Wien	Albertinaplatz	16.368138	16.369344	48.204084	48.204750	1	3	True	True
4	1010	Wien	Alte Walfischgasse	16.371740	16.371740	48.203559	48.203559	9	9	True	True
...	...	...	...	...	...	...	...	...	...	...	...
147137	9991	Gemeinde Dölsach	Waidachweg	12.825955	12.827117	46.830659	46.831055	4	9	True	True
147138	9991	Gemeinde Dölsach	Wenzl PLatz	12.841072	12.841634	46.826521	46.826902	1	3	True	True
147139	9992	Gemeinde Iselsberg-Stronach	Großglockner Straße	12.841043	12.858008	46.833271	46.854501	1	206	True	True
147140	9992	Gemeinde Iselsberg-Stronach	Iselsberg	12.835091	12.855994	46.833822	46.846260	5	212	True	True
147141	9992	Gemeinde Iselsberg-Stronach	Stronach	12.849133	12.858230	46.826562	46.833270	2	63	True	True

146322 rows × 11 columns

It may not be perfect, for example, the first line with a misinterpreted city name is quite mysterious.
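The "merging" of houses into streets with house number ranges could be sketched roughly like this. It is a simplified illustration with made-up input fields (`number`, `x`, `y`): it groups by postal code, city and street only, and leaves out the odd/even sides and the finer administrative areas.

```javascript
// Group houses by postal code, city and street, then summarise each group.
function mergeHouses(houses) {
  const groups = new Map();
  for (const house of houses) {
    const number = parseInt(house.number, 10); // "17a" -> 17
    if (Number.isNaN(number)) continue;        // skip non-numeric house numbers
    const key = [house.postal_code, house.city, house.street].join("|");
    let group = groups.get(key);
    if (!group) {
      group = {
        postal_code: house.postal_code, city: house.city, street: house.street,
        x_min: Infinity, x_max: -Infinity, y_min: Infinity, y_max: -Infinity,
        house_min: Infinity, house_max: -Infinity,
      };
      groups.set(key, group);
    }
    // Extend the bounding box and the house number range.
    group.x_min = Math.min(group.x_min, house.x);
    group.x_max = Math.max(group.x_max, house.x);
    group.y_min = Math.min(group.y_min, house.y);
    group.y_max = Math.max(group.y_max, house.y);
    group.house_min = Math.min(group.house_min, number);
    group.house_max = Math.max(group.house_max, number);
  }
  return [...groups.values()];
}
```

A street crossing two postal codes ends up as two rows, each with its own bounds and house number range, which is what makes this dataset more fine-grained than the plain street list.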

Challenges

"Big Data"

Dealing with large data is challenging. It's not thousands of points, it's not millions, it's many billions of points, lines, polygons and relations.

Seems like a detail? Well, for example, you cannot even load the planet's data at once in memory. It's simply too big.

You cannot just "do as you please" with inefficient code. Every line of code, every operation, must be crafted with care, well thought out, and fine-tuned to keep processing time and memory to a minimum.

As an example, just for processing the data of a single country, even 32 GB of RAM is not enough for larger countries and it takes many hours with the current code, despite best efforts.

Producing precise country extracts

There are sites like geofabrik.de providing country extracts to download. However, they turned out to be not precise enough for me. They use "simplified country border polygons" that are "cutting corners" and therefore missing addresses in areas near the borders. So I had to "split the planet" myself.

To do so, the first step was to extract exact country boundaries. Interestingly, these might change over time. Usually, it's minor modifications like slightly adjusting the border or correcting mistakes. But sometimes the border might move a bit more in "unstable" parts of the world. The point here is that these borders are not "definitive" but evolve slightly over time.

The next step is splitting the world into country extracts. Here again, it cannot be naively done in a single step. Doing so, even 256 GB of RAM would not suffice to split it all at once. So the splitting must be done in multiple steps: first in continents, then in regions, then in countries so that it fits in a "reasonable" amount of memory.

And cutting whole continents with a super precise boundary constituted from millions of points is not efficient either. On the other hand, computing the total bounds of the continent is pointless too. For example, the outer bounds of just France would cover almost the whole world since it possesses many islands around the world as part of its territory. You get the point, some extra work must be done to simplify the geometry without losing stuff but without including too much either.
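One possible way to get a cheap pre-filter without the "whole world" bounding box problem is to keep one bounding box per polygon part (mainland, each island...) instead of a single total one. The sketch below only illustrates this idea with made-up coordinates; it is an assumption about how it could be done, not a description of the actual code.

```javascript
// Bounding box of a single polygon ring: [minX, minY, maxX, maxY].
function ringBounds(ring) {
  let minX = Infinity, minY = Infinity, maxX = -Infinity, maxY = -Infinity;
  for (const [x, y] of ring) {
    minX = Math.min(minX, x);
    minY = Math.min(minY, y);
    maxX = Math.max(maxX, x);
    maxY = Math.max(maxY, y);
  }
  return [minX, minY, maxX, maxY];
}

// One box per part of a multi-part territory.
function partBounds(parts) {
  return parts.map(ringBounds);
}

// Cheap test: could the point possibly lie inside the territory?
function mayContain(boxes, [x, y]) {
  return boxes.some(([minX, minY, maxX, maxY]) =>
    x >= minX && x <= maxX && y >= minY && y <= maxY);
}

// Made-up territory: a "mainland" and a far away "island".
const boxes = partBounds([
  [[0, 0], [10, 0], [10, 10], [0, 10]],
  [[100, 0], [101, 0], [101, 1], [100, 1]],
]);
console.log(mayContain(boxes, [5, 5]));  // true
console.log(mayContain(boxes, [50, 5])); // false
```

The point in the middle of the ocean between both parts is rejected right away, while a single total bounding box would have included it. Only the points passing this test need the expensive check against the precise boundary.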

Then, there are ways or area relations that cross boundaries. Some things from the raw data are not always clear whether it's a "closed line" or an "area", and so on. It's full of technical details which make even producing what look like simple "country extracts" challenging.

Heterogeneous data

The OpenStreetMap raw data is not a homogeneous, clearly defined dataset. It is a huge amount of points, lines and relations, each with completely arbitrary properties. For example, a statue might be a point with metadata indicating when it was built, and by whom, along with some tourist guide number. Depending on where you look on the map, you may also notice different habits of mappers using a diversified arsenal of "tags" to describe things. The community as a whole also has different opinions on how to do things, for example with addresses, which often have local flavours.

If you dig into the raw data of OpenStreetMap, you will find interesting things. For example, you will find tags like addr:street=... and addr:city=... which sound promising. These are also very simple (and quick) to extract since they are attached directly to the data. Great, right? Well, it would be if this data was complete, but it is far from it. Depending on the country you are looking at, these tags might be widespread or barely used. Even if they are there, the coverage and the content are usually quite fuzzy. For example, the street might be named "Wall Street" on one building while another building uses "Wall St.". Likewise, the city in one building may be "N.Y. City" while another uses "New York". Postcodes may also be written on individual houses, but not match the postal boundary accurately, etc. This makes processing these tags directly error-prone. It's better than nothing but there are ways to make it better.

That is what we did: spatial joins of house/street points with administrative/postal areas in order to extract as much information as possible. If those areas are not mapped, the tags are used, but only as a fallback since they are usually not that precise.

Manual labor

Doing this is quite some work. It's not just running a process and being done with it. It's craftsmanship where you change a few lines of code and manually inspect the results. Just checking if more streets/houses/addresses are produced is not sufficient either. It might be that the output is of worse quality because street names are duplicated or listed in the wrong "areas" or some other data mistakes. It might also be that the "couple of lines change" works perfectly for one country but breaks in another because of local differences, like for example the presence or absence of postal codes.

Sometimes, you also see odd things in the data. When this happens you usually spend some time to investigate "why" it is so. Is it the raw map data that is strange? Is it some situation you did not think of? Is there a bug in the code? Is some third-party library not working as expected?... It's really full of weird things, from buildings on the map having mistakenly used "national boundaries" tags to sudden performance drops in third-party libraries when calling a certain function.

Addresses are crazy

Below, I will illustrate how addresses are crazy. It's not something that is homogenous worldwide. It's full of regional quirks.

Is it a country or not?

You may think that something as basic as countries and boundaries is clear-cut. But it is not. Take Kosovo for example.

For half of the world (marked in red), Kosovo constitutes a province of Serbia, while for the other half of the world (marked in blue) Kosovo is recognized as an independent country.

...and that's not unique to Kosovo. There are plenty of regions in the world where territory is disputed, where borders shift with local wars and where sovereignty depends on who you ask.

What stance do I take here? I simply use the list of countries as defined by the United Nations, identified by their ISO 3166-1 country codes. It might not be ideal, but it is pragmatic.

A city with many borders

On a small scale, borders can be crazy too. For example, check out the little town of Baarle-Nassau, located in the south of the Netherlands, near the Belgian border.

This town contains 22 small exclaves of the Belgian town Baarle-Hertog, some of which contain counter-exclaves of Nassau. The borders cross streets in the middle, sometimes multiple times, and a single house might have a Belgian address on one side and a Dutch address on the other. As you see, extracting addresses can quickly become challenging. ;)

A city center without street names

Not all addresses are based on "streets". Take a look at Mannheim for example. There, the city center is divided like a big grid.

There, each "block" has an identifier, like "C3", while the streets are unnamed. Likewise, house numbers do not belong to a "street" but to a block. In other words, your address might be "C3, 17" if you live in building 17 of block "C3".

Fancy house numbers

Do you want to use regexes to filter valid house numbers? Well, that might not really work out. For example, the following image is a valid Vietnamese house number, near the coast of Ho Chi Minh City.

The world is full of surprises. Also regarding addresses, it's full of diversity and local quirks and I believe there is nothing that does not exist.

Honorable mentions

There are also two notable open source projects trying to bring addresses to the public domain.

openaddresses.io

This is probably the most famous one. It works by running various "scraping scripts" against various "raw data sources". The result has a few drawbacks though, directly related to this approach.

the licensing is problematic. Basically, it says "use this data according to the license from the data source" ...which is not obvious, since the original issue is not directly linked, often in native language and the licensing terms make the usage of this data questionable.
the coverage is lacking
the scraping sometimes breaks or is outdated because of changes in the raw source
despite being "open", some things are obfuscated and make reproducing or direct downloads difficult

osmnames.org

Although less known, this is IMHO a better source of addresses. It is based on address extraction from OpenStreetMap and therefore has worldwide coverage and a homogeneous license: the "ODbL - Open Database License".

The only drawbacks it has IMHO are:

the lack of postal codes
the admin_level mapping is not ideal

The lack of postal codes may seem like a detail, but it is crucial for addresses. Without it, addresses are simply incomplete. Since this project is open source, I also tried to improve it by adding postal codes (see issue) but it turned out to be too difficult/challenging for me, mostly because I am unfamiliar with PostGIS. The code, however, is of good quality. This led me to the current project.

The second issue is more subtle, and leads to missing addresses in some countries because administrative boundaries are not properly mapped. Also, the code is not suited for experimentation and from my understanding, there were ways to "get more" out of the raw OpenStreetMap data dumps than how they did it.