Overview
For an internal database, we needed to implement some form of text-based search to help users locate documents. The goal was to keep it lightweight and hassle-free: easy to set up, extend, and maintain, and ideally something we could set up once and then forget about. It's okay if it costs a bit of computing power, but agility, quick iteration, usability, and robustness are the top concerns.
Summary
It's surprising how much we can do without a server. Most serverless or static site hosts are happy to serve anything from a few MB to a few GB for free, often even distributed across the globe. For any particular individual, if the search experience is truly customized to their needs, it doesn't take many resources to deliver a satisfying one.
Let's illustrate this with an example. Suppose that, for any webpage, the information we care about includes the following:
- Title
- Abstract
- URL
- Keywords (weighted)
- Tags
- Last Modification Time
- Webpage Language
If we cap the abstract at about 200 bytes (a sentence or two) and assume the remaining fields add up to roughly 50 bytes, then each website we index takes about 250 bytes to store. 1KB fits four entries, 1MB fits about 4,096 entries, and 3MB fits about 12K entries. If a blog updates daily, the most it can produce is 365 entries per year, so 3MB is enough to hold one year's worth of content for more than 30 blog-sized websites.
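To make this concrete, here is a minimal sketch of what a single index entry could look like in TypeScript. The field names and the example values are purely illustrative, not a required schema, and the size check at the end is only a rough sanity test of the 250-bytes-per-entry estimate.

```ts
// A hypothetical shape for one index entry; field names are illustrative only.
interface IndexEntry {
  title: string;
  abstract: string;                 // capped at ~200 bytes in the estimate above
  url: string;
  keywords: Record<string, number>; // keyword -> weight
  tags: string[];
  lastModified: string;             // ISO 8601 timestamp
  language: string;                 // e.g. "en"
}

// Rough sanity check on the ~250 bytes/entry figure.
const example: IndexEntry = {
  title: "A post about client-side search",
  abstract: "How far a static, serverless index can take you.",
  url: "https://example.com/posts/client-side-search",
  keywords: { search: 3, serverless: 2 },
  tags: ["search", "static"],
  lastModified: "2024-01-01T00:00:00Z",
  language: "en",
};
console.log(JSON.stringify(example).length); // on the order of a few hundred bytes
```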
If we go a bit crazy, a few hundred megabytes of memory consumption is still reasonable. Say the payload is 300MB: that's 100 × 30 ≈ 3,000 blog-sized websites for one year. This means that, without a server, and with everything distributed as static content, a user's browser can happily serve indexed website data locally for thousands of websites. Considering that most of the time we dwell on just a few dozen sites, and that recency usually plays a big role, this leaves plenty of room for imagination.
The Indexing
Call it an indexer or a web crawler. In this step, we gather information about every website we care about and condense it into a compact format such as a database, CSV, or JSON.
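A minimal sketch of that step might look like the following, assuming Node 18+ (for the global `fetch`) and using naive regexes in place of a real HTML parser. A production crawler would use a proper parser and respect robots.txt; everything here is illustrative.

```ts
// Minimal indexing sketch: fetch a page and condense it into one JSON record.
// Assumes Node 18+ (global fetch); the regex "parsing" is only for illustration.
async function indexPage(url: string) {
  const res = await fetch(url);
  const html = await res.text();

  const title = /<title[^>]*>([^<]*)<\/title>/i.exec(html)?.[1]?.trim() ?? url;
  const abstract =
    /<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i.exec(html)?.[1] ?? "";
  const language = /<html[^>]*\blang=["']([^"']+)["']/i.exec(html)?.[1] ?? "unknown";

  return {
    title,
    abstract: abstract.slice(0, 200),                  // keep entries small
    url,
    keywords: {},                                      // filled in by a later pass
    tags: [],
    lastModified: res.headers.get("last-modified") ?? "",
    language,
  };
}

// Crawl a handful of URLs and write the condensed index as JSON.
const urls = ["https://example.com/", "https://example.org/"];
Promise.all(urls.map(indexPage)).then((entries) =>
  console.log(JSON.stringify(entries)),
);
```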
Conceptually, the job of a web crawler could be greatly simplified if every website simply reported what it had; that is supposedly the purpose of a sitemap. In practice, though, it's surprising how much effort people spend on so-called SEO, simply because there's no straightforward way to submit such information. Then again, perhaps that's a good thing: we don't want website creators to fool the search engine by submitting dishonest summaries.
One major challenge with web crawling, besides parsing and potential blocking by website hosts, is rendering JavaScript. Nowadays a lot of content is dynamic, meaning it's not directly available in the plain HTML: the page may not require a server per se, but much of the visible content is rendered only at runtime. For instance, a dictionary app may only display content once a search keyword is entered.
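One common workaround is to render the page in a headless browser before extracting anything. The sketch below uses Puppeteer as an example (any headless-browser tooling would do); the wait condition and timeout are assumptions that will need tuning per site.

```ts
// Render a JavaScript-heavy page before indexing it.
// Assumes the `puppeteer` package is installed (npm install puppeteer).
import puppeteer from "puppeteer";

async function renderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is quiet so client-side rendering has a chance to finish.
    await page.goto(url, { waitUntil: "networkidle0", timeout: 30_000 });
    return await page.content(); // the rendered DOM, not the raw HTML response
  } finally {
    await browser.close();
  }
}
```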
The Searching
Database search, plain/full-text search, fuzzy search, vector search, advanced queries: there are so many possible ways to organize data and perform searches, and so many libraries to choose from, that it's surprising some websites and software still ship such poor built-in search.
At this stage we don't have absolute recommendations yet, but suffice it to say: heavyweight platforms like Elasticsearch, or even a traditional database, are completely unnecessary for the majority of applications. Plain full-text search is efficient enough, and plenty of lightweight JavaScript or WASM libraries exist to speed it up, such as Fuse.js, Lunr.js, and FlexSearch. There are also databases built specifically for search, such as Manticore Search, not to mention the many vector databases now available for AI embedding workloads.
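For example, an in-memory fuzzy search over entries like the ones sketched earlier takes only a few lines with Fuse.js. The keys, threshold, and sample entries below are illustrative placeholders, not tuned values.

```ts
// In-memory fuzzy search over the condensed index, using Fuse.js.
// Assumes `fuse.js` is installed; the options and sample data are illustrative.
import Fuse from "fuse.js";

// Trimmed-down entry type for this sketch.
interface IndexEntry {
  title: string;
  abstract: string;
  url: string;
  tags: string[];
}

function buildSearch(entries: IndexEntry[]) {
  return new Fuse(entries, {
    keys: ["title", "abstract", "tags"],
    threshold: 0.3,      // lower = stricter matching
    includeScore: true,
  });
}

const fuse = buildSearch([
  { title: "Client-side search", abstract: "Search without a server.", url: "https://example.com/a", tags: ["search"] },
  { title: "Static hosting", abstract: "Serving a few MB for free.", url: "https://example.com/b", tags: ["hosting"] },
]);

console.log(fuse.search("serverless search").map((r) => r.item.url));
```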
The Hosting
As mentioned, the simplest search engine frontend is just a plain website that fetches the index data while loading and performs the search in memory on the client side, with no server beyond static file hosting. Again, it's worth stressing how powerful this can be: not only versatile and easy to deploy, but also sufficient for most small to medium workloads. For typical payloads in the 3–5MB range, and up to a few hundred MB at the extreme, this is more than enough. We can fit a lot within such constraints.
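A hedged sketch of such a frontend follows, assuming the index is published as a static `index.json` next to the page and that the page contains an `<input id="q">` and a `<ul id="results">` (both hypothetical). Plain substring matching stands in for the fuzzier libraries mentioned above.

```ts
// Browser-side sketch: fetch the static index once, then search it in memory.
// Assumes a static index.json next to the page, plus <input id="q"> and <ul id="results">.
interface IndexEntry { title: string; abstract: string; url: string; }

async function main() {
  const entries: IndexEntry[] = await (await fetch("index.json")).json();

  const input = document.getElementById("q") as HTMLInputElement;
  const results = document.getElementById("results") as HTMLUListElement;

  input.addEventListener("input", () => {
    const q = input.value.toLowerCase();
    // Plain substring matching; swap in Fuse.js/Lunr.js/FlexSearch for fuzzier results.
    const hits = q
      ? entries.filter((e) => (e.title + " " + e.abstract).toLowerCase().includes(q))
      : [];
    // Render the top hits (HTML escaping omitted for brevity).
    results.innerHTML = hits
      .slice(0, 20)
      .map((e) => `<li><a href="${e.url}">${e.title}</a></li>`)
      .join("");
  });
}

main();
```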
Top comments (1)
I found that webmasters can go to Google Search Console and Bing Webmaster Tools to submit sitemaps.