Forem: Adam Tauber

Hister: The Most Privacy-Respecting Search Engine

Adam Tauber — Thu, 07 May 2026 10:19:39 +0000

Privacy and web search are in constant tension. Every time you type a query into a search engine you are handing over something valuable: a window into what you are thinking. Most people accept this trade-off without much thought. This post breaks down the real privacy risks at every layer of the search stack and explains exactly where Hister fits in.

We will take a look at different privacy issues related to online search services, metasearch engines and opening search results. Then examining Hister's solutions to these problems.

Privacy problems with web search

Online search services

Popular search engines like Google, Bing, and their derivatives are advertising businesses. Your search queries are the raw material. They are stored, profiled, cross-referenced with your browsing history, location, and demographics, and used to model your behaviour far beyond any single search session.

The deeper structural issue is unverifiability. Even a search engine advertises it as privacy respecting, it cannot be audited. You cannot confirm whether queries are truly deleted, whether there is no logging, whether their system isn't compromised, whether your IP address is separated from your query log, or whether anonymised data is actually anonymised. You are asked to trust a black box operated by an organisation whose commercial interests are usually directly served by retaining as much data about you as possible.

Self-hosted metasearch engines

Self-hosted metasearch engines like SearXNG (I'm the original author of Searx btw) and similar projects are a meaningful step forward. By routing your queries through your own server, they decouple significant amount of metadata from the query as seen by Google or Bing. That is a real and valuable guarantee.

But the guarantee has a hard ceiling: metasearch engines are permanently dependent on external search providers. Every query still leaves your infrastructure and travels (potentially both) to Google, Bing, or other providers. Those providers see a query, a timestamp, and an IP address (even if it is your server's IP rather than your personal IP). If the upstream provider correlates queries from the same source IP over time, or if your metasearch instance is the only one making requests from that IP, the anonymisation shrinks considerably.

Another important disadvantage is that metasearch engines can provide no protection against data leakage through the search queries. The search terms are always forwarded to the external providers even if the search query contains sensitive data.

Visiting search results

This privacy surface is rarely discussed when talking about search engine privacy, but it is significant.

When you click a search result you visit a website that does not know you arrived from a private search engine. That website may load dozens of third-party trackers, advertising networks, analytics platforms, social widgets each of which observes your visit and can correlate it with your identity across the web. Many of these trackers operate at the network layer and cannot be blocked at the browser level without breaking the page.

Beyond passive tracking, pages can be tempered, can contain malicious scripts, credential-harvesting forms, or drive-by exploits. Before you visit a page you have no way to inspect it. The act of visiting is, in itself, an exposure.

How Hister addresses each layer

A fully local, self-contained index

Hister indexes content you choose to index: pages you visit via the browser extension, URLs you crawl explicitly, or local files on your machine. The index lives entirely on your own hardware. There is no remote server, no third-party cloud storage, no sync service. A query never leaves your infrastructure.

This eliminates the entire trust problem that applies to online services. There is nothing to verify because there is no external party involved. Your query log is a file on your machine that you control completely.

No external search provider calls

Unlike metasearch engines, Hister does not call Google, Bing, or any other search provider at query time. The search runs entirely against the local index. There is no outbound network request triggered by a search query, not even to a self-hosted upstream.

This solves the biggest privacy issue of metasearch engines. Your queries produce zero external network traffic.

Offline previews as a tracker firewall

Hister's most distinctive privacy feature is its offline preview. When a page is indexed, its readable content is stored locally. When you open a result in preview mode, you read the locally stored content.

This means you can read a result, follow an idea, and return to the page days later without the remote server ever knowing you visited. Trackers embedded in the page never execute. Third-party scripts never load. You are completely invisible to the origin and to every analytics or advertising service the site uses.

This is a qualitatively different protection from anything a browser extension or DNS-level blocker can offer, because the page content simply never reaches the network.

Hister's honest limitations

No tool offers unlimited capability, and Hister is no exception.

Index coverage. Hister can only search what it has indexed. A conventional search engine has crawled hundreds of billions of pages; Hister has crawled whatever you have pointed it at. For exploratory searches on topics you have never researched before, the local index may come up empty.

Indexing exposes you. Building the index requires visiting pages. When the browser extension records a page you visit, or when you run hister index against a URL, a network request goes to the origin server. The privacy protections apply after indexing, not during it. If you index a hostile page, that page can observe the visit. Hister mitigates this with support for a Chromedp backend and configurable headers and cookies, but the fundamental exposure during indexing cannot be eliminated entirely.

Where does Hister stand?

	Online search	Metasearch	Hister
Query leaves your machine	Yes	Yes (via proxy)	No
Dependent on external index	Yes	Yes	No
Tracking when reading results	Yes	Yes	No (offline preview)
Index coverage	Comprehensive	Comprehensive	Limited to what you index
Verifiable privacy	No	Partial	Yes, free software, self-hosted

Hister is not the right tool for every search. It is the right tool when privacy is non-negotiable, when you need to be certain, not just hopeful, that your searches stay private.

If that matters to you, try the demo or follow the quickstart guide to run your own instance in minutes.

How I Use Hister

Adam Tauber — Wed, 29 Apr 2026 13:44:19 +0000

In this post I'd like to share how I (@asciimoo - the author of Hister) use Hister to maximize my productivity and privacy.

The tools I mention are interchangeable: the concepts apply equally well to different environments and operating systems.

The Search Workflow

My core workflow comes down to three steps:

Open Hister as quickly as possible
Check with the fewest keystrokes whether the result is already in Hister
Fall back to a traditional search engine if nothing relevant is found

Let me walk through each step.

1. Opening Hister Instantly

Setting Hister as your browser's default search engine is a good start, but it still requires you to switch to a browser window, open a new tab, and hit Enter before results appear.

A much faster approach is to bind a global hotkey in your window manager that jumps directly to your browser and opens a new Hister tab. With this setup, a two-key combination always puts a fresh search prompt in front of you.

I use the i3 window manager. The relevant config line is:

bindsym Mod4+s exec xdg-open "http://127.0.0.1:4433/"

xdg-open (part of the freedesktop.org xdg-utils package) opens its argument with whichever application is registered for that URL scheme: typically your default browser.

My default URL-opening application for xdg-open is a small shell script:

#!/bin/sh
chromium --incognito "$1"
i3-msg "workspace web"

This spawns a new browser window (or tab, if a browser is already running) and then tells i3 to switch to my browser workspace. If you use multiple browsers, use this instead of the last i3-msg command:

sleep 0.1 && i3-msg "[urgent=latest] focus"

This waits briefly for the new window to appear and focuses it by urgency hint, ensuring you land in the right browser regardless of which one opened.

Note: xdg-open is fully optional. You can configure your hotkeys to directly manage focus and browser opening. I prefer using xdg-open to get consistent behaviour across all my applications.

Even if your setup is completely different, the principle is the same: bind a hotkey that opens a new browser tab with Hister and focuses it.

2. Making Searches Efficient

There are three things I rely on to keep searches fast inside Hister.

Learning the Query Language

The query language lets you quickly narrow down results. I regularly use field filters, exclusions, and synonyms to get precise results in fewer keystrokes: for example, domain:github.com -type:local indexer to find GitHub pages about the indexer while excluding local files.

Search Aliases

I define aliases for recurring search patterns. There are two flavours I use:

Synonyms: when a topic has multiple common names, I use one as the alias. For example, go resolves to (go|golang) so I always find both spellings without thinking about which was used on a given page.

Targeted filters: for context-specific searches I use a ! prefix to make the alias distinct from regular words. A good example is !hi ("Hister issues"), which resolves to url:https://github.com/asciimoo/hister/issues/*. Typing !hi indexer instantly lists all Hister GitHub issues mentioning the indexer.

Aliases like these compress multi-word filter expressions into a single short token, which makes repeated searches faster.

Keyboard Navigation

Hister's hotkeys let you move through results and open the readability view (alt+v by default) entirely from the keyboard. The readability view renders a clean version of the page directly inside Hister, so I can often get the information I need without ever leaving the search interface.

Configure the hotkeys to match your habits: the defaults are a reasonable starting point (especially for vim users).

3. Falling Back to External Search

Hister adds overhead to the search workflow when the information you need isn't in your index. My goal is to make this overhead as small as possible.

There are two distinct situations:

1. You Know Before Searching That It Won't Be in Hister

Start or end your query with !! and press Enter. Hister immediately redirects to your configured external search engine with the same query. This works both from the Hister interface and from the browser's URL bar when Hister is set as the default search engine. The only overhead is two extra characters.

2. You Discover Mid-Search That the Result Isn't There

Press the configured hotkey (Alt+o by default) or click the Web link below the search input. This opens the current query in your configured search engine without you having to retype it. You avoid switching to the search engine's page and re-entering the query manually.

Keeping the Index Clean

Aside the search workflow optimiziation, efficiency can be increased with a well maintained index.

A growing index is only useful if it stays relevant. Two habits help with this.

Skip Rules

Not every page you visit is worth indexing. E.g. social media feeds add noise without adding value. I use skip rules to prevent them from landing in the index in the first place.

I like to pay attention in general to my index and add skip rules for patterns of noise.

Pruning Stale Entries

Even with good skip rules, the index accumulates entries that become irrelevant: accidentally opened content, documentation for libraries I no longer use or pages for projects I abandoned. I do pruning every time I discover useless content in my index using the delete command:

hister delete "domain:old-framework.io"
hister delete "url:https://jobs.example.com/*"

The --dry flag lets you preview what would be deleted before committing:

hister delete --dry "domain:old-framework.io"

Pre-indexing Reference Material

The browser extension indexes pages as you visit them, which means documentation you have never opened is invisible to Hister. I close this gap by using the crawler to pre-index reference material I expect to look up repeatedly.

hister index --recursive --allowed-pattern=pkg.go.dev/some/library https://pkg.go.dev/github.com/some/library

This crawls the library's documentation and adds everything to the index.

I use this for:

API and library documentation for tools I use regularly
Project wikis and internal documentation
Long-form reference pages I know I will return to

Conclusion

A bit of upfront configuration (a global hotkey, a handful of aliases, skip rules, and familiarity with the query syntax) makes Hister significantly more effective as a daily tool. Pre-indexing reference material and periodically pruning stale entries keeps the index sharp as it grows.

I'll keep refining the workflow and I'm always interested in how you use it.

Share your tips, use cases, and ideas on GitHub, Discord, Codeberg, or in #hister on IRCNet.

Data Indexing in Golang

Adam Tauber — Thu, 16 Apr 2026 18:57:14 +0000

If you need fast, content-based retrieval of large amounts of documents, your best option is to use a full-text indexer. Popular solutions like Elasticsearch and Meilisearch are more than capable of getting the job done. But what if you don't want to depend on an external service, or if you need a higher level of control over how your data is stored and searched?

Luckily, Go has an excellent library for exactly this purpose: Bleve. Bleve lets you quickly index any Go struct with sensible defaults and a built-in Google-like query language. Or you can go further and build your own query language and customize every single detail of the indexer.

Bleve is a file-based indexer that can handle millions of records. It supports concurrent reads and writes, hot-swapping of indexes, match highlighting, and much more.

Hister is built on top of Bleve and uses a wide range of its features: custom field mappings with language-specific analyzers, a hand-crafted query language with per-field boosting, cursor-based pagination, multi-language index aliases, and fine-grained Scorch tuning. The examples through this post are inspired from our codebase and the knowledge we collected during the development.

Creating a Simple Indexer

Getting started with Bleve only takes a few lines of code. The two core operations are indexing (storing a document so it can be searched later) and querying (retrieving ranked documents that match a search expression).

package main

import (
    "fmt"
    "log"

    bleve "github.com/blevesearch/bleve/v2"
)

// Document represents the data we want to index and search.
type Document struct {
    Title string
    URL   string
    Text  string
}

func main() {
    // Create a new index on disk. If one already exists at that path, open it.
    mapping := bleve.NewIndexMapping()
    index, err := bleve.New("example.bleve", mapping)
    if err != nil {
        index, err = bleve.Open("example.bleve")
        if err != nil {
            log.Fatal(err)
        }
    }
    defer index.Close()

    // Index a handful of documents. The first argument is a unique ID;
    // the second is any Go value, Bleve will reflect over its fields.
    docs := map[string]Document{
        "1": {Title: "Go Programming", URL: "https://go.dev", Text: "Go is an open source programming language that makes it easy to build reliable software."},
        "2": {Title: "Bleve Search", URL: "https://blevesearch.com", Text: "Bleve is a full-text search and indexing library for Go."},
        "3": {Title: "Hister - Your own search engine", URL: "https://hister.org/", Text: "Full-text search across your files, browsing history and beyond."},
    }

    for id, doc := range docs {
        if err := index.Index(id, doc); err != nil {
            log.Printf("failed to index %s: %v", id, err)
        }
    }

    // Query the index. NewMatchQuery performs a full-text search across
    // all indexed fields and ranks results by relevance score.
    query := bleve.NewMatchQuery("Hister search engine")
    req := bleve.NewSearchRequest(query)
    req.Fields = []string{"Title", "URL"} // which stored fields to return
    req.Size = 10                         // maximum number of hits

    results, err := index.Search(req)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Found %d result(s):\n", results.Total)
    for _, hit := range results.Hits {
        fmt.Printf("  [%.4f] %s  %s\n", hit.Score, hit.Fields["Title"], hit.Fields["URL"])
    }
}

A few things to note:

bleve.New vs bleve.Open New creates a fresh index at the given path; Open opens an existing one. The pattern shown above. Try New, fall back to Open on error is the idiomatic way to handle both the first run and subsequent runs with the same index directory.
bleve.NewIndexMapping() Returns a default mapping that works well out of the box: text fields are tokenized, lowercased, and stop-word filtered using the English analyzer. You can replace this with a custom mapping when you need more control (see the Mappings section below).
Automatic field discovery Bleve uses reflection to inspect your struct. Every exported field is automatically tokenized and made searchable with no extra configuration. Unexported fields are silently skipped.
Unique document IDs The string ID you pass to Index is how you identify documents for updates and deletes. Calling Index with an ID that already exists replaces the previous document in place, making it safe to re-index pages that have changed.
SearchRequest.Fields By default Bleve returns only document IDs and relevance scores to keep responses lean. Specify the field names you want returned in Fields, or pass []string{"*"} to get every stored field.
hit.Score Each result carries a floating-point relevance score computed by Bleve's BM25-based scorer. Higher scores indicate a stronger match. You can influence scores with boost values (covered in the Querying section).

Mappings

The default mapping works well for a quick start, but real applications usually need more control over how Bleve analyzes and stores each field. A mapping tells Bleve what type each field is, which analyzer to use when tokenizing it, whether to store the original value, and whether to include it in the index at all.

Mappings can:

Control tokenization split text into terms using whitespace, language rules, edge n-grams, etc.
Filter input data lowercase terms, strip HTML, apply stop-word lists, or run a stemmer so that "running" and "runs" match the same root token
Exclude fields from the index omit sensitive or irrelevant fields to save disk space and keep the index lean
Define custom analyzers combine any tokenizer with any chain of token filters to get exactly the behavior you need

Here is a concrete example that applies language-based stemming to the Text and Title fields, and excludes a raw HTML field from being indexed at all:

import (
    "github.com/blevesearch/bleve/v2/analysis/analyzer/en"
    "github.com/blevesearch/bleve/v2/mapping"
)

func buildIndexMapping() *mapping.IndexMappingImpl {
    // English analyzer: tokenizes, lowercases, removes stop words, and stems.
    // "running" and "runs" will both match a search for "run".
    englishField := bleve.NewTextFieldMapping()
    englishField.Analyzer = en.AnalyzerName

    // A keyword analyzer treats the entire field value as a single token
    // useful for exact-match fields like URLs or tags.
    keywordField := bleve.NewTextFieldMapping()
    keywordField.Analyzer = "keyword"

    // Disable indexing for a field we only want to store, not search.
    storedOnlyField := bleve.NewTextFieldMapping()
    storedOnlyField.Index = false

    docMapping := bleve.NewDocumentMapping()
    docMapping.AddFieldMappingsAt("title", englishField)
    docMapping.AddFieldMappingsAt("text", englishField)
    docMapping.AddFieldMappingsAt("url", keywordField)
    docMapping.AddFieldMappingsAt("raw_html", storedOnlyField) // stored but not indexed

    indexMapping := bleve.NewIndexMapping()
    indexMapping.AddDocumentMapping("document", docMapping)
    indexMapping.DefaultAnalyzer = en.AnalyzerName

    return indexMapping
}

Pass the result of buildIndexMapping() to bleve.New or bleve.NewUsing when creating the index. Mappings are baked into the index at creation time and cannot be changed afterwards. To apply a new mapping you need to create a fresh index and re-index all documents.

Querying

Bleve provides a powerful built-in text query processor called QueryStringQuery. It supports field filters (title:golang), quoted phrases ("error handling"), term exclusion (go -python), wildcard patterns (auth*), and boolean operators (go AND concurrency). Its syntax closely mirrors Google's search syntax. You can read more about it here.

But where Bleve really shines is in providing composable building blocks for constructing your own domain-specific query language. The query package exposes a wide variety of primitives. Match queries, wildcard queries, range queries, boolean combinators, and more that you can wire together however you like.

Here's a simplified example from our app to demonstrate how powerful this can be:

queries := []query.Query{}
negatedQueries := []query.Query{}
for _, keyword := range strings.Fields(queryString) {
    negated := false
    // negate the term if it starts with "-"
    if cut, ok := strings.CutPrefix(keyword, "-"); ok {
        keyword = cut
        negated = true
    }

    // WildcardQuery matches the keyword anywhere inside the URL string.
    // The 10x boost means a URL match raises the document's score
    // significantly compared to a plain text match.
    //
    // The boost number 10 is arbitrary, adjust it to your needs
    urlq := bleve.NewWildcardQuery("*" + keyword + "*")
    urlq.SetField("url")
    urlq.SetBoost(10)

    // MatchQuery tokenizes the keyword with the field's analyzer before
    // matching, so stemming and stop-word removal apply automatically.
    textq := bleve.NewMatchQuery(keyword)
    textq.SetField("text")

    // Title matches are given 50x weight. A keyword found in the title
    // is a very strong signal of relevance.
    //
    // The boost number 50 is arbitrary, adjust it to your needs
    titleq := bleve.NewMatchQuery(keyword)
    titleq.SetField("title")
    titleq.SetBoost(50)

    // DisjunctionQuery is an OR combinator: the document scores as a match
    // if it satisfies *any* of the sub-queries. The final score is taken
    // from whichever sub-query scored highest.
    disjq := bleve.NewDisjunctionQuery(
        urlq,
        textq,
        titleq,
    )

    if negated {
        negatedQueries = append(negatedQueries, disjq)
    } else {
        queries = append(queries, disjq)
    }
}

// BooleanQuery is an AND/OR/NOT combinator at the keyword level:
//   - must    (first arg):  document must satisfy every query in this list
//   - should  (second arg): optional queries that boost score when matched
//   - mustNot (third arg):  document must satisfy none of these queries
//
// The result: every non-negated keyword must appear somewhere in the
// document, while negated keywords disqualify a document entirely.
fullQuery := query.NewBooleanQuery(
    queries,
    nil,
    negatedQueries,
)

Each keyword in the input string becomes its own DisjunctionQuery that spans all three fields. The BooleanQuery then requires that all keyword disjunctions are satisfied, giving us an implicit AND between keywords and a per-field OR within each keyword. Negated keywords (prefixed with -) are placed in the mustNot list and disqualify any document that matches them.

This structure is easy to extend: you could add date-range filters, weight fields dynamically based on user preferences, or introduce special syntax for field-scoped searches.

Take a look at our query builder for a more complete real-world example.

Paging

Bleve's SearchRequest controls both the page size (Size) and the starting offset of results. A natural first instinct is to use the From field, set it to 0 for the first page, 20 for the second, and so on. This works, but it has a serious problem: Bleve must score and sort all matching documents up to From + Size on every request, making deep pages increasingly expensive in both memory and CPU. Worse, if new documents are indexed between two page requests, the offset shifts and users see duplicate or missing results.

The correct approach is to use cursor-based pagination via SearchAfter and SearchBefore. These functions resume the result stream from a known position rather than re-scanning from the beginning, which is both accurate and efficient. We learned to prefer them the hard way.

const pageSize = 20

// First page, no cursor needed.
req := bleve.NewSearchRequest(myQuery)
req.Size = pageSize
req.SortBy([]string{"_score", "_id"}) // stable sort is required for cursors

results, _ := index.Search(req)

// Subsequent pages, pass the sort key of the last hit as the cursor.
if len(results.Hits) == pageSize {
    lastHit := results.Hits[len(results.Hits)-1]
    cursor := lastHit.Sort // []string, one element per sort field

    nextReq := bleve.NewSearchRequest(myQuery)
    nextReq.Size = pageSize
    nextReq.SortBy([]string{"_score", "_id"})
    nextReq.SetSearchAfter(cursor)

    nextResults, _ := index.Search(nextReq)
    // ...
}

A few things to keep in mind:

Stable sort is required. SearchAfter uses the sort key of the last result as its cursor. If the sort key is changing the cursor become invalid.
Sort is always a []string. Even when sorting by a numeric field, Bleve serializes the sort key as a string. Read the cursor from hit.Sort[0] (or whichever index corresponds to your primary sort field) and pass it directly to SetSearchAfter.
SearchBefore works the same way but moves in the opposite direction, which is useful for implementing a "previous page" button.

Handling Multiple Indexes

Bleve can transparently manage multiple indexes at the same time through IndexAlias. An alias is a virtual index that fans a query out to several real indexes and merges their results back into a single ranked list.

This is particularly useful when you want to maintain separate indexes for different languages. Each language gets its own index with a tailored analyzer (English stemming, French stop-words, custom tokenization, etc.), but a single alias lets you search all of them at once:

enIndex, _ := bleve.Open("index_en.bleve")
frIndex, _ := bleve.Open("index_fr.bleve")
deIndex, _ := bleve.Open("index_de.bleve")

// Combine all language indexes behind a single alias.
alias := bleve.NewIndexAlias(enIndex, frIndex, deIndex)

// Query the alias exactly as you would a regular index.
req := bleve.NewSearchRequest(bleve.NewMatchQuery("Hister search engine"))
req.Size = 20
results, _ := alias.Search(req)

Aliases also make hot-swapping painless. When you need to rebuild an index (for example, to apply a new mapping), you can build the new index in the background, then atomically swap it into the alias with alias.Swap(newIndexes, oldIndexes). In-flight queries complete against the old index while new queries immediately use the new one, with zero downtime.

Fine-tuning

Bleve's performance knobs are not prominently documented, but they make a real difference under load. Configuration is passed as a map[string]any to NewUsing or OpenUsing instead of the regular New/Open functions.

config := map[string]any{
    // How long the BoltDB storage layer will wait for a write lock
    // before returning an error. Increase this if you see timeout
    // errors under concurrent write load.
    "bolt_timeout": "2s",

    "scorchPersisterOptions": map[string]any{
        // Number of goroutines that flush in-memory segments to disk
        // in parallel. More workers help throughput on multi-core machines
        // at the cost of higher memory usage during flushing.
        "NumPersisterWorkers": 4,

        // Maximum bytes each persister worker holds in memory before
        // flushing. Larger values reduce I/O by writing bigger segments,
        // but increase peak memory consumption.
        "MaxSizeInMemoryMergePerWorker": 80 * 1024 * 1024, // 80 MB

        // The persister pauses merging when the number of on-disk segment
        // files is below this threshold, reducing unnecessary write
        // amplification when the index is small or lightly loaded.
        "PersisterNapUnderNumFiles": 100,
    },

    "scorchMergePlanOptions": map[string]any{
        // Segments smaller than this size are candidates for merging.
        // Raising this value reduces the total number of segments (and
        // therefore read latency) at the cost of more merge I/O.
        "FloorSegmentFileSize": 20 * 1024 * 1024, // 20 MB
    },
}

index, err := bleve.OpenUsing("my.bleve", config)

These settings live in the Scorch storage backend, which is Bleve's default. Consult the persister source for the full list of available options and their default values.

Conclusion

Bleve is one of Go's hidden gems that deserves more attention. It lets you add full-text search to your application, without complex infrastructure. The default configuration gets you up and running in minutes, while the custom mapping system, composable query primitives, performance, debugging options and deep custimzation provides a great toolset to solve specific problems optimally.

The official documentation has gaps, but the GitHub issues and real-life open-source projects fill them in well. Check out our indexer package to see all of the above concepts working together in a production codebase.

Happy indexing.

Firefox Extension IDs: The Bad and the Ugly

Adam Tauber — Thu, 09 Apr 2026 08:26:20 +0000

If you've ever developed a web application that communicates with a browser extension, you've probably encountered the subtle but significant differences between how Chrome and Firefox handle extension identifiers. While both browsers allow developers to specify static extension IDs, their implementation approaches diverge in ways that create real problems for security, privacy, user and developer experience.

This post explores an issue I discovered while building Hister. What started as a straightforward CSRF protection implementation turned into a deep dive into Firefox's extension architecture decisions.

Both Chrome and Firefox allow extension developers to have a static extension ID in their manifest. This ID serves as a persistent identifier for the extension across different installations and updates.

In Chrome (and Chromium-based browsers), extension ID handling works exactly as you'd expect:

You specify a public key in your manifest which guarantees a static extension ID
The browser uses this ID consistently
All network requests from the extension include this ID in the Origin HTTP header
Servers can identify which extension is making requests
The ID remains the same across all installations of the extension

If your extension ID is cciilamhchpmbdnniabclekddabkifhb, every installation of your extension will use that ID, and every HTTP request will identify itself with that origin.

Firefox's approach... is different:

Firefox also lets you specify a static extension ID in the manifest. However, at the moment of installation, Firefox generates a unique "internal UUID" for each installation. This UUID is what actually appears in the Origin header of HTTP requests, not the static ID you specified.

On the surface, this might seem like a minor implementation detail. In practice, it creates significant problems.

The Bad: Breaking CSRF Protection

Cross-Site Request Forgery (CSRF) protection is a fundamental security concern for any web application. The basic problem: how do you ensure that a request to your server came from your legitimate client application and not from a malicious site?

For traditional web applications, there are well-established patterns:

CSRF tokens embedded in forms
Origin HTTP header checks
SameSite cookie attributes

But browser extensions present a unique challenge. Extension code runs independently from web pages. It's not subject to the same-origin policy in the same way. This means traditional CSRF protection mechanisms don't work.

Origin Header: The Natural Solution

The Origin HTTP header was designed exactly for this purpose. When a browser makes a cross-origin request, it includes an Origin header identifying where the request came from. For extensions, this header contains the extension ID.

In Chrome, CSRF protection for extension-to-server communication is straightforward:

app.post('/api/add', (req, res) => {
  const allowedOrigin = 'chrome-extension://cciilamhchpmbdnniabclekddabkifhb';

  if (req.headers.origin !== allowedOrigin) {
    return res.status(403).json({ error: 'Invalid origin' });
  }

  // Process the request...
});

This is secure, simple, and requires no user interaction. The extension can make "authenticated" requests to your server, and you can verify they're coming from your legitimate extension, not from a malicious website or a rogue extension.

With Firefox's unique internal UUID per installation, this pattern becomes impossible: You cannot allowlist a specific origin because you don't know what the UUID will be. Each user who installs your extension gets a different UUID.

The Workaround: Manual Configuration

The only reliable solution is to require users to manually configure a shared secret:

User installs your extension
Server generates a secret token
User manually copies this token into the extension's settings
Extension includes the token in all requests
Server validates the token instead of the Origin header

This works, but it's terrible UX:

Extra setup steps discourage users
High potential for user error
Token management becomes the user's problem
Can't automatically validate origin at the HTTP layer

The Ugly: Privacy Implications

While breaking CSRF protection is bad for developers, Firefox's internal UUID approach has even more troubling implications for user privacy.

A Built-in Tracking Mechanism

The internal UUID is unique per browser installation, persistent across websites, and completely unavoidable. This way of tracking is even worse than cookies:

Tracking cookies:

Can be blocked by browser settings
Can be cleared by the user
Subject to SameSite policies
Users are increasingly aware of them
Privacy tools can block them

Firefox extension internal UUIDs:

❌ Cannot be disabled
❌ Cannot be cleared (except by reinstalling)
❌ Persist across all websites
❌ Invisible to users (not shown in extension details)
❌ Not affected by privacy tools or private browsing
❌ Unique to each browser installation

Why Did Firefox Do This?

I don't have a clear answer to that. Mozilla mentions "sandboxing and security" reasons. But, for me neither of the arguments validate the usage of "internal UUID" in the Origin HTTP header.

I can speculate on why Firefox implemented internal UUIDs:

Possible reason 1: Security isolation

Perhaps the intent was to provide better security isolation between different extension installations. If each installation has a unique ID at the browser level, it's theoretically harder for one malicious extension to impersonate another.

However, this benefit is questionable. Extension IDs are already validated by the browser. A malicious extension can't fake someone else's ID because the browser controls the Origin header generation and the extension installation process as well.

Possible reason 2: Migration from legacy extension system

Firefox underwent a major transition from legacy XUL extensions to WebExtensions. The internal UUID system might be a holdover from the legacy architecture that was never fully reconsidered.

Possible reason 3: Accidental consequence

It's possible this wasn't a deliberate design decision at all, but rather an accidental consequence of how Firefox's extension system was architected.

Whatever the reason is, the current behavior has serious flaws.

You know the issue is serious when even Chrome has a more privacy-respecting solution to the problem

UPDATE (2026.02.16)

Seems like their goal was to prevent extension fingerprinting.

The Developer Perspective

As someone building an free software project that prioritizes privacy and local-first architecture, Firefox's behavior is frustrating:

For users:

Firefox users get a worse experience (manual configuration)
The browser marketed for privacy actually creates privacy issues
No transparency about the internal UUID system

For developers:

Can't implement proper CSRF protection via Origin header
Must implement workarounds that harm UX
Documentation becomes more complex
Testing is harder (can't easily simulate multiple Firefox installations)

What Should Firefox Do?

The solution is straightforward: use a static extension ID in the Origin HTTP header, just like Chrome does.

Disclaimer

While I've spent significant amount of time researching and trying to find ways to resolve these issues, it can easily happen that I've completely missed something and there is solution to either or both of the mentioned problems. In this case please contact me at @asciimoo@chaos.social on Mastodon.

How I Cut My Google Search Dependence in Half

Adam Tauber — Thu, 02 Apr 2026 15:05:43 +0000

TL;DR: I built Hister, a self-hosted web history search tool that indexes visited pages locally. In just 1.5 months, I reduced my reliance on Google Search by 50%.

The Problem: Online Search Isn't What It Used to Be

Like many developers and knowledge workers, I found myself constantly reaching for Google Search throughout my workday. It had become such an ingrained habit that I barely noticed how often I was context-switching away from my actual work to perform searches. But over time, something had changed about the experience. The search results that once felt reliable and helpful were increasingly problematic in several ways.

Too Many Advertisements

What used to be a clean list of relevant links now requires scrolling past multiple sponsored results, shopping suggestions, and promoted content just to reach the organic results. Often, the actual information I'm looking for doesn't appear until halfway down the page, after I've mentally filtered out all the commercial noise.

Manipulative SEO Tactics

Organic results themselves have been manipulated by SEO tactics rather than truly reflecting the most relevant and helpful content. Websites optimized for search engines rather than humans dominate the rankings, while genuinely useful resources from smaller sites or personal blogs get buried on page two or three. The signal-to-noise ratio has degraded significantly.

AI Suggestions

Google has recently added AI-generated summaries at the top of many search results. While sometimes helpful, these summaries often miss crucial nuance, provide oversimplified or occasionally incorrect information, and add yet another layer between me and the actual source material I'm trying to find. For technical queries where precision matters, these AI answers can be misleading or incomplete.

Lack of Privacy

Google tracks every query I make, building a detailed profile of my interests, work patterns, and information needs. This data is used for ad targeting and who knows what else. The convenience of search comes at the cost of giving away intimate details about my work and life.

The Insight

But the realization that pushed me to build a solution was that I was often searching for pages I'd already visited. That documentation page I read last week but forgot to bookmark. That GitHub issue I commented on yesterday but couldn't remember the exact project name. Those internal wiki pages with crucial information about our infrastructure. I was using Google as a personal memory aid, outsourcing my recall to an external service that was tracking my every query. And for content behind authentication (internal tools, documentation, private repositories) Google couldn't help at all, since it can't index pages it can't access.

Two Types of Search

Thinking on how to replace Google led me to a crucial realization about the nature of search itself. When we type queries into a search box, we're actually doing one of two fundamentally different things, even though the interface is identical:

Discovery Search: Finding New Information

Discovery search is what we typically think of when we imagine "searching the internet". It's about finding information we've never encountered before. This is true exploration, we're venturing into unknown territory, discovering new resources, learning about topics we're unfamiliar with, and finding answers to questions we've never asked before. For this type of search, we genuinely need the vast index of the internet that services like Google provide. We need to cast a wide net and see what the collective knowledge of the web has to offer.

Recall Search: Refinding Known Information

But then there's the other type of search what I call "recall search". This is when we're trying to find information we've already encountered. We're not discovering something new; we're trying to remember where we saw something. Examples of this include searches like "That authentication bug I fixed last month..." when you remember solving a problem but can't recall the exact solution, or "The Bleve docs page about result highlighters..." when you know you've read the documentation before but can't remember the specific URL or section title. Another common example: "That Stack Overflow answer about async/await..". when you remember reading a particularly clear explanation but didn't save the link.

A significant portion of my daily searches — probably more than half — were recall searches, not discovery searches.

The revelation that changed everything for me was this: A significant portion of my daily searches - probably more than half - were recall searches, not discovery searches. I was constantly using Google to search my own browsing history, to refind pages I'd already visited and information I'd already read. But Google's interface treats both types of search identically, and it has no special optimization for helping you refind your own content. Worse, for pages behind authentication or on private networks, Google can't help at all because it can't index content it can't access.

This insight suggested a solution: What if I had a dedicated tool optimized specifically for recall search for refinding my own browsing history, and only fall back to Google for true discovery search?

The potential benefits were enormous:

faster results
better privacy
access to authenticated content
results tailored specifically to my interests and work

The Solution: Index Everything Locally

The solution seemed obvious once I'd articulated the problem: what if I could search my entire browsing history - including the full page content, not just URLs and titles - locally and privately? This would give me a personal search engine optimized specifically for recall search, while still allowing me to fall back to Google for discovery search when needed.

I started looking for existing solutions. Surely someone had built this before? Browser history exists, but it only stores URLs and page titles, making it nearly useless for finding pages based on their content. Some note-taking apps like Evernote or Notion offer web clippers, but they require manual action for each page you want to save. Personal knowledge management tools like Omnom exist, but they're focused on curated notes rather than comprehensive browsing history, but they require conscious decisions about what to save.

None of the existing tools I found met all my requirements. I needed something that combined the comprehensive automatic capture of browser history, the full-text search capabilities of a search engine, the performance of local software, and the privacy of self-hosted solutions. Since nothing existed that checked all these boxes, I decided to build it myself.

What I Needed

The requirements for my ideal solution were clear:

Fast lookup If searching my local index took longer than just Googling, I'd never use it. I needed instant, sub-second search response times, keyboard shortcuts to make it faster to search locally than to context-switch to Google.

Automatic indexing I didn't want to manually save pages or make conscious decisions about what to index. It needed to capture pages as I browse with zero manual work on my part. The tool should disappear into the background and just work.

Authentication aware indexing So much of the content I reference daily is behind authentication: internal wikis, private documents, authenticated API documentation, internal dashboards. Any solution that couldn't handle authenticated content would miss a huge portion of my actual browsing.

Full-text search Meant searching the actual page content, not just URLs and titles. Browser history is useless when you remember reading something about "microservice authentication patterns" but can't remember which blog or doc site it was on. I needed to be able to search the words within the pages.

Powerful query capabilities Like Boolean operators (AND, OR, NOT), field-specific searches (search only URLs, or only titles), and wildcard matching would make it possible to narrow down results quickly.

Zero cognitive overhead The tool needed to work seamlessly in my workflow. It should integrate naturally with how I already browse and search.

Transparent fallback to online search engines If I searched locally and didn't find what I wanted, I should be able to immediately fall back to Google with the same query, making adoption gradual rather than requiring a complete workflow change.

Fine-tuning capabilities Let me customize the experience over time. I wanted to be able to blacklist irrelevant sites I never want to see again, prioritize important sources, and create keyword aliases for common searches.

Offline preview of saved content I could read indexed pages even if the original site went down or the page was deleted; a nice bonus that would occasionally save me from link rot.

Import existing history I wanted to start with years of browsing data already indexed, rather than building up an index from scratch over months.

Free software Self-Hosted, with no recurring costs or vendor lock-in. My browsing history is my personal data, it should not be owned by companies.

No existing tool checked all these boxes. So I decided to build Hister.

Introducing Hister

Hister is a self-hosted web history management tool that treats your browsing history as a personal search engine.

The Results: 50% Reduction in 1.5 Months

After using Hister for six weeks, I analyzed my search patterns:

~50% of my Google searches now answered locally
Found content Google couldn't (authenticated pages, deleted content)
Zero privacy concerns No tracking, no profiling
Better results for my specific needs (because it's MY history)

The more I use it, the better it gets. My local index is now:

More relevant than Google for my common queries
As fast as opening a new browser tab
Comprehensive across authenticated services
A personal knowledge base of everything I've read

Unexpected Benefits

Rediscovery: I'm finding valuable content I'd forgotten about. That article I bookmarked 2 years ago but never revisited? Now it shows up in relevant searches.

Learning patterns: Seeing what I search for reveals my knowledge gaps and interests.

Offline access: When documentation sites go down or pages get deleted, I still have the content.

Conclusion

We've accepted that search means "go to Google" for so long that we've forgotten there are alternatives. But for a huge portion of my daily searches - probably more than half - I don't need the entire internet. I need OUR internet: the pages I've read, the docs I've opened, the internal tools I use daily.

Hister isn't trying to replace Google for discovery. It's trying to replace Google for recall. And in that domain, it's already better than Google could ever be, because:

It knows about authenticated pages Google will never see
It searches YOUR history, not the entire web
It's instant, private, and ad-free
It gets better the more you use it

After 1.5 months, I've cut my Google dependence in half. I expect this number will increase as my index grows.

If you're a developer, researcher, or knowledge worker who constantly re-searches for information you've already found, give Hister a try. It might just change how you find information on the internet.

Before Hister:

Open Google
Search: "bleve query"
Click first result (probably wrong)
Click second result (looks familiar…)
Realize I've been here before
Finally find the specific page I wanted

Time: ~1-2 minutes, 5-10 clicks

With Hister:

Open Hister
Type: "bleve query", press enter
First result is opened with the EXACT page I visited last month

Time: ~5 seconds, few keystrokes

Take Back Your Search

To get started with Hister check out the following links:

Future Development

I'm actively developing Hister with these goals:

Improve usability
Add automatic indexing capabilities based on the index and opened results
Find a secure and privacy respecting way to connect local Hister's to a distributed search engine
Export search results
Advanced analytics and search insights

Hister is open source (AGPLv3) and welcomes contributions!

Ways to Contribute

🐛 Report bugs and suggest features on GitHub Issues
💻 Submit pull requests (check out good first issues)
📖 Improve documentation
🎨 Design better UI/UX
🌍 Translate to other languages
⭐ Star the repo and spread the word!

Have questions or feedback? Open an issue on GitHub or reach out to @asciimoo.

How to Scrape Instagram Profiles

Adam Tauber — Mon, 13 Nov 2017 00:00:00 +0000

Scraping can be tedious work especially if the target site isn't just a standard static HTML page. Plenty of modern sites have JavaScript only UIs where extracting content is not always trivial. Instagram is one of these websites, so I would like to show you how it is possible to write a scraper relatively fast to get images from Instagram. I'm using Colly, a scraping framework for Golang. The full working example can be found here.

Information gathering

First, if we view the source code of a profile page (e.g. https://instagram.com/instagram), we can see a bunch of JavaScript code inside the body tag instead of static HTML tags. Let's take a closer look at it. We can see that the first script is just a variable declaration where a huge JSON is assigned to a single variable (window._sharedData). This JSON can be easily extracted from the script tag by finding the first { character and getting the whole content after it:

jsonData := scriptContent[strings.Index(scriptContent, "{") : len(scriptContent)-1]

Note that because it is a JavaScript variable declaration it has a trailing semicolon what we have to cut off to get a valid JSON. That's why the example above ends with len(scriptContent)-1.

The formatted view of the extracted JSON reveals all the information we are looking for. The JSON contains information about a user's images and some metadata of the profile (e.g. the profile ID is 25025320). There is an interesting part of the metadata called page_info:

 "page_info": {
  "has_next_page": true,
  "end_cursor": "AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA"
 }

Probably, the value of end_cursor' is the attribute of the URL to get the next page when has_next_page is true.

Format JSONs with the handy jq command line tool

Paging

The next page of the user profile is retrieved by an AJAX call, so we have to use the browser's Network Inspector to find out what is required to fetch it. Network Inspector shows a long and cryptic URL which has two GET parameters query_id and variables:

https://www.instagram.com/graphql/query/?query_id=17888483320059182&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA%22%7D

It seems like Instagram uses a GraphQL API and the value of variables GET parameter is an URL encoded value. We can decode it with a single line of Python code:

$ python -c 'import urlparse;print(urlparse.parse_qs("variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA%22%7D")["variables"][0])'
{"id":"25025320","first":12,"after":"AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA"}

As you can see it is a JSON object and the value of the after attribute is the same as the value of the end_cursor and id is the ID of the profile.

The only unknown information in the next page URL is the query_id GET parameter. The HTML source code does not contain it, nor the cookies or response headers. After a little bit of digging it can be found in a static JS file included in the main page and seems it is a constant value.

The format of the response is also JSON but the structure is different from what we've found on the main page. This JSON contains the same information as the previous one, however we cannot use the same method to extract data due to structural differences.

Building the scraper

The information gathering phase clearly shows that we need four building blocks to be able to fetch all images found on an Instagram profile. Let's do it using Colly.

Extract and parse JSON from the main page

To extract content from HTML we need a new Collector which has a HTML callback to extract the JSON data from the script element. Specifying this callback and when it must be called can be done in OnHTML function of Collector.
The JSON can be easily converted to native Go structure using json.Unmarshal from the standard library.

c := colly.NewCollector()

c.OnHTML("body > script:first-of-type", func(e *colly.HTMLElement) {
    // find JSON string
    jsonData := e.Text[strings.Index(e.Text, "{") : len(e.Text)-1]

    // parse JSON
    data := struct {
       EntryData struct {
           ProfilePage []struct {
               User struct {
                   Id    string `json:"id"`
                   Media struct {
                       Nodes []struct {
                           ImageURL     string `json:"display_src"`
                           ThumbnailURL string `json:"thumbnail_src"`
                           IsVideo      bool   `json:"is_video"`
                           Date         int    `json:"date"`
                           Dimensions   struct {
                               Width  int `json:"width"`
                               Height int `json:"height"`
                           }
                       }
                       PageInfo pageInfo `json:"page_info"`
                   } `json:"media"`
               } `json:"user"`
           } `json:"ProfilePage"`
       } `json:"entry_data"`
    }{}
    err := json.Unmarshal([]byte(jsonData), &data)
    if err != nil {
        log.Fatal(err)
    }

    // enumerate images
    page := data.EntryData.ProfilePage[0]
    actualUserId = page.User.Id
    for _, obj := range page.User.Media.Nodes {
        // skip videos
        if obj.IsVideo {
            continue
        }
        c.Visit(obj.ImageURL)
    }
    // ...
}

Create and visit next page URLs

The format of the next page URL is fixed, so a format string can be declared which accepts the changing id and after parameters.

const nextPageURLTemplate string = `https://www.instagram.com/graphql/query/?query_id=17888483320059182&variables={"id":"%s","first":12,"after":"%s"}`

Parse next page JSONs

This is pretty much the same as the conversion of the main page's JSON except these responses have some different attribute names (e.g. the image url is display_url instead of display_src).

Download and save images extracted from JSONs

After requesting images from Instagram using the Visit function, responses can be handled in OnResponse. It requires a callback as a parameter which is called after the response has arrived. To select responses which include images, we should filter based on Content-Type HTTP header. If it is image, it must be saved.

c.OnResponse(func(r *colly.Response) {
    if strings.Index(r.Headers.Get("Content-Type"), "image") > -1 {
        r.Save(outputDir + r.FileName())
        return
    }
    // handle further response types...
}

Epilogue

Scraping JS-only sites isn't always trivial, but can be handled without headless browsers and client side code execution to achieve great performance. This scraper example downloads approximately 1000 images a minute on a single thread over a regular home Internet connection.

It can be tweaked further to handle videos and extract meta information.