Forem: Patrick Londa

Breaking Logging's Flywheel of Compromises

Patrick Londa — Tue, 19 May 2026 18:26:09 +0000

Authored by Mike Neville-O'Neill

Let's face it — logging is broken. Not just a little broken, but fundamentally misaligned with the needs of modern engineering teams. At a recent AWS Summit talk in London, Benoit Gaudin (our Head of Infrastructure) and I shared Bronto's vision for fixing this mess once and for all.

The Problem We're All Living In

If you're running any significant infrastructure today, you're probably stuck in what we call the "3C flywheel of compromises":

Cost — Logging at scale has become ridiculously expensive
Coverage — So you cut corners, dropping those infra logs and long-tail workflows
Complexity — And end up with a Frankenstein's monster of 5–8 different systems duct-taped together

This isn't just inefficient — it's actively harmful. Engineers end up building parallel solutions just to get basic visibility because the main tool is too limited, too slow, or too expensive.

Logs Matter More Than Ever

Logs aren't just a compliance checkbox anymore. They're your operational ground truth in the AI era.

They feed your LLMs. They power your agents. They're your audit trail, your RAG source, your behavioral training set. And one log message from an LLM-based system might contain 50–100 nested events in a single payload.

Try scaling that with a solution built before the separation of compute and storage was even a thing.

How We're Breaking the Cycle

Bronto was built to tackle this head-on with three non-negotiable capabilities:

Subsecond search on all logs — whether they're two seconds or two years old
Petabyte-scale retention — no infrastructure for you to manage
Completely different pricing — think cents per GB, not dollars

The platform is built natively on AWS (S3, Lambda, DynamoDB), but engineered so you don't have to deal with pipelines, pre-processing, or glue code.

Bronto's Architectural Advantage

The ingestion layer accepts data from standard sources — OpenTelemetry Collector, FluentD, FluentBit — through HTTP endpoints, with AWS EC2 load balancers doing the heavy lifting. Data is buffered through Kafka (AWS MSK), but then things diverge from the standard playbook.

Instead of following traditional approaches, data is consumed from Kafka and written to S3 in a proprietary format that borrows techniques from data analytics: data partitioning, Bloom filtering, predicate pushdown, compression, and columnar formats. Metadata lives in DynamoDB for fast lookups.

The real magic happens at search time. When you query through the UI or API, Lambda functions launch in parallel and process data directly from S3. No overprovisioning for big queries — horizontal scaling on demand, paying only while functions run.

This architecture is what enables both the performance (subsecond on terabytes, seconds on petabytes) and the pricing model. No expensive clusters running 24/7 — just cloud resources used exactly when and where they're needed.

Real Teams, Real Results

API-First Content Platform

This team runs a massive content delivery platform, serving APIs behind a global CDN for websites, mobile apps, and e-commerce systems. Every request hits their API with a unique key, and they need to trace errors, group by status code, and export logs to their own customers.

Before Bronto

40TB monthly ingestion cap
30+ minute query times (when they worked at all)
Dashboards that routinely failed
Constant budget pressure

After Bronto

Boosted ingestion to 60TB monthly
Cut their logging bill in half
Complex multi-day queries now return in under a second
Built reliable log exports for their own customers

Their exact words? "Bronto changed our lives." A logging tool. Actually improving engineers' lives.

Global SaaS Project Management Platform

This company runs a suite of SaaS tools across distributed cloud services and product lines.

Before Bronto

Graylog for live logs
S3 for long-term storage
HAProxy logs dumped into S3 with gnarly Athena queries
A mix of Athena, Superset, and QuickSight for analytics
Just 1–2 days of retention across most systems

After Bronto

Everything centralized — HAProxy, Kubernetes, application logs, audit trails
Extended to 90-day hot retention
Real dashboards tracking error spikes, traffic anomalies, and app version drift
Engineers focused on product, not maintaining logging infrastructure

They went from managing logs to actually using them.

Logs as Your Secret Weapon

Your log data is massively undervalued — not because it lacks signal, but because current tooling hides that signal behind cost barriers, friction, and compromises.

Logs used to be a liability. With the right approach, they can be your secret weapon.

We're building Bronto to be for logging what Dyson was for vacuum cleaners, what the iPhone was for smartphones, and what Tesla was for electric cars: a complete reimagining of what's possible when you refuse to accept the status quo.

After all, when was the last time your logging tool made your life better instead of worse?


The CDN Logging Crisis

Patrick Londa — Tue, 19 May 2026 17:39:56 +0000

Authored by Benoit Gaudin

Every second, your CDN is generating thousands of logs that tell a critical story about your application's performance, security, and user experience. For large enterprises, this can mean terabytes of log data every day — data that contains invaluable insights about your business.

But here's the uncomfortable truth: most organizations capture only a small fraction of their CDN logs, and retain that limited data for just days or weeks. This isn't because engineering teams don't understand the value. It's because the economics of traditional logging solutions make comprehensive CDN logging prohibitively expensive.

The result? Critical blind spots that can be extremely costly during outages, security breaches, or major events.

Welcome to the flywheel of compromises:

Cost — Traditional logging vendors charge egregious per-GB rates that make comprehensive CDN logging unaffordable
Coverage — Companies respond by severely limiting what logs they collect and how long they retain them
Complexity — To compensate for coverage gaps, teams cobble together 5–8 different logging solutions, creating a management nightmare

The Current State of CDN Logging

The observability sector today resembles markets before transformative innovation — vacuum cleaners before Dyson, mobile phones before iPhone, electric cars before Tesla. Existing solutions were designed for a completely different era: before the separation of compute and storage, before the explosion of log data volumes, and certainly before the demands of the AI era.

Consider how most logging vendors operate today:

Datadog charges around $2–5 per GB for log ingestion with 15-day retention. A company generating 10TB of CDN logs daily could pay upwards of $600,000 per month
Splunk forces customers into complex licensing schemes that effectively limit how much data they can realistically log
New Relic and other vendors offer marginally better pricing but still force unacceptable trade-offs between cost and coverage
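The $600,000 figure follows from simple arithmetic at the low end of the quoted price range. A minimal sketch, assuming a 30-day month, decimal units (1 TB = 1,000 GB), and ingestion charges only (retention, indexing, and search fees are not modelled); the function name is hypothetical:

```python
def monthly_ingest_cost(daily_tb: float, price_per_gb: float, days: int = 30) -> float:
    """Estimate monthly log-ingestion spend.

    Assumes decimal units (1 TB = 1,000 GB) and a flat per-GB ingestion
    price; retention and search charges are not included.
    """
    monthly_gb = daily_tb * 1_000 * days
    return monthly_gb * price_per_gb


# 10 TB/day of CDN logs at the low and high end of the quoted $2-5/GB range
low = monthly_ingest_cost(10, 2.0)
high = monthly_ingest_cost(10, 5.0)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

At the upper end of the same range, the bill for the same 10TB/day reaches $1.5 million per month.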

What's most frustrating is that these pricing models persist despite dramatic changes in the underlying technology. The separation of compute and storage has revolutionized data economics across virtually every other category of software, yet logging vendors continue to operate on business models created 15 years ago.

A Hypothetical (But Entirely Plausible) Scenario

To illustrate the real-world impact of incomplete CDN logging, consider this:

A week before a major live streaming event, a provider's engineering team makes a routine CDN configuration change. Under normal traffic loads, the misconfiguration goes unnoticed — cache hit ratios remain stable and performance appears normal.

After a week, any trace of the configuration change disappears from their logs due to their 7-day retention policy. Capacity planning teams review infrastructure and assume current backend capacity can handle the anticipated load — after all, it worked fine during the last similar event. Unfortunately, the now-invisible change makes that assumption dangerously wrong.

During the live event, CDN cache efficiency plummets under heavy load. Backend servers get hit much harder than expected. Users experience buffering and connection problems, but the operations team struggles to diagnose the root cause.

By the time they identify the issue — tracing it back to the forgotten configuration change — the damage is done. Over a million viewers have abandoned the stream, social media is flooded with complaints, and the company's stock takes a hit.

With complete CDN logging and longer retention, they could have:

Identified when the degradation trend first appeared, correlating it to the configuration change
Maintained visibility throughout the planning period
Quickly correlated the performance issues with the earlier change during the incident

Limited logging coverage transformed a minor configuration error into a major business incident. The cost of their logging "savings"? Potentially millions in lost ad revenue and subscription cancellations.

The Three Horsemen of the Logging Apocalypse

Cost Explosion

Traditional logging vendors price their products based on data volume, charging premium rates for both ingestion and storage. This pricing model was created when storage was genuinely expensive. In 2025, with cloud storage costs continuing to plummet, this model serves primarily to protect vendor margins.

For CDN logs — which are high-volume by nature — this creates an impossible equation. When faced with estimates of $500,000+ monthly for complete CDN logging, even the most data-driven organizations are forced to compromise.

Coverage Sacrifice

The inevitable result of cost pressure is reduced coverage. Organizations typically:

Ingest only a sample of the data
Limit retention to days instead of months
Exclude high-volume CDNs or regions entirely
Drop detailed fields that would aid troubleshooting

These compromises create dangerous blind spots. Intermittent issues, security threats that develop over time, and regional performance problems remain invisible. When an incident occurs, teams often discover they're missing exactly the data they need.
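Sampling makes this risk easy to quantify. If each event is kept independently with probability p, an issue that emits k matching events is missed entirely with probability (1 - p)^k. A minimal sketch, assuming independent random sampling (real pipelines may sample per request, per host, or by hash, which changes the math); the function name is illustrative:

```python
def miss_probability(sample_rate: float, occurrences: int) -> float:
    """Probability that none of `occurrences` events survive independent sampling."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return (1.0 - sample_rate) ** occurrences


# With a 10% sample, a one-off error is dropped 90% of the time,
# and an error that occurs 5 times is still missed in ~59% of cases.
for k in (1, 5, 20):
    print(k, round(miss_probability(0.10, k), 3))
```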

Complexity Creep

To compensate for coverage limitations, organizations implement a patchwork of supplementary solutions:

Self-hosted ELK stacks for longer-term storage (with all the maintenance overhead)
Cloud provider-specific logging solutions (AWS CloudWatch, GCP Logging)
Custom scripts to archive logs to object storage with rehydration workflows
Open-source tools for log analysis and visualization

The result is a Frankenstein's monster of logging infrastructure that no one fully understands, requires constant maintenance, and still fails to provide comprehensive visibility.

CDN Logging for the AI Era

These challenges are escalating as we enter the AI era:

Exploding volumes — Microservices, containers, and edge computing are all contributing to the data deluge
AI-powered analysis — ML systems require comprehensive, long-term data to identify patterns and anomalies effectively
Agentic applications — Autonomous applications require complete historical data to make intelligent decisions

Legacy logging business models simply cannot accommodate these realities. They weren't designed for terabytes of daily log ingestion, years of retention, or a world where AI agents might need to analyze months of historical CDN patterns.

A Different Approach

Solving the CDN logging crisis requires rebuilding the logging stack from the ground up — not incremental improvements on broken foundations. Three core principles drive the right approach:

1. Economics Aligned with Modern Infrastructure

Leveraging the separation of compute and storage to deliver CDN logging at a fraction of traditional costs:

90% cost reduction compared to Datadog and similar vendors
12-month retention by default
No charges for search or compute resources

2. Lightning-Fast Search Across Petabytes

"Tracey's Law": the faster you make log search, the more valuable logging becomes to an organization.

Sub-second search across terabytes of CDN logs
Seconds-long queries across petabytes
No rehydration from cold storage, ever
Fast dashboards even across months of data

When queries return in seconds instead of minutes (or timing out entirely), teams use logging data proactively rather than as a last resort.

3. A Single Unified Logging Layer

Eliminating the patchwork by providing one comprehensive logging layer:

All CDN providers in one place
Drop-in replacement for existing solutions
Two-line configuration change for implementation
Automatic parsing and PII removal

Breaking Free from the Flywheel

The CDN logging crisis isn't just a technical problem — it's a business problem with real implications for reliability, security, and user experience. For too long, organizations have accepted a dysfunctional status quo because there seemed to be no alternative.

"Every single word about the logging crisis resonates. We were spending over $400,000 monthly on CDN logging with Datadog, and still only capturing about 20% of our logs. With Bronto, we now have 100% coverage, 12-month retention, and our bill is under $40,000."

This isn't an incremental improvement — it's a fundamental reinvention of how logging works. Just as Apple reinvented the smartphone, Dyson reinvented the vacuum cleaner, and Tesla reinvented the electric car, the logging industry is overdue for the same transformation.

Bronto is reinventing logging from the ground up for the AI era. The team brings 150+ years of collective logging domain expertise, with previous experience building and scaling logging platforms at IBM, Rapid7, and Logentries.


Logging Your AI Events (from Ollama) in Bronto

Patrick Londa — Tue, 19 May 2026 16:16:59 +0000

Authored by David Tracey

Many software companies are investigating the use of Large Language Models (LLMs) in their products. At Bronto we've announced our Bronto Labs initiative, with AI features including auto-parsing, AI dashboard creation, and Bronto Scope for error investigation.

This post explores a different angle: using logs in the development of AI applications. We'll focus on Ollama — an open source tool for running LLMs locally — and show how to pipe its logs into Bronto for search and analysis.

LLMs are complex, non-deterministic systems. Beyond traditional logging use cases (performance monitoring, API usage), their unpredictable nature increases the need for logging — particularly to record and track responses to prompts. Individual log events can be large when they include a full prompt or response. Meta found this problem significant enough at their scale to build a dedicated Meta AI Logging Engine.

The fundamental requirements for logging AI applications are:

Ability to handle large log events
Ability to handle high volumes at low cost
Ability to search across high volumes quickly

These are exactly the requirements Bronto was designed to meet.

Setting Up Ollama

Recommended specs:

16GB RAM (8GB works for smaller models)
12GB disk space for Ollama and basic models
Modern CPU with at least 4 cores (8 preferred)
Optional: GPU for improved performance

Install and Run the Server

Install from ollama.com/download for your OS, then start the server:

ollama serve

You'll see output including the default port it's listening on (11434).

Download and Run a Model

# Pull a model from the registry
ollama pull gemma:2b

# List downloaded models
ollama list

# Run a model interactively
ollama run gemma:2b

The run command gives you a >>> prompt where you can enter prompts or /help for commands.

Sending Ollama Logs to Bronto

Step 1: Configure Ollama Logging to File

Stop the server and restart it writing logs to a file:

ollama serve > /your_log_path/.ollama/logs/server.log 2>&1

For more detailed debug logs, add the following to your shell profile (.zprofile etc.), then restart the server from a new shell so the variables take effect:

export OLLAMA_LOG_LEVEL=DEBUG
export OLLAMA_DEBUG=true

To redirect model client logs:

# stderr only (keeps console interactive)
ollama run gemma:2b 2>>/your_log_path/.ollama/logs/gemma.log

# both stdout and stderr (API use only — disables console input)
ollama run gemma:2b > /your_log_path/.ollama/logs/gemma.log 2>&1

Verify logs are flowing:

tail -f /your_log_path/.ollama/logs/server.log

Step 2: Install OpenTelemetry Collector

Download for your platform from opentelemetry.io. Example for Mac ARM64:

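curl --proto '=https' --tlsv1.2 -fOL \
  https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.114.0/otelcol-contrib_0.114.0_darwin_arm64.tar.gz

# Extract the collector binary from the downloaded archive
tar -xzf otelcol-contrib_0.114.0_darwin_arm64.tar.gz

chmod +x otelcol-contrib
# Moving into /usr/local/bin may require sudo
mv otelcol-contrib /usr/local/bin/otelcol

# Verify
otelcol --version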

Step 3: Configure OpenTelemetry to Forward to Bronto

Create /etc/otelcol/config.yaml:

receivers:
  filelog/Ollama_Server:
    include:
      - /your_log_path/.ollama/logs/server.log
    resource:
      service.name: LaptopServer
      service.namespace: Ollama

  filelog/Ollama_Gemma:
    include:
      - /your_log_path/.ollama/logs/gemma.log
    resource:
      service.name: LaptopGemma
      service.namespace: Ollama

processors:
  batch:

exporters:
  otlphttp/brontobytes:
    logs_endpoint: "https://ingestion.us.bronto.io/v1/logs"
    compression: none
    headers:
      x-bronto-api-key: replace_this_with_your_bronto_apikey

service:
  pipelines:
    logs:
      receivers: [filelog/Ollama_Server, filelog/Ollama_Gemma]
      processors: [batch]
      exporters: [otlphttp/brontobytes]
  # Useful for debugging:
  # telemetry:
  #   logs:
  #     level: "debug"
  #     output_paths: [/your_log_path/otelcol/debug.log]

Validate and run:

otelcol validate --config=/etc/otelcol/config.yaml
otelcol --config=/etc/otelcol/config.yaml

A Simple Ollama API Program

The Python script in the appendix (ollama-log-demo.py) uses the Ollama API to send a prompt, together with the contents of a log file, to a model and print the streamed response. Example usage:

# Summarize 100 lines of CDN logs
python3 ollama-log-demo.py 100lines-CDN-log.csv \
  --model "gemma:2b" \
  --prompt "You have been given 100 lines from a CDN log in CSV format. Summarise the logs provided."

# Find errors and suggest fixes
python3 ollama-log-demo.py 100lines-search-log.csv \
  --model "gemma:2b" \
  --prompt "Find errors in this log and suggest how to fix them"

The final JSON object of each streamed Ollama response (the one with "done": true) includes useful performance metadata:

| Field | Description |
|---|---|
| `total_duration` | Total time spent generating the response (nanoseconds) |
| `load_duration` | Time spent loading the model (nanoseconds) |
| `prompt_eval_count` | Number of tokens in the prompt |
| `prompt_eval_duration` | Time spent evaluating the prompt (nanoseconds) |
| `eval_count` | Number of tokens in the response |
| `eval_duration` | Time spent generating the response (nanoseconds) |
| `context` | Encoding of the conversation; pass it in the next request to maintain memory |
| `response` | Empty in the final object if streamed; the full response if not streamed |
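Because the duration fields are reported in nanoseconds, throughput can be derived directly from the final response object: tokens per second is eval_count divided by eval_duration, scaled by 10^9. A small helper, assuming `final` is the last decoded JSON object of a streamed /api/generate response; the function name and sample values are made up for illustration, not measured output:

```python
NS_PER_SECOND = 1_000_000_000


def generation_stats(final: dict) -> dict:
    """Derive human-friendly timings from Ollama's final response metadata."""
    stats = {}
    if final.get("eval_count") and final.get("eval_duration"):
        # Multiply first so integer inputs divide exactly where possible
        stats["response_tokens_per_s"] = final["eval_count"] * NS_PER_SECOND / final["eval_duration"]
    if final.get("prompt_eval_count") and final.get("prompt_eval_duration"):
        stats["prompt_tokens_per_s"] = (
            final["prompt_eval_count"] * NS_PER_SECOND / final["prompt_eval_duration"]
        )
    if final.get("total_duration"):
        stats["total_s"] = final["total_duration"] / NS_PER_SECOND
    return stats


# Illustrative values only
example = {
    "eval_count": 250, "eval_duration": 5_000_000_000,
    "prompt_eval_count": 1_200, "prompt_eval_duration": 2_000_000_000,
    "total_duration": 8_000_000_000,
}
print(generation_stats(example))
```

Logging these derived values alongside each prompt makes it straightforward to compare models (for example gemma:2b against mistral) in a log search tool.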

Model notes from testing: gemma:2b is good for summarizing but tends to give high-level summaries even when asked for specifics. mistral takes longer but produces more detailed, data-specific responses. Defining the right prompt for your use case is key.

Searching Ollama Logs in Bronto

Ollama server logs include a mix of structured and unstructured entries:

Standard log levels:

INFO [main] HTTP server listening | hostname="127.0.0.1" port="11434"
level=INFO source=sched.go:714 msg="new model will fit in available VRAM"
level=DEBUG source=memory.go:103 msg=evaluating library=metal gpu_count=1

Model and resource logs:

llm_load_print_meta: max token length = 93
llama_model_loader: - kv 0: general.architecture str = gemma
level=INFO source=server.go:105 msg="system memory" total="8.0 GiB" free="1.2 GiB"

Even a small test with short prompts generates surprisingly large log volumes — 244 events totaling ~2MB in our test. Bronto handles these unstructured and semi-structured formats natively, and you can add a custom parser to make them more convenient to search and view.
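As an illustration of what such a parser does, the sketch below splits the key=value style lines shown above into fields and keeps unstructured lines whole. It is a hypothetical helper written for this post, not Bronto's parser, and assumes values are either double-quoted strings or runs of non-space characters:

```python
import re

# key=value pairs where the value is a double-quoted string (escaped quotes
# allowed) or a run of non-space characters
PAIR_RE = re.compile(r'([A-Za-z_][\w.]*)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_ollama_line(line: str) -> dict:
    """Parse a logfmt-style Ollama server line into a dict of fields."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    if not fields:
        # Unstructured lines (e.g. llm_load_print_meta output) are kept whole
        fields["message"] = line.strip()
    return fields


line = 'level=INFO source=server.go:105 msg="system memory" total="8.0 GiB" free="1.2 GiB"'
print(parse_ollama_line(line))
```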

Example searches in Bronto:
Fig.1 — Searching for log events containing "tokens"

Fig.2 — Searching for log events containing "prompt"

Fig.3 — Grouping by prompt evaluation time per task_id

Conclusion

This post introduced Ollama as an example of an LLM system and explained why AI applications create unique logging challenges — large events, high volumes, non-deterministic outputs, and distributed agents. We walked through setting up Ollama locally, configuring OpenTelemetry to forward logs to Bronto, and writing a simple Python API program to experiment with prompts against log data.

Future posts will develop the theme further with other AI systems including AWS Bedrock.

Appendix: `ollama-log-demo.py`

import argparse
import json

import requests

# Update this URL to match your Ollama API endpoint
OLLAMA_URL = "http://localhost:11434/api/generate"

# Metadata reported in the final streamed object (durations are in nanoseconds)
STAT_FIELDS = (
    "load_duration",
    "total_duration",
    "eval_duration",
    "prompt_eval_duration",
    "prompt_eval_count",
    "eval_count",
)


def print_ollama_stats(json_response):
    """Print the performance metadata from the final Ollama response object."""
    for field in STAT_FIELDS:
        value = json_response.get(field)
        if value is not None:
            print(f"\n--- {field} = {value}")


def examine_log_with_prompt(file_path, input_prompt, input_model):
    # Read the whole log file; it is appended to the prompt as context
    with open(file_path, "r", encoding="utf-8", errors="replace") as file:
        log_data = file.read()

    req_params = {
        "model": input_model,
        "prompt": f"{input_prompt}\n\n{log_data}",
    }

    try:
        # /api/generate streams one JSON object per line by default
        response = requests.post(OLLAMA_URL, json=req_params, stream=True)
        if response.status_code != 200:
            print(f"\nError - Response Status code: {response.status_code}")
            print(response.text)
            return

        print("\n--- Processing Successful Ollama Response ---")
        line_count = 0
        last_response = None  # the final object carries the performance stats
        for line in response.iter_lines():
            if not line:
                continue
            line_count += 1
            try:
                json_response = json.loads(line.decode("utf-8"))
            except UnicodeDecodeError as e:
                print(f"Error decoding line to UTF-8 on line {line_count}: {e}")
                continue
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line {line_count}: {e}")
                continue
            if "error" in json_response:
                print(f"\nOllama error: {json_response['error']}")
                return
            print(json_response.get("response", ""), end="", flush=True)
            last_response = json_response

        if last_response is None:
            print("No JSON lines found or response was empty.")
            return
        print("\n--------------------------------------------------")
        print_ollama_stats(last_response)
        print("\n--------------------------------------------------")
    except requests.RequestException as e:
        print(f"Request to Ollama failed: {e}")


def main():
    parser = argparse.ArgumentParser(description="Ollama API Demo for Logs")
    parser.add_argument("file", type=str, help="Path to the log file to be examined")
    parser.add_argument("--model", type=str, required=True, help="Model to use in analysis, e.g. gemma:2b")
    parser.add_argument("--prompt", type=str, required=True, help="Prompt to send to model")
    args = parser.parse_args()
    examine_log_with_prompt(args.file, args.prompt, args.model)


if __name__ == "__main__":
    main()


The Log Management Cost Trap: Part III — Search

Patrick Londa — Tue, 19 May 2026 13:26:43 +0000

Authored by Benoit Gaudin

In Part I (Ingestion) and Part II (Storage) of this series, I explored the challenges of designing, running, and managing a centralised log management solution. In Part III, I'll focus on search.

The Competing Requirements of Log Search

Log data search has two distinct use cases with fundamentally different requirements.

Real-time troubleshooting — when a system outage occurs, engineers need visibility into what caused the issue immediately. Log data must be searchable almost as soon as it's generated. This imposes a hard constraint: batch windows must be short. And short batch windows tend to produce small files.

Large-scale historical analysis — analyzing web or CDN access logs to identify patterns in API usage, track slowly degrading performance trends, or audit activity over weeks or months. Here, data freshness is irrelevant. What matters is the ability to efficiently scan large datasets.

These two use cases create a direct tension. Making data available quickly often means processing small batches and creating many small files — which severely degrades performance when running queries across long time ranges. This is the classic small file problem.

A good log management solution must balance both: newly ingested data searchable immediately, stored in a format that also supports efficient querying over time.

Performant and Cost-Effective Search

As covered in Part II, the right data format and storage strategy are the foundation. Key techniques include indexing, Bloom filtering, and data partitioning.

Needle-in-a-haystack queries

Indexing and Bloom filtering shine when searching for data that appears infrequently across a large time range, for example finding a specific trace_id across several terabytes of log data. As explained in "Why is Bronto so fast at searching logs", well-designed indexing and Bloom filtering can dramatically reduce the volume of data scanned, narrowing the dataset to a much smaller subset more likely to contain the target value.
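The reason a Bloom filter lets a search skip data: each stored block keeps a small bit array summarising the values it contains. If any of a value's bit positions is unset, the value is definitely absent and the block need not be read; false positives are possible, false negatives are not. A minimal sketch, with the filter size, hash count, double-hashing scheme, and block names chosen purely for illustration (this is not Bronto's on-disk format):

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive k bit positions from two 64-bit slices of one digest
        digest = hashlib.sha256(value.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step spreads positions
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


# One filter per stored block; a trace_id lookup only opens blocks that might match
blocks = {
    "block-000": [f"trace-{i:06d}" for i in range(0, 500)],
    "block-001": [f"trace-{i:06d}" for i in range(500, 1000)],
}
filters = {}
for name, trace_ids in blocks.items():
    bf = BloomFilter()
    for tid in trace_ids:
        bf.add(tid)
    filters[name] = bf

target = "trace-000742"
candidates = [name for name, bf in filters.items() if bf.might_contain(target)]
print(candidates)
```

With these sizes the expected false-positive rate per block is roughly 0.1%, so the lookup almost always reads a single block instead of both.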

Full-scan analytical queries

Some queries can't be narrowed. If you want the maximum response time per endpoint over the past few months, every log entry must be examined — there's no rare value to isolate, no filter to push down, no partition to skip.

Pre-aggregated summaries could help if you know in advance exactly how users will slice their data. But general-purpose log management systems can't predict every analytical angle users will need. Full dataset scans are unavoidable.

For these cases, the only viable solution is brute-force compute: massive parallelism and high-performance processing to deliver results even when every record must be touched.

Bronto's approach: AWS Lambda for bursty workloads

To support demanding full-scan queries while keeping costs in check, Bronto uses AWS Lambda functions. Lambda enables high concurrency — large volumes of data stored in S3 can be processed in parallel, on demand, with no infrastructure to provision or manage in advance.
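The fan-out pattern can be sketched locally: split the query across data partitions, let independent workers compute partial results, then merge them. Here a thread pool stands in for concurrent Lambda invocations and in-memory lists stand in for S3 objects; both are assumptions for illustration, as are the partition contents. The query is the full-scan example above, maximum response time per endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each would be one or more S3 objects in practice
partitions = [
    [("/api/search", 120), ("/api/items", 45), ("/api/search", 310)],
    [("/api/items", 980), ("/api/login", 60)],
    [("/api/search", 95), ("/api/login", 410), ("/api/items", 30)],
]


def scan_partition(records):
    """Worker ('map') step: full scan of one partition, max latency per endpoint."""
    partial = {}
    for endpoint, response_ms in records:
        partial[endpoint] = max(response_ms, partial.get(endpoint, response_ms))
    return partial


def merge(partials):
    """Coordinator ('reduce') step: combine the per-partition maxima."""
    result = {}
    for partial in partials:
        for endpoint, value in partial.items():
            result[endpoint] = max(value, result.get(endpoint, value))
    return result


with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_results = list(pool.map(scan_partition, partitions))

print(merge(partial_results))
```

A maximum merges exactly from partial results, which is what makes this split-and-combine approach straightforward; aggregations such as distinct counts need mergeable data structures instead.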

The cost model is key: you only pay for compute time used. Even when running many functions concurrently, short execution times keep overall cost low. This makes it ideal for bursty, unpredictable workloads.

That said, Lambda isn't always the right tool. When query volume consistently exceeds a certain threshold, sustained compute options like AWS EC2 become more cost-effective. The right architecture uses both: Lambda for bursts, EC2 for the baseline.

High Cardinality

Log data frequently contains high-cardinality fields — client IP addresses, trace IDs, user IDs. Queries over these fields (e.g. counting unique IP addresses across a large dataset) can lead to slow performance, high memory consumption, and a poor user experience.

A naive solution is to cap the number of unique values the system handles — but that means users simply can't get value from their data beyond the cap.

A better approach: compute exact results up to a certain cardinality threshold, then switch to approximations when cardinality genuinely becomes too large to handle exactly. Several probabilistic data structures make this practical:

HyperLogLog — approximate distinct counts
Count-Min Sketch — approximate frequency counts
Cuckoo Filter — approximate set membership
Top-K — approximate top values by frequency

This approach keeps resource consumption bounded while still giving users meaningful, actionable insights from high-cardinality data.
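A minimal sketch of the exact-then-approximate idea for distinct counts: keep an exact set until a threshold, then fall back to a HyperLogLog estimate whose memory is fixed. This is a basic HyperLogLog without the bias corrections of production variants such as HyperLogLog++, and the class names, threshold, and precision are arbitrary choices for illustration:

```python
import hashlib
import math


class HyperLogLog:
    """Basic HyperLogLog with 2**p registers (no HLL++ bias correction)."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

    def add(self, value: str) -> None:
        h = int.from_bytes(hashlib.blake2b(value.encode(), digest_size=8).digest(), "big")
        index = h >> (64 - self.p)                     # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        estimate = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range correction
        return estimate


class DistinctCounter:
    """Exact distinct count up to `threshold`, then a bounded-memory estimate."""

    def __init__(self, threshold: int = 10_000, p: int = 14):
        self.threshold = threshold
        self.exact = set()
        self.sketch = HyperLogLog(p)
        self.approximate = False

    def add(self, value: str) -> None:
        self.sketch.add(value)  # always feed the sketch so switching loses nothing
        if not self.approximate:
            self.exact.add(value)
            if len(self.exact) > self.threshold:
                self.exact.clear()  # release memory; the sketch takes over
                self.approximate = True

    def count(self) -> int:
        return round(self.sketch.count()) if self.approximate else len(self.exact)


counter = DistinctCounter(threshold=10_000)
for i in range(100_000):
    counter.add(f"10.{i // 65536}.{(i // 256) % 256}.{i % 256}")  # 100,000 distinct IPs
print(counter.approximate, counter.count())
```

With 2^14 registers the typical relative error is around 1%, while memory stays fixed no matter how many unique IP addresses appear.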

Conclusion

This wraps up the three-part Log Management Cost Trap series. Across ingestion, storage, and search, the same theme emerges: design decisions in one layer constrain and shape what's possible in the others. Trade-offs are unavoidable, and navigating toward an optimal solution requires deep experience across all three.

Bronto brings 150+ years of combined experience in log management at scale, and has built that experience into a platform designed to be cost-efficient, high-performance, and ready for logging in the AI era.


The Log Management Cost Trap: Part II — Storage

Patrick Londa — Mon, 18 May 2026 21:07:26 +0000

Authored by Benoit Gaudin

In Part I of this series, I explored the challenges of designing, running, and managing a centralised log management solution, with a focus on data ingestion. In Part II, I focus on data storage. Part III covers search.

I'll discuss different storage types and how their characteristics can fulfil the requirements of log management solutions, how data is organised within these systems, and the role of file formats in enabling efficient ingestion, storage, and retrieval.

Storage Types

When evaluating storage options, the type of storage medium is the first decision to make. File systems and blob storage each come with distinct characteristics.

Disks and File Systems

File systems operate at a lower level of abstraction and often require explicit management of storage capacity, throughput, and IOPS. Managed services like AWS EFS and FSx simplify some of this — EFS, for example, supports automatic scaling of storage and throughput capacity.

One major advantage of file systems is the ability to append data to existing files. This is especially relevant in log management, where data is immutable and continuously streamed.

At Bronto, we leverage file systems for data aggregation — specifically their ability to append to files. Aggregation runs over a few hours before data is transferred to blob storage, so the storage footprint stays modest and cost-effective. This aggregation phase prevents small files from landing on blob storage, which is known to cause performance issues at query time.

Blob Storage

Blob storage is the popular choice for data analytics workloads due to scalability and cost-effectiveness. Unlike file systems, blob storage doesn't support appending — files must be rewritten entirely when modified.

The pricing model differs significantly: costs include both storage and per-transaction API operations (writes, reads). Overall, blob storage is more cost-efficient than remote disks for large, infrequently modified datasets.

Blob storage also supports extremely high throughput. AWS S3, for instance, enables massive parallel processing — making it ideal for data-intensive workloads like AWS EMR and AWS Athena.

The tradeoff: blob storage isn't well-suited for frequent appends or aggregations. Solutions like Datadog Husky and ClickHouse use compaction to address this — writing many small objects over time, then consolidating them into larger ones.
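The planning half of compaction can be sketched as grouping small objects, in write order, into batches close to a target object size; each batch would then be rewritten as one larger object. A minimal sketch; the object keys, file sizes, and 128 MB target are hypothetical, and the systems named above use more elaborate policies:

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class SmallFile:
    key: str          # hypothetical object key
    size_bytes: int


def plan_compaction(files: list[SmallFile], target_bytes: int) -> list[list[SmallFile]]:
    """Group files, preserving order, into batches whose total stays within target_bytes.

    A file that is already larger than the target ends up in a batch of its own.
    """
    batches: list[list[SmallFile]] = []
    current: list[SmallFile] = []
    current_size = 0
    for f in files:
        if current and current_size + f.size_bytes > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        batches.append(current)
    return batches


# Ten 30 MB files produced by short batch windows, compacted toward ~128 MB objects
small_files = [SmallFile(f"logs/2026-05-18/part-{i:04d}", 30 * MB) for i in range(10)]
plan = plan_compaction(small_files, target_bytes=128 * MB)
print([len(batch) for batch in plan])
```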

Bronto combines both: blob storage for long-term, large immutable files; file storage for short-term data aggregation. This balance optimises both performance and cost at scale.

File Formats and Data Organisation

File format alone doesn't determine query performance — how data is physically organised in storage matters just as much. Here are the key techniques.

Compression

Compression is essential at scale. The primary benefit is reduced storage footprint, translating directly into lower costs. At large volumes, the savings are substantial.

That said, maximum compression isn't always ideal. Higher compression ratios demand more CPU, memory, and time — increasing compute cost. The right point on the curve depends on your access patterns.
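To see the curve for yourself, the sketch below compresses the same synthetic log text at several zlib levels and reports the ratio and elapsed time. zlib is used only because it ships with Python's standard library; the post does not say which codec or level Bronto uses, and the log lines are made up for illustration.

```python
import time
import zlib

# Synthetic, highly repetitive log lines (made-up data for illustration).
raw = "\n".join(
    f'2026-05-19T18:{i % 60:02d}:00Z level=INFO service=checkout-api '
    f'msg="request processed" duration_ms={i % 250}'
    for i in range(20_000)
).encode("utf-8")

def measure(data: bytes, level: int) -> tuple[int, float]:
    """Compress `data` at a zlib level (1 = fastest, 9 = smallest); return (bytes, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(compressed) == data  # lossless round trip
    return len(compressed), elapsed

for level in (1, 6, 9):
    size, seconds = measure(raw, level)
    print(f"level={level} ratio={len(raw) / size:.1f}x time={seconds * 1000:.1f} ms")
```

The exact numbers depend on the data and the machine; the point is to compare how much extra ratio each step up in level buys against how much extra CPU time it costs for your own access patterns.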

Row-based vs. Column-based Formats

In row-oriented storage, all fields for each record are stored together sequentially. In column-oriented storage, all values for each field are stored together.

Row-oriented formats suit unstructured data with write-intensive workloads. But with the rise of structured logging and agents that annotate data with attributes, columnar formats have become increasingly relevant for log data — enabling much more efficient scans when you only need specific fields.
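A small illustration of the two layouts, using plain Python lists in place of a real columnar format such as Parquet (the records and field names are hypothetical). Pivoting rows into columns means a query over one field reads one list rather than every field of every record.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"ts": 1, "level": "INFO", "service": "api", "duration_ms": 12},
    {"ts": 2, "level": "ERROR", "service": "api", "duration_ms": 340},
    {"ts": 3, "level": "INFO", "service": "worker", "duration_ms": 45, "retry_count": 2},
]

def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot row records into one list per field; missing fields become None."""
    keys = sorted({key for record in records for key in record})
    return {key: [record.get(key) for record in records] for key in keys}

columns = to_columns(rows)

# Only the duration_ms column is touched to answer "which requests took > 100 ms?"
slow = [d for d in columns["duration_ms"] if d is not None and d > 100]
print(slow)                    # [340]
print(columns["retry_count"])  # [None, None, 2]
```

Attributes that only some records carry (like `retry_count` here) become sparse columns, which real columnar formats typically encode compactly.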

Partitioning

Partitioning divides large datasets into smaller segments so queries can skip irrelevant data entirely. The key is choosing a logical criterion for segmentation.

For log data, time-based partitioning is the natural choice — queries almost always specify a time range, so only the relevant time partition needs to be scanned. This dramatically reduces both the volume of data read and the cost of doing so, especially when data is retained over months or years.
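Time-based partition pruning can be sketched as a function that maps a query's time range to the hourly partitions it overlaps. The `dt=.../hour=.../` prefix layout is a common Hive-style convention used here as an assumption; it is not necessarily how Bronto names its partitions.

```python
from datetime import datetime, timedelta, timezone

def partition_key(ts: datetime) -> str:
    """Hypothetical Hive-style hourly prefix, e.g. 'dt=2026-05-19/hour=18/'."""
    return ts.strftime("dt=%Y-%m-%d/hour=%H/")

def partitions_for_range(start: datetime, end: datetime) -> list[str]:
    """Return only the hourly partitions that overlap the query window [start, end)."""
    if start >= end:
        return []
    keys = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        keys.append(partition_key(current))
        current += timedelta(hours=1)
    return keys

start = datetime(2026, 5, 19, 17, 30, tzinfo=timezone.utc)
end = datetime(2026, 5, 19, 19, 0, tzinfo=timezone.utc)
print(partitions_for_range(start, end))
# ['dt=2026-05-19/hour=17/', 'dt=2026-05-19/hour=18/']
```

With two years of retention there are roughly 17,520 hourly partitions; a 90-minute query like the one above reads two of them.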

Indexing

Indexes work like a book index: rather than reading the entire dataset to find a value, you consult the index to jump directly to where it lives.

Inverted indexes are especially effective for searching uncommon values across large datasets. The tradeoff is size — inverted indexes can grow as large as the original dataset in some cases, significantly increasing storage cost.
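A toy inverted index makes both points concrete: looking up a rare token touches a single posting list, but the index stores one entry for every (token, message) pair, which is why it can approach the size of the data. The tokenizer is a deliberately simple assumption; production engines use far richer analyzers.

```python
import re
from collections import defaultdict

def tokenize(message: str) -> set[str]:
    """Lower-cased word-like tokens (a simplistic analyzer, for illustration)."""
    return set(re.findall(r"[a-z0-9_\-]+", message.lower()))

def build_inverted_index(messages: list[str]) -> dict[str, set[int]]:
    """Map every token to the IDs (list positions) of the messages containing it."""
    index = defaultdict(set)
    for doc_id, message in enumerate(messages):
        for token in tokenize(message):
            index[token].add(doc_id)
    return dict(index)

logs = ["user login ok", "payment failed txn-8f3a", "user logout ok"]
index = build_inverted_index(logs)

print(sorted(index.get("txn-8f3a", set())))  # [1]
print(sorted(index["user"]))                 # [0, 2]

# Index size grows with the number of (token, message) pairs.
postings = sum(len(ids) for ids in index.values())
print(postings)  # 9
```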

Predicate Pushdown

Predicate pushdown evaluates filter conditions against file metadata or summary statistics, without downloading or inspecting full file contents. File formats like Parquet support this by storing column statistics (such as min/max values) for each row group.

If a file's statistics guarantee that a filter condition can't match any record in it, the entire file can be skipped. For datasets spread across many files, this can dramatically reduce both data transfer and compute cost.
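The skip decision itself is a range-overlap check. The sketch below assumes a hypothetical catalog holding one min/max pair per file for a numeric `status` column; it is not Bronto's metadata format.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Hypothetical per-file summary of one column, like the min/max a Parquet row group records."""
    path: str
    min_status: int
    max_status: int

def files_to_scan(files: list[FileStats], low: int, high: int) -> list[str]:
    """Return files whose [min, max] range overlaps the predicate low <= status <= high."""
    # If the ranges do not overlap, no record in the file can match, so the
    # file is skipped without being downloaded.
    return [f.path for f in files if f.max_status >= low and f.min_status <= high]

catalog = [
    FileStats("part-000", min_status=200, max_status=204),
    FileStats("part-001", min_status=200, max_status=503),
    FileStats("part-002", min_status=301, max_status=404),
]
print(files_to_scan(catalog, 500, 599))  # ['part-001']
```

Statistics can only rule files out: `part-001` still has to be read, and it may turn out to contain no 5xx records at all. With real Parquet files, libraries such as pyarrow expose the same min/max values through the file metadata.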

Bloom Filters

A Bloom filter is a probabilistic data structure that answers one question: is a value definitely not present, or possibly present, in a dataset?

When a file's Bloom filter returns "definitely not," the system skips that file entirely — no scan needed. Compared to inverted indexes, Bloom filters are smaller and more lightweight. They don't pinpoint exact data locations, but they're highly effective at eliminating irrelevant files before any data is transferred.
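A minimal Bloom filter, sketched under the assumption of one filter per file built at write time. The bit count, the number of hashes, and the SHA-256 double-hashing scheme are illustrative choices, not parameters from the post.

```python
import hashlib

class BloomFilter:
    """m bits, k positions per value, derived by double hashing a SHA-256 digest."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Odd step: with a power-of-two bit count, the k positions are all distinct.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        """False means 'definitely not present'; True means 'possibly present'."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

# Built once per file from the request IDs it contains.
file_filter = BloomFilter()
for request_id in ["abc-123", "def-456", "ghi-789"]:
    file_filter.add(request_id)

print(file_filter.might_contain("abc-123"))  # True (no false negatives)
```

At query time, any file whose filter returns False for the searched value is skipped; a True answer only means the file must actually be read.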

Dictionary Encoding

Dictionary encoding optimises storage and search for fields whose values have low cardinality: country names, log levels, environment tags, and so on. Instead of storing the full value in every row, each row stores a compact code, and the distinct values are kept once in a separate dictionary.

This reduces storage size and enables a query optimisation: if a query filters on a value that doesn't appear in a file's dictionary, no row in that file can match, so that file's entire column can be skipped.
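Both halves of the idea fit in a few lines. This is a minimal sketch on a hypothetical `level` column; real formats also bit-pack the codes, which is omitted here.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Return (dictionary of distinct values, per-row integer codes into it)."""
    dictionary: list[str] = []
    code_of: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def can_skip(dictionary: list[str], wanted: str) -> bool:
    """A filter like level = 'FATAL' cannot match any row if 'FATAL' is absent from the dictionary."""
    return wanted not in dictionary

levels = ["INFO", "INFO", "WARN", "INFO", "ERROR", "INFO"]
dictionary, codes = dictionary_encode(levels)
print(dictionary)                     # ['INFO', 'WARN', 'ERROR']
print(codes)                          # [0, 0, 1, 0, 2, 0]
print(can_skip(dictionary, "FATAL"))  # True
```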

Conclusion

Developing a storage strategy for a large-scale log management system demands deep expertise and a clear understanding of data ingestion and access patterns. The choices made at the storage layer directly shape what's possible — and what it costs — at the ingestion and search layers.

Bronto combines file storage for aggregation and blob storage for long-term retention, and borrows techniques from databases and analytics engines — partitioning, Bloom filtering, predicate pushdown, and dictionary encoding — to achieve high search performance at low cost.

In Part III, I'll focus on the approaches and economics of search, and detail how Bronto uses AWS Lambda to provide a fast, cost-effective way to process large volumes of data stored in S3.

See How Bronto Handles This

The Log Management Cost Trap: Ingestion

Patrick Londa — Mon, 18 May 2026 13:14:29 +0000

Authored by Benoit Gaudin

For systems with low log data volumes, self-hosting open-source solutions or using SaaS free plans are often excellent starting points. But as data volume inevitably grows, the complexity and costs associated with these solutions often become unviable.

This post is for you if your logging costs have risen to a point where you're hesitant to send more data, or are excluding certain sources because of what they'd cost to ingest. At that point you're typically faced with two options: invest resources to reduce costs within your existing solution (reducing retention, archiving data, etc.), or build your own logging system for better cost control.

For centralised log management systems, the sheer volume of data and its unstructured nature are typically the biggest factors driving cost and complexity. I break these challenges down into three key areas:

Ingesting large volumes of data
Storing large volumes of data
Querying large volumes of data

These challenges are closely related — design decisions in one area directly impact the others. This post focuses on ingestion. Storage and search will be tackled in follow-up posts.

Ingestion

Ingestion is the part of the system that receives data and processes it to make it searchable. Because of the volumes involved, log management solutions share many similarities with data analytics engines like Hadoop or Spark — but with one critical difference: data must be searchable in real time, or with minimal delay.

This freshness requirement exists because log management supports urgent troubleshooting use cases. In a production incident, engineers need access to logs from the last few minutes immediately — they can't wait for data to be batched. At the same time, other use cases (like browser version analysis across months of traffic) don't require fresh data at all.

Because log management must support both real-time troubleshooting and analytical queries over large historical datasets, it can't rely solely on off-the-shelf analytics platforms. The ingestion pipeline has to be designed with both speed and scale in mind.

Reliability

Upon receiving data, the system must acknowledge its reception and ensure it is durably stored, so nothing is lost. Mechanisms like data buffering must be in place to handle temporary downstream issues gracefully.

Apache Kafka is an effective and commonly used solution for data buffering at scale, integrated into many log management solutions including ELK, Datadog, and Honeycomb. A Kafka layer in the ingestion pipeline allows the system to absorb temporary processing impediments without data loss.
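The ordering that buffering buys can be sketched with an in-memory queue standing in for Kafka. This is a toy under loud assumptions, not Kafka's API: it shows acknowledging only after an event is buffered, and retrying transient downstream failures so that every event is either processed or handed back, but unlike Kafka it offers no persistence or replication.

```python
import queue

# In-memory stand-in for a durable buffer such as a Kafka topic (illustration only).
buffer: queue.Queue = queue.Queue(maxsize=10_000)

def receive(event: str) -> str:
    """Acknowledge the sender only after the event has been accepted by the buffer."""
    buffer.put(event, timeout=1.0)  # raises queue.Full if the buffer stays saturated
    return "accepted"

def drain(process, max_attempts: int = 3) -> list[str]:
    """Consume buffered events, retrying transient failures instead of dropping data.

    Returns events that still failed after `max_attempts`, so they can be
    re-queued or parked rather than silently discarded.
    """
    failed = []
    while not buffer.empty():
        event = buffer.get()
        for _ in range(max_attempts):
            try:
                process(event)
                break
            except OSError:
                continue  # transient failure: try again
        else:
            failed.append(event)
    return failed

# Usage: a hypothetical downstream indexer fails once, then recovers.
indexed: list[str] = []
attempts = {"count": 0}

def flaky_indexer(event: str) -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise OSError("indexer temporarily unavailable")
    indexed.append(event)

receive("log line 1")
receive("log line 2")
print(drain(flaky_indexer), indexed)  # [] ['log line 1', 'log line 2']
```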

That said, efficient Kafka cluster management requires real expertise. Even with managed cloud offerings like AWS MSK, the overhead can be substantial and costly at large data volumes.

Indexing and Partitioning

When ingesting log data, how you organise it in the backend directly determines how it can be searched later. Two main approaches exist:

Index-based

Systems like Elasticsearch and OpenSearch build indexes that point to exact locations of relevant data. This offers good search performance but typically requires extracting key-value pairs from logs (e.g. via Logstash in the ELK stack) — and the index itself can grow to a significant size.

Partition-based

No index is involved. Instead, data is organised so that large portions can be skipped entirely at query time. Most log management solutions partition by time range, since log data is timestamped and queries almost always specify a time window.

Some solutions go further and partition on attributes beyond time; Grafana Loki and AWS Athena are good examples. Athena queries data stored on S3, where each partition typically maps to its own prefix, so a query can skip partitions outside its filter instead of scanning the full dataset.

The hybrid approach

Relying on indexing alone is expensive, because building indexes is a heavy task. Partitioning alone may not narrow the dataset enough. Datadog Husky uses a hybrid approach, and at Bronto we believe this is the right pattern: it provides multiple levers for tuning performance and cost independently.

Append-only and Compaction

Two competing requirements shape how data gets written:

Fresh data must be available to search quickly — ideally within seconds — meaning it must be written in small increments
Large historical datasets must be searchable efficiently, which favours large files and batch-oriented access patterns

Writing lots of small files creates the classic "small files problem" in analytics workloads: many parallel compute units each making small network requests, which kills throughput. Two techniques address this:

Compaction

Used by Datadog Husky and ClickHouse, among others. Data is first stored in small units, then consolidated into larger ones over time. Because small objects exist only for recent data, historical queries still read mostly large, compacted files.
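A greedy merge of time-ordered small objects into objects of roughly a target size captures the basic idea. This is a generic sketch, not the compaction policy of Husky or ClickHouse, both of which are considerably more involved.

```python
def compact(objects: list[bytes], target_bytes: int) -> list[bytes]:
    """Merge consecutive small objects into larger ones of at most ~target_bytes.

    Inputs are assumed to be in time order; after the merged objects are written,
    the small originals could be deleted.
    """
    merged: list[bytes] = []
    current = bytearray()
    for obj in objects:
        # Start a new output object if adding this one would overshoot the target.
        if current and len(current) + len(obj) > target_bytes:
            merged.append(bytes(current))
            current = bytearray()
        current.extend(obj)
    if current:
        merged.append(bytes(current))
    return merged

small_objects = [b"a" * n for n in (40, 25, 60, 10, 80, 30)]
print([len(m) for m in compact(small_objects, 100)])  # [65, 70, 80, 30]
```

Every byte is written at least twice (once small, once merged), which is part of the cost compaction trades for faster historical reads.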

Append-only

Data is incrementally added to a growing unit. Easy on a file system, but problematic with object stores like AWS S3 — where appending isn't possible and the entire object must be rewritten on every update. This impacts both performance and ingestion cost.
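The cost of "appending" to an object store grows quickly because every append re-uploads the whole object. The toy store below only mimics whole-object writes; it is not the S3 API.

```python
class ImmutableObjectStore:
    """Toy stand-in for a blob store: objects are only ever written whole (no append call)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
        self.bytes_uploaded = 0
        self.put_requests = 0

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body
        self.bytes_uploaded += len(body)
        self.put_requests += 1

    def get(self, key: str) -> bytes:
        return self.objects.get(key, b"")

def naive_append(store: ImmutableObjectStore, key: str, chunk: bytes) -> None:
    """'Append' via read-modify-write: download the object, extend it, re-upload all of it."""
    store.put(key, store.get(key) + chunk)

store = ImmutableObjectStore()
for _ in range(100):
    naive_append(store, "logs/current", b"x" * 1_000)  # 100 appends of 1,000 bytes

print(len(store.get("logs/current")))  # 100000 bytes of actual data
print(store.bytes_uploaded)            # 5050000 bytes uploaded in total
```

Uploaded bytes grow quadratically with the number of appends (here 1,000 × 100 × 101 / 2), on top of one PUT and one growing GET per append.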

Despite that limitation, object stores are cost-efficient for long-term storage and well-suited to high-parallelism search access.

Bronto's approach

We implemented a two-tier storage solution: data is first appended to local files, making it immediately available to the search engine; once a file reaches a suitable size, it's uploaded to an object store. This avoids compaction entirely while still keeping fresh data searchable.
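The two-tier pattern can be sketched as a writer that appends to a local segment file and ships the whole file once it crosses a size threshold. The `upload` callback, the segment naming, and the 1,000-byte threshold are hypothetical; the post does not describe Bronto's actual file layout or roll policy.

```python
import os
import tempfile
from typing import Callable

class TwoTierWriter:
    """Append events to a local segment file; upload the file once it is large enough.

    `upload(name, body)` is a hypothetical callback standing in for an object-store PUT.
    """

    def __init__(self, directory: str, roll_bytes: int, upload: Callable[[str, bytes], None]) -> None:
        self.directory = directory
        self.roll_bytes = roll_bytes
        self.upload = upload
        self.sequence = 0

    @property
    def path(self) -> str:
        return os.path.join(self.directory, f"segment-{self.sequence:06d}.log")

    def append(self, line: str) -> None:
        # Tier 1: a cheap local append, readable by a search process right away.
        # (Reopening per call keeps the sketch short; a real writer would keep the handle open.)
        with open(self.path, "ab") as f:
            f.write((line + "\n").encode("utf-8"))
        if os.path.getsize(self.path) >= self.roll_bytes:
            self._roll()

    def _roll(self) -> None:
        # Tier 2: one whole-object upload per segment, so nothing needs compacting later.
        with open(self.path, "rb") as f:
            self.upload(os.path.basename(self.path), f.read())
        os.remove(self.path)
        self.sequence += 1

uploaded: dict[str, bytes] = {}
with tempfile.TemporaryDirectory() as tmp:
    writer = TwoTierWriter(tmp, roll_bytes=1_000, upload=uploaded.__setitem__)
    for i in range(200):
        writer.append(f"event {i:05d}")  # 12 bytes per line including the newline

print(sorted(uploaded))  # ['segment-000000.log', 'segment-000001.log']
```

A production version also has to survive a node failing before its segment is uploaded (for example by replaying from an upstream buffer); how Bronto handles that is not covered in the post.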

Conclusion

Log management solutions are designed to handle vast amounts of unstructured data — a task that introduces significant cost and complexity. They must serve conflicting use cases: real-time troubleshooting that demands fresh data immediately, and analytical queries that demand efficient access to large historical datasets.

At scale, choosing how to ingest data requires careful attention to the trade-offs between reliability, performance, cost, and system complexity. The expertise required to design, implement, and maintain this pipeline is substantial — and that's before accounting for storage and search.

Subsequent posts will cover those remaining challenges. In the meantime, if your logging costs are already a problem worth solving, it's worth understanding what's driving them at each layer.

See How Bronto Handles This

Build Your Own Telemetry UI Using Lovable & Bronto

Patrick Londa — Fri, 15 May 2026 16:27:48 +0000

Authored by Feargal Karney & Mati Remi

The Bronto REST API now exposes everything our own UI is built on. That means you can build a custom interface tailored exactly to your team's workflow, rather than having to use a general-purpose interface someone else designed.

To make it easy to get started, we've published a baseline project on Lovable that you can remix into your own workspace. For those who haven't used it, Lovable is an AI-powered frontend builder (think Figma meets Claude Code, but for shipping real React apps). It reached $100M ARR in just 8 months.

Here's how to create your very own BrontoVibe project.

Getting Started in 3 Steps

Step 1: Create a free account or log into Lovable, then open the template here.

Step 2: Get an API Key from your Bronto account under Settings → API Keys → Add API Key (API Full Access Role).

Step 3: Remix the project and get prompting!

That's it. No backend to deploy, no auth to configure, no infrastructure to manage.

What the Baseline Project Covers

Search & Explore

Query via SQL or LogQL, view raw events, or plot them on a timeseries. A solid starting point you can extend however you need.

Tracing

Find errors and performance issues, drill into spans across services. One click from any trace takes you to the correlated raw log data — useful when you're mid-incident and need context fast.

Dashboards

View your existing dashboards or ask Lovable to generate new widgets. This is where the AI-builder angle gets interesting.

Usage

Ingestion and search usage broken down, so you always know where your volume is going.

Ingestion Methods

You can send data via:

Quick Ingest — raw log paste for fast testing
Agents — Fluent Bit, Vector, Datadog Agent, OpenTelemetry Collector
Integrations — Akamai, PagerDuty, and more

See the full list in the Bronto integrations docs.

We look forward to seeing what you build!

Remix the BrontoVibe Template

Logging & Observability Best Practices from Bronto

Patrick Londa — Fri, 15 May 2026 13:39:40 +0000

Authored by Conall Heffernan

Centralized logging is a good first step in improving your log management: it collects, stores, and analyzes logs from multiple sources in a single repository. That makes logs easier to manage and access for dev, support, product, and SRE teams, and makes security and compliance requirements easier to meet.

Having centralized your logs, the practices below will take you further. High-quality logs are the foundation of effective observability. Consistent, structured, and well-tagged log data allows teams to quickly identify performance issues, troubleshoot errors, and optimize cost and performance.

If AI is where intelligence meets data, then data quality is key in an AI world.

As AI automates more and more work, clean, high-quality logs open the door to further automation and efficiencies, enabling additional benefits and new AI use cases.

This guide covers recommended best practices for log structure and context enrichment, correlation, agent configuration, team ownership, and log strategy.

1. Log Structure and Context

Tags, log metadata, and message attributes are all key–value pairs (KVPs), but they serve different purposes and live at different levels of your event stream:

Tags – Properties that apply to an entire stream of events (a dataset)
Log metadata – Properties added to individual log records, typically by the logging agent or its plugins
Message attributes – Properties embedded directly in the log message itself

Tags: Properties of the Dataset

Tags apply to all entries in a stream of events and are not visible as part of the log event itself. They are ideal for separating environments at query time (e.g. to avoid mixing staging and prod data).

Examples of good tags:

environment=production
account_id=12345678
region=us-east-1

Set tags via agent configuration so they are applied automatically to all data processed by that agent. Configuration management tools such as Terraform or CloudFormation can set these tags consistently across your infrastructure.

Log Metadata: Properties of the Source

Log metadata consists of key–value pairs associated with a specific log record, typically added by the agent (often via plugins) rather than by the application itself. It usually describes:

The host or node — e.g. host_name=web-01, os=linux
The pod or container — e.g. pod_name=api-6c8d3f5c2f-wz2vt, namespace=payments
The service name and version — e.g. service=checkout-api, version=2.3.1

A key point: a single agent can process data from multiple hosts, pods, services, or versions, and the metadata will reflect those differences on a per-record basis.

Message Attributes: Properties Inside the Log Message

Message attributes are key–value pairs present inside the log message body itself, authored by application developers and specific to a single log entry. They're ideal for capturing fine-grained, per-request context:

{"level":"info","message":"request processed","duration_ms":123}

Common examples:

duration_ms=123
request_id=abc-123
retry_count=2

Two formats are supported out of the box:

The entire message follows JSON format
key=value format within the log message (values may be quoted; : can be used instead of =)
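A minimal parser for the two formats, written from the description above. The exact rules (which characters a key may contain, how quotes are escaped) are our assumptions, not Bronto's parser.

```python
import json
import re

# key, then '=' or ':', then either a double-quoted value (with \" escapes) or a bare token.
KV_PATTERN = re.compile(r'([A-Za-z_][\w.\-]*)[=:]("(?:[^"\\]|\\.)*"|[^\s,]+)')

def parse_attributes(message: str) -> dict[str, object]:
    """Extract message attributes from a JSON body or from key=value pairs."""
    stripped = message.strip()
    if stripped.startswith("{"):
        try:
            body = json.loads(stripped)
            if isinstance(body, dict):
                return body
        except json.JSONDecodeError:
            pass  # not valid JSON after all; fall back to key=value parsing
    attributes: dict[str, object] = {}
    for key, value in KV_PATTERN.findall(stripped):
        if value.startswith('"'):
            value = value[1:-1].replace('\\"', '"')  # strip quotes, unescape \"
        attributes[key] = value
    return attributes

print(parse_attributes('request processed duration_ms=123 request_id="abc 123" retry_count:2'))
# {'duration_ms': '123', 'request_id': 'abc 123', 'retry_count': '2'}
```

Values from key=value text come back as strings, while JSON keeps its native types; a real pipeline would typically also infer numeric types for the former.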

Note: Indexing is automatic in modern logging platforms — manually managing and configuring indexes is a time-consuming and cumbersome task you shouldn't need to do.

Exception and Stack Trace Handling

Use agent-side multiline support (e.g., the Fluent Bit multiline filter) to capture stack traces as single log events
Report the exception type and stack trace as structured attributes:

exception.type
exception.stacktrace

This makes it easy to query and alert on recurring or unexpected exceptions.
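On the application side, a sketch with Python's standard `logging` module, assuming JSON-per-line output. The attribute names `exception.type` and `exception.stacktrace` come from the list above (they match OpenTelemetry's semantic conventions); the logger name is made up.

```python
import io
import json
import logging
import traceback

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, with exception details as separate attributes."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc_value, exc_tb = record.exc_info
            payload["exception.type"] = exc_type.__name__
            # The whole trace becomes one string field, so it stays a single log event.
            payload["exception.stacktrace"] = "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)
            )
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo.exceptions")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

try:
    1 / 0
except ZeroDivisionError:
    logger.exception("payment calculation failed")

print(stream.getvalue())  # one JSON line; newlines inside the trace are escaped
```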

2. Correlation

Trace and Correlation IDs

Add fields like trace_id, span_id, and request_id to your logs so you can tie them back to a single user request or workflow across multiple services. In a distributed system, a single call can pass through frontends, APIs, queues, and background workers — without a shared ID, the logs from each hop look like isolated events.

With a common ID, you can filter on that value and reconstruct the full timeline of "what happened where and when," instead of guessing based on timestamps and hosts.

How to add them — it's usually a combination of code and tooling:

A tracing library or standard (such as OpenTelemetry) generates and propagates trace and span context across service boundaries. Most logging frameworks can be configured to automatically include those IDs on every log entry.
At the same time, use an application-level request_id or correlation ID (often taken from or added to an HTTP header at the edge) and pass it through your services.

A robust setup does both: propagate tracing context (trace_id, span_id) as well as an application-level request_id, and ensure all of them are consistently present in logs so any logging or observability system can correlate events end-to-end.
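The request_id half can be sketched with the standard library alone: a `contextvars` variable holds the current ID and a logging filter stamps it onto every record. The logger name and format are hypothetical; for trace_id and span_id, an OpenTelemetry SDK and its logging integration would normally supply the values, which is not shown here.

```python
import contextvars
import io
import logging
import uuid
from typing import Optional

# Correlation ID for the request currently being handled (scoped per context).
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Copy the current request_id onto every record that passes through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

correlation_stream = io.StringIO()
handler = logging.StreamHandler(correlation_stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s"))
logger = logging.getLogger("demo.correlation")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

def handle_request(incoming_request_id: Optional[str]) -> None:
    # Reuse the ID set at the edge (e.g. read from an HTTP header) or mint a new one.
    token = request_id_var.set(incoming_request_id or str(uuid.uuid4()))
    try:
        logger.info("charging card")
        logger.info("order confirmed")
    finally:
        request_id_var.reset(token)

handle_request("abc-123")
print(correlation_stream.getvalue())
# INFO request_id=abc-123 charging card
# INFO request_id=abc-123 order confirmed
```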

3. Agent Configuration & Processing

The OpenTelemetry Collector and similar agents like Fluent Bit, Logstash, and Vector can enrich, sanitize, and optimize log data before it ever reaches storage.

Recommended Configurations

Redact PII before logs leave your infrastructure. Mask or drop fields like emails, full names, IPs, IDs, and tokens at the agent or collector level — so even if logs are leaked or shared, sensitive data isn't exposed.
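Agents and collectors usually express redaction as configuration; the Python function below only sketches the underlying logic, for example as a custom filter step. The three patterns are illustrative assumptions that will miss some PII and occasionally over-match, so treat them as a starting point rather than a complete rule set.

```python
import re

# Illustrative patterns only; production redaction needs broader, tested rules.
REDACTIONS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ipv4>"),
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer <token>"),
]

def redact(line: str) -> str:
    """Mask emails, IPv4 addresses, and bearer tokens before the line leaves the host."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line

print(redact("login ok user=jane.doe@example.com ip=203.0.113.7 auth=Bearer eyJhbGciOi"))
# login ok user=<email> ip=<ipv4> auth=Bearer <token>
```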

Configure multiline stacktrace handling so full exceptions are captured as a single log event instead of being split into many noisy lines. This typically means using a multiline rule that continues a record while lines match patterns like ^\s+at.
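The continuation rule can be sketched as follows. The `^\s+at` pattern is taken from the text; adding Java's `... N more` and `Caused by:` lines is our assumption about what a typical rule set also needs.

```python
import re

# Continuation lines, Java-style: indented "at ..." frames, "... N more", and "Caused by:".
CONTINUATION = re.compile(r"^(?:\s+at\s|\s+\.\.\. \d+ more|Caused by:)")

def group_multiline(lines: list[str]) -> list[str]:
    """Join continuation lines onto the event that precedes them."""
    events: list[str] = []
    for line in lines:
        if events and CONTINUATION.match(line):
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events

raw_lines = [
    "2026-05-19 18:26:09 ERROR charge failed java.lang.IllegalStateException: card declined",
    "    at com.example.Payments.charge(Payments.java:42)",
    "    at com.example.Checkout.run(Checkout.java:17)",
    "2026-05-19 18:26:10 INFO retry scheduled",
]
events = group_multiline(raw_lines)
print(len(events))  # 2
```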

Normalize log levels before shipping. If you don't, breakdowns by log level in dashboards will look fragmented — instead of a clean INFO / WARN / ERROR, you'll see multiple tiny buckets like info, Info, INFO, error, and ERR that all mean the same thing.
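Normalization is essentially a lookup table. The mapping below (including folding `crit`/`critical` into `FATAL`) is one possible hypothetical convention; unknown values are upper-cased rather than dropped so they stay visible in breakdowns.

```python
CANONICAL_LEVELS = {
    "trace": "TRACE",
    "debug": "DEBUG",
    "info": "INFO", "information": "INFO",
    "warn": "WARN", "warning": "WARN",
    "err": "ERROR", "error": "ERROR",
    "crit": "FATAL", "critical": "FATAL", "fatal": "FATAL",
}

def normalize_level(raw: str) -> str:
    """Map spelling and case variants onto one canonical set; keep unknown values visible."""
    cleaned = raw.strip()
    return CANONICAL_LEVELS.get(cleaned.lower(), cleaned.upper())

print([normalize_level(v) for v in ["info", "Info", "INFO", "error", "ERR", "Warning"]])
# ['INFO', 'INFO', 'INFO', 'ERROR', 'ERROR', 'WARN']
```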

Use batch and memory limiter processors (for example with OTel):

Processor	What it does
`batch`	Groups spans/logs/metrics into batches, improves throughput, reduces overhead
`memory_limiter`	Puts a hard cap on memory usage, drops data or throttles when usage exceeds thresholds

Strike a balance: let agents fix inconsistencies in third-party logs, but rely on developers to structure first-party logs correctly.

4. Team Practices & Ownership

Why It Matters

Logging is not just a technical setup — it's a shared responsibility across teams. Establishing clear ownership early ensures that logs are consistent, searchable, and actionable throughout your organization's lifecycle. It also makes it clear who is accountable for volume control (for example, leaving DEBUG on in production).

Best Practices

Assign team ownership from day one. Each dataset or service should have a defined owning team responsible for log quality, metadata, and alerting setup. This avoids confusion later when troubleshooting or optimizing costs.

Tag logs by team. Include a team or owner tag in metadata or agent configuration. This enables your logging platform to group logs, usage metrics, and cost by responsible team automatically — particularly useful when understanding volume spikes. Set up usage alerts so a given team is notified if their volumes suddenly go off the charts.
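Once volume is grouped by a team tag, a usage alert can be a simple comparison against each team's recent baseline. This sketch assumes you can export daily ingested volume per team; the 3× threshold and the numbers are hypothetical.

```python
from statistics import mean

def volume_alerts(daily_volume_by_team: dict[str, list[int]], factor: float = 3.0) -> list[str]:
    """Flag teams whose latest daily volume exceeds `factor` x their trailing average."""
    alerts = []
    for team, history in daily_volume_by_team.items():
        if len(history) < 2:
            continue  # not enough history for a baseline
        *baseline, today = history
        if today > factor * mean(baseline):
            alerts.append(team)
    return alerts

usage = {
    "payments": [10, 12, 11, 95],  # GB/day; perhaps DEBUG was left on in production
    "search": [40, 42, 39, 44],
}
print(volume_alerts(usage))  # ['payments']
```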

Encourage collaboration through shared queries. Make it a habit for teams to share saved queries, dashboards, and monitors. Common examples:

"Error spikes by environment"
"Token usage per service"
"Slowest response patterns over 24h"

Shared queries reduce duplication and foster best-practice discovery internally.

Use team-based datasets. Group data logically — by service ownership rather than by underlying infrastructure — so each team can monitor the performance, health, and behavior of their own services without noise from unrelated systems.

Make accountability visible. Use tags and naming conventions that make ownership clear:

team=payments
service=checkout-api
env=prod

Pro Tip: Building a strong observability culture that promotes best practices early creates long-term efficiency. Teams that own their data from the start rarely need a cleanup project later.

5. Log Types and Strategy

Define what types of logs your organization will collect and how they'll be categorized:

Type	Examples	Notes
Application	Custom app logs	Owned by dev teams
Third-party services	Kafka, NGINX, Redis	Semi-structured; normalize via agents or auto-parser
Infrastructure	syslog, journald	Often managed by SREs
Cloud	AWS, GCP, Azure	Forwarding integration needed; can be high volume (CloudTrail, Load Balancer logs)
Security	CloudTrail, auditd	Coordinate with SecOps/SIEM
CI/CD	Pipeline events	Great for trend correlation

Pro Tip: Review overlap between application and infrastructure logs to avoid duplication and unnecessary ingestion usage. If your app logs request_id, user_id, status, and latency, and NGINX/syslog already records status and latency, keep those fields in one layer and use request_id to correlate — instead of ingesting the same details twice.

Wrapping Up

Good logging is a discipline, not a one-time setup. The combination of structured data, consistent metadata, proper correlation IDs, well-configured agents, and clear team ownership is what separates logs that collect dust from logs that actively drive engineering decisions.

Start with structure, assign ownership early, and build the habit of sharing queries and dashboards across teams. Your future self — debugging a production incident at 2am — will thank you.

Give Bronto a Try
