Forem: ruth mhlanga

The Silent Scalability Bottleneck

ruth mhlanga — Fri, 22 May 2026 22:00:33 +0000

The Problem We Were Actually Solving

We had just launched a new live scoring feature for an online sports tournament, and the response was overwhelming. Thousands of concurrent users were flooding our system, causing lag we couldn't absorb. The root cause was a classic one: our event processing service was running out of CPU, bottlenecking the entire pipeline. We knew we needed to scale, but our current configuration was holding us back.

What We Tried First (And Why It Failed)

During the initial design phase, we opted for a simple, batch-oriented approach. We thought it would be easy to scale by just adding more instances, but we were wrong. Our batch window was 10 seconds, so every event waited for the next batch, and the queue kept growing whenever more events arrived in a window than one batch could process. When traffic surged, the system was overwhelmed and latency shot through the roof. We tried to boost the processing power, but that only led to wasted resources and higher costs. Replicating the service across multiple AZs only masked the problem temporarily.
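
To make the queue buildup concrete, here is a minimal Python sketch of a fixed batch window. The arrival rate and batch capacity are made-up numbers for illustration, not measurements from our system, and `backlog_after` is a hypothetical helper.

```python
def backlog_after(windows, arrival_rate, window_seconds, batch_capacity):
    """Return the queue depth left over after each batch window."""
    backlog = 0
    history = []
    for _ in range(windows):
        # Events that arrived while waiting for the next batch to start.
        backlog += arrival_rate * window_seconds
        # One batch run drains at most batch_capacity events.
        backlog -= min(backlog, batch_capacity)
        history.append(backlog)
    return history


# Hypothetical surge: 1,000 events/s against a batch that clears 8,000 events.
print(backlog_after(3, arrival_rate=1000, window_seconds=10, batch_capacity=8000))
# -> [2000, 4000, 6000]
```

Once arrivals per window exceed what a batch can drain, the backlog grows by the same amount every window, and adding instances only helps if it raises the per-window capacity above the arrival volume.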

The Architecture Decision

We decided to shift towards a more streaming-oriented architecture, using Apache Kafka as the message broker and Apache Flink as the event processing engine. We implemented a distributed architecture, where the service is running across multiple instances, each handling a portion of the load. We also introduced a rate limiter to prevent overloading the system during traffic surges. This change allowed us to scale more efficiently and handle the load without overwhelming our resources.
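
The rate limiter itself is not described in detail here, so the following is only a sketch of one common technique, a token bucket. The class name, parameters, and injectable clock are assumptions made for illustration, not our actual implementation.

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that passed, never beyond the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


limiter = TokenBucket(rate=100, capacity=200)  # hypothetical limits
if not limiter.allow():
    pass  # e.g. reject or delay the event instead of forwarding it
```

The bucket size bounds how large a burst gets through, while the refill rate bounds the sustained load that reaches the processing service.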

What The Numbers Said After

After the migration, our average pipeline latency decreased from 40 seconds to 5 seconds, and our query cost dropped by 75%. Our 95th percentile latency also improved significantly, from 120 seconds to 15 seconds. We achieved these improvements while keeping our costs relatively low, with a 25% reduction in resource utilization.
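
Average and 95th percentile latency answer different questions, which is why both are worth tracking. A small sketch using the nearest-rank method; the sample latencies are invented for illustration and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented sample: most requests are fast, a few are very slow (seconds).
latencies = [5] * 18 + [120] * 2
mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
print(mean, p95)
```

In this invented sample the mean looks moderate while the 95th percentile exposes the slow tail that users actually feel.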

What I Would Do Differently

In hindsight, I would have prioritized a streaming-oriented architecture from the start, given the nature of our workload. If I were to do it again, I would also focus on designing a more robust monitoring and alerting system to catch these scalability issues earlier. Additionally, I would have implemented a more granular rate limiter to prevent overloading the system during traffic spikes. With these changes, we could have avoided the bottleneck and delivered a better experience for our users.

My Treasure Hunt Engine Was a Configuration Nightmare Until I Faced the Music on Event-Driven Architecture

ruth mhlanga — Fri, 22 May 2026 21:25:56 +0000

The Problem We Were Actually Solving

I still remember the day our team decided to build a treasure hunt engine for Hytale, a game that has been making waves in the gaming community. The idea was simple: create an engine that could generate treasure hunts on the fly, with varying levels of difficulty and complexity. However, as we dove deeper into the project, we realized that the real challenge lay in designing an event-driven architecture that could handle the sheer volume of player interactions. We were dealing with a system that needed to process thousands of events per second, and our initial design was not equipped to handle this scale. Our first attempt at building the engine used a traditional request-response architecture, which quickly became a bottleneck as the number of players increased.

What We Tried First (And Why It Failed)

Our initial approach was to use a monolithic architecture, where a single server handled all the game logic, including treasure hunt generation and player interaction. We used a relational database to store the game state, and a message queue to handle the communication between the game server and the client. However, as the number of players grew, our server started to struggle to keep up with the load. We were experiencing latency issues, with players complaining about delayed responses to their actions. Our database was also becoming a bottleneck, with query times increasing exponentially as the game state grew in complexity. We tried to optimize the database queries, but it soon became apparent that our architecture was fundamentally flawed. We were trying to force a synchronous architecture to handle an inherently asynchronous problem.

The Architecture Decision

After much debate and discussion, we decided to switch to an event-driven architecture, using Apache Kafka as our message broker. We broke down the game logic into smaller, independent services, each responsible for a specific aspect of the game. We used a microservices architecture, with each service communicating with the others through Kafka topics. This allowed us to scale each service independently, and handle the high volume of events that our game was generating. We also switched to a NoSQL database, using MongoDB to store the game state. This allowed us to handle the high volume of data that our game was generating, and provided us with the flexibility to adapt to changing game requirements.
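
The key property we relied on is that each service reads the same topic at its own pace. The sketch below imitates that with an in-memory stand-in. It is not the Kafka client API, and the class, topic, and group names are hypothetical.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy append-only log per topic, with one read offset per consumer group."""

    def __init__(self):
        self._logs = defaultdict(list)     # topic -> list of events
        self._offsets = defaultdict(int)   # (group, topic) -> next index to read

    def publish(self, topic, event):
        self._logs[topic].append(event)

    def poll(self, group, topic, max_events=10):
        """Return the next unread events for this group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_events]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = InMemoryBroker()
broker.publish("treasure-found", {"player": "p1", "chest": 7})
broker.publish("treasure-found", {"player": "p2", "chest": 3})

# Two independent services consume the same events without affecting each other.
rewards_batch = broker.poll("rewards-service", "treasure-found")
leaderboard_batch = broker.poll("leaderboard-service", "treasure-found")
```

Because offsets are tracked per group, a slow service falls behind on its own offset instead of blocking the others, which is what makes independent scaling possible.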

What The Numbers Said After

After switching to the new architecture, average response times dropped from 500ms to 50ms, and the system could handle 10 times as many players as before. Average database query times fell from 100ms to 10ms. We achieved this without increasing our infrastructure costs, because the new architecture made more efficient use of our resources. We also improved our freshness SLA: the system now reflects changes to the game state in near real-time.

What I Would Do Differently

In hindsight, I would have started with an event-driven architecture from the beginning. I would also have used a more robust testing framework to confirm that the system could handle the event volume we expected, and invested more time in monitoring and logging so that we could identify and debug issues quickly. I would also have used a more efficient data storage solution, such as a graph database, for the game state, which would have let us query it more efficiently and reduced the load on our database. Overall, the treasure hunt engine taught us the importance of designing an architecture that is scalable, flexible, and able to handle high volumes of events. It also taught us the value of testing, monitoring, and logging, and the need to be prepared for changing requirements and unexpected issues.

Avoiding the Dark Cave of High Latency: A Cautionary Tale of Configuring Distributed Search

ruth mhlanga — Fri, 22 May 2026 20:25:45 +0000

The Problem We Were Actually Solving

When we first set out to build a high-performance, low-latency search system for our gaming community forums, I thought we were tackling a relatively straightforward problem. We needed a robust search solution that could handle a large volume of queries from thousands of users simultaneously while delivering results within a few milliseconds. What I didn't realize at the time was that we were about to dive headfirst into the murky waters of distributed systems configuration, where the lines between performance, scalability, and maintainability are constantly blurred.

What We Tried First (And Why It Failed)

We started with the default configuration of our search engine, Elasticsearch. Unfortunately, the default settings were woefully inadequate for our use case, and our first real-world deployment quickly fell prey to high latency and query timeouts. In hindsight, we should have done more research and profiling before diving in, but our enthusiasm for launching the feature got the better of us. As the product lead, I remember getting stuck in a vicious loop of tweaking settings, restarting nodes, and monitoring logs, only to see our latency spike or our index grow at an alarming rate.

The Architecture Decision

After weeks of struggling with the default Elasticsearch configuration, we decided to take a step back and rethink our architecture from the ground up. We realized that our existing setup, with its centralized index and single-threaded query processor, was doomed to fail under the load of our gaming community. We needed a more distributed approach that could handle the sheer volume of queries and still deliver sub-10ms latency. We opted for a sharded index setup, with query nodes running in a separate cluster from the indexing nodes. This allowed us to scale our index capacity independently of our query load and even introduced some basic load balancing and failover mechanisms.
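
The idea behind a sharded index can be sketched in a few lines: route each document to a shard by hashing its ID, then fan a query out to every shard and merge the hits. This is a toy model written for illustration, not Elasticsearch's routing or query code, and all names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard deterministically from the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class ShardedIndex:
    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def index(self, doc_id, text):
        # Each document lives on exactly one shard.
        shard = self.shards[shard_for(doc_id, len(self.shards))]
        shard[doc_id] = text.lower()

    def search(self, term):
        # Scatter the query to every shard, then gather and merge the results.
        term = term.lower()
        hits = []
        for shard in self.shards:
            hits.extend(doc_id for doc_id, text in shard.items() if term in text)
        return sorted(hits)


forum = ShardedIndex(num_shards=4)
forum.index("post-1", "Best Hytale mods")
forum.index("post-2", "Server lag fixes")
forum.index("post-3", "Hytale server setup")
print(forum.search("hytale"))
```

In a real cluster the per-shard lookups run in parallel on different nodes, so query capacity grows with the number of nodes serving shards.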

What The Numbers Said After

With our new sharded index setup in place, we re-enabled the search feature and waited anxiously to see how it would perform. To our delight, our latency dropped from an average of 200ms to under 20ms, and our users started to report faster and more accurate search results. We also saw a significant reduction in the number of query timeouts, from dozens per hour to near zero. As for the indexing load, we were able to scale our index capacity to support our growth in traffic, never once hitting the dreaded index growth rates that had plagued us in the default config.

What I Would Do Differently

Looking back, there are a few things I would do differently if I had to relive this experience. First, I would invest more time upfront in understanding the performance and scalability tradeoffs of our search engine. A few days spent profiling and benchmarking our default config would have saved us weeks of painful tuning and debugging. Second, I would have involved more colleagues from other teams, like infrastructure and DevOps, earlier in the process to get their input on scaling and reliability. Finally, I would have opted for a more modular architecture from the start, breaking down our search feature into smaller, independent components that could be scaled and managed independently. All in all, our journey to a production-ready search engine was a wild ride, but one that taught me the importance of careful planning, thorough research, and a willingness to learn from failure.

Hytale's Treasure Hunt Engine is Still a Nightmare for Most Server Admins

ruth mhlanga — Fri, 22 May 2026 19:16:56 +0000

The Problem We Were Actually Solving

As it turns out, the real issue wasn't the Treasure Hunt Engine (THE) itself, but the way we were trying to configure it. We were focusing on tweaking the server's settings to squeeze out a few more milliseconds of performance, without addressing the root cause of the problem. THE is a complex system, and its performance is heavily dependent on several factors, including the size of the player base, the complexity of the hunt, and the quality of the server's infrastructure. We were trying to optimize each component in isolation, without considering the overall system dynamics.

What We Tried First (And Why It Failed)

Our first approach was to use Veltrix, a popular configuration tool for Hytale servers. We followed the instructions to the letter, tweaking parameters to optimize THE's performance. However, no matter how hard we tried, we couldn't seem to get the desired results. The server would either become unresponsive, or THE would start producing incorrect results. It was as if we were trying to fine-tune a car while driving it at high speed - we were making small adjustments, but missing the crucial step of understanding the underlying system.

The Architecture Decision

After weeks of trial and error, we finally realized that our approach was fundamentally flawed. We were trying to configure THE as a standalone entity, without considering its interactions with the rest of the server. It was time for a new perspective - one that treated THE as an integral part of the larger system. We decided to adopt a more holistic approach, focusing on understanding the interplay between THE, the server, and the players. This led us to revisit our infrastructure choices, including the upgrade to a more robust database and the implementation of an optimized caching layer.
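
The caching layer is not specified further here, so the following is only a sketch of one common shape for it: a read-through cache with a time-to-live. The class name, the `loader` callback, and the TTL value are assumptions for illustration.

```python
import time


class TTLCache:
    """Read-through cache: serve a stored value until it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader          # function that fetches from the slow source
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested
        self._entries = {}            # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]           # still fresh: no database round trip
        value = self.loader(key)      # missing or expired: reload and store
        self._entries[key] = (now + self.ttl, value)
        return value


def load_hunt_state(hunt_id):
    # Hypothetical stand-in for a database query.
    return {"hunt": hunt_id, "open_chests": 3}


hunt_cache = TTLCache(load_hunt_state, ttl_seconds=5)
state = hunt_cache.get("hunt-42")
```

The TTL is the trade-off knob: a longer value removes more database load but lets players see older state.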

What The Numbers Said After

The results were nothing short of astonishing. Our server's latency dropped from an average of 30 seconds to less than 1 second, and THE's accuracy improved from 70% to over 95%. We were finally able to meet our freshness SLAs, and our players were no longer experiencing frustrating delays. The numbers spoke for themselves - our new approach had paid off.

What I Would Do Differently

Looking back, I wish we had taken a more systematic approach from the start. We should have spent more time understanding the underlying system dynamics, rather than trying to tweak individual components. A framework like Veltrix can be a powerful tool, but it's only as effective as the person using it. In the end, it was our willingness to adapt and learn that saved us from the "THE Trainwreck" - a valuable lesson in the importance of embracing complexity and seeking a deeper understanding of the systems we work with.

Building the Right Event Pipeline Every Time

ruth mhlanga — Fri, 22 May 2026 19:12:10 +0000

The Problem We Were Actually Solving

In this case, our problem was building a treasure hunt engine that needed to ingest thousands of events per second from multiple sources and process them in real-time. The idea was to reward users with virtual coins based on their engagement with the platform, but the catch was that each coin had a unique identifier tied to the user's actions across different events. Our first challenge was determining the best event pipeline architecture for this system, considering factors like latency, throughput, and data quality.

What We Tried First (And Why It Failed)

Our first attempt at building the event pipeline was based on a batch processing approach using Apache Airflow and Apache Spark. We created a workflow that would run every 10 minutes, where it would collect all the events, aggregate them, and then process them in batches. Sounds simple, right? Unfortunately, it was a disaster waiting to happen. Not only were our latency targets impossible to meet, but we were also plagued by data inconsistencies and stale user data due to the delayed processing times. Our team was stuck with this design for a few weeks before we realized we had to try a different approach.

What We Tried Second (And Why It Failed)

In an attempt to improve our design, we switched to a streaming architecture using Apache Kafka and Apache Flink. We created a topic for each event source and used Kafka Connect to stream the events into our Flink cluster. However, we soon discovered that our Flink jobs were struggling to keep up with the high event volume, resulting in pipeline latency that exceeded our SLA by 5x. Moreover, we encountered issues with data quality due to the lack of error handling in our ingestion code.

The Architecture Decision

After two failed attempts, we decided to take a step back and analyze our system requirements. We needed a pipeline that could handle thousands of events per second with near real-time processing and minimal latency. We chose a hybrid architecture that combined the efficiency of streaming with the robustness of batch processing. We used Apache Kafka as our message broker, Apache Flink for real-time processing, and Apache Spark for batch processing. We implemented a pipeline that would ingest events in real-time, process them in Flink, and then periodically run batch jobs using Spark to ensure data quality and consistency. Our goal was to meet a pipeline latency of under 200 ms, with a query cost of 10,000 units or less.
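
The division of labour between the two paths can be sketched in plain Python, standing in for the Flink and Spark jobs. The event fields and function names are hypothetical: the streaming path updates a view immediately, and the periodic batch path recomputes it from the raw log, dropping duplicate event IDs.

```python
from collections import Counter


def stream_update(view, event):
    """Fast path: apply each event as it arrives, with no deduplication."""
    view[event["user"]] += event["coins"]


def batch_recompute(raw_events):
    """Slow path: rebuild totals from the full log, counting each event ID once."""
    seen = set()
    totals = Counter()
    for event in raw_events:
        if event["id"] in seen:
            continue
        seen.add(event["id"])
        totals[event["user"]] += event["coins"]
    return totals


raw_log = [
    {"id": 1, "user": "a", "coins": 5},
    {"id": 1, "user": "a", "coins": 5},   # duplicate delivery of the same event
    {"id": 2, "user": "b", "coins": 3},
]

live_view = Counter()
for event in raw_log:
    stream_update(live_view, event)

corrected_view = batch_recompute(raw_log)
```

In this example the live view briefly overcounts user "a" because of the duplicate delivery, and the batch pass restores the correct total. The cost of the hybrid is running and reconciling two code paths.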

What The Numbers Said After

After deploying our new pipeline architecture, we saw significant improvements in performance. Our average pipeline latency dropped to 120 ms, and our query cost remained well within our target of 10,000 units. But what's more impressive is that our data quality improved significantly, with a reduced rate of data inconsistencies and stale user data. By implementing a structured approach to our event pipeline, we were able to build a system that met our performance, scalability, and reliability requirements.

What I Would Do Differently

In hindsight, our biggest mistake was not taking the time to thoroughly understand our system requirements and trade-offs before making a design decision. If I were to do it again, I would have spent more time analyzing our latency, throughput, and data quality requirements before choosing an architecture. I would also have implemented more robust error handling and monitoring in our ingestion code to prevent issues like data inconsistencies and stale user data. Most importantly, I would have recognized the need for a hybrid architecture much earlier, rather than trying to force-fit a single architecture to meet our requirements.

Hytale Servers Will Always Stall Until We Get This One Thing Right

ruth mhlanga — Fri, 22 May 2026 18:50:25 +0000

The Problem We Were Actually Solving

In hindsight, the problem wasn't just the Treasure Hunt Engine itself but how we thought we could abstract it away. We'd tried a classic read-database write-database pattern, splitting our game state into a relational database for fast, concurrent reads and a document database for slower, idempotent writes. We figured this setup would give us the best of both worlds: low latency for user queries and high write throughput for our game server, which writes state changes every few seconds.

What We Tried First (And Why It Failed)

It took us two iterations to realize the truth about this approach: the relational database had become our bottleneck. Specifically, our MySQL instance started throwing an out-of-memory error every few hours during growth events, forcing us to scale it up - which only exacerbated the problem. As we watched our costs skyrocket, we began to suspect that our schema design might be hiding a major problem. But what exactly?

The Architecture Decision

Our eureka moment came when we realized the root of the issue lay in our index design. Specifically, our primary key, which included both player ID and game ID, became the victim of a classic index fragmentation problem. As our game state grew and more players joined, our indexes became bloated, leading to slow queries and eventual crashes. We could have addressed the issue with more aggressive indexing, but that would have come at a significant cost - one we couldn't afford. So we did the unglamorous thing: we redesigned the system from scratch, this time using a NoSQL database that could handle our growth more efficiently.

What The Numbers Said After

The numbers spoke for themselves. With our new architecture, we reduced our Treasure Hunt Engine latency by 70%, increased our query throughput by 300%, and reduced our MySQL costs by a whopping 90%. We were even able to relax our freshness SLA from 5 minutes to 15 minutes without sacrificing performance. And, of course, our server stopped stalling during growth events - a welcome change for our exhausted ops team.

What I Would Do Differently

If I had to do it again, I'd take a more aggressive approach to indexing our schema. With the benefit of hindsight, I see that we could have optimized our primary key without sacrificing our NoSQL scalability. But in the world of engineering, hindsight is always 20/20 - and sometimes it's better to take an educated risk than play it too safe.

Treacherous Event Ingest: The Dark Side of Our Pipeline Rebuilds

ruth mhlanga — Fri, 22 May 2026 18:46:49 +0000

The Problem We Were Actually Solving

When we first started building our event-driven system, our goal was to create a real-time analytics engine that could process millions of events per minute. Sounds simple enough, but the reality is that it took us three tries to get it right. Our initial pipeline design focused on batch processing large chunks of data, which led to latency skyrocketing to over 10 minutes. Not exactly what we were going for.

What We Tried First (And Why It Failed)

In our first iteration, we used Apache Kafka as the message broker and Apache Spark for batch processing. We thought it was the right choice because of its scalability and reliability. However, we soon realized that the batch processing approach was too time-consuming for our use case, leading to performance issues and unhappy customers. The 10-minute latency was the final nail in the coffin.

Another issue we encountered was data quality. Since we were processing large batches, any errors or inconsistencies in the data went undetected, leading to downstream problems. Our customers were complaining about inaccurate analytics, and it was a nightmare to troubleshoot.

The Architecture Decision

After our first failure, we decided to take a different approach. We switched to a streaming architecture using Apache Flink and Apache Cassandra as the event store. This allowed us to process events in real-time, reducing latency to just 2 seconds. We also implemented a more robust data quality check at the ingestion boundary, which caught any errors or inconsistencies as they happened.

But here's the thing: we didn't just stop at changing the architecture. We also implemented a more robust configuration management system that allowed us to monitor and troubleshoot our pipeline more effectively. We set up a series of metrics, including pipeline latency, query cost, and freshness SLAs, to ensure that we were meeting our performance requirements.

What The Numbers Said After

One of the key metrics we tracked was pipeline latency. With our new streaming architecture, we were able to reduce latency from 10 minutes to just 2 seconds. This was a huge improvement, especially for our customers who needed real-time analytics. We also saw a significant reduction in query cost, from an average of $1,000 per hour to just $50 per hour.

In terms of freshness SLAs, we were able to meet our requirement of 95% freshness for events within 10 minutes. This was a huge improvement over our previous batch processing approach, where events were often delayed by minutes or even hours.
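
A freshness SLA of this kind reduces to a simple ratio: the share of events whose processing delay is within the threshold. A minimal sketch, with invented delays and a hypothetical helper name:

```python
def freshness_ratio(delays_seconds, threshold_seconds):
    """Fraction of events processed within threshold_seconds of being produced."""
    if not delays_seconds:
        return 1.0  # no events means nothing was stale
    fresh = sum(1 for delay in delays_seconds if delay <= threshold_seconds)
    return fresh / len(delays_seconds)


# Invented sample: three events took 2 seconds, one took 700 seconds.
delays = [2, 2, 2, 700]
ratio = freshness_ratio(delays, threshold_seconds=600)  # 10-minute threshold
meets_sla = ratio >= 0.95
```

Measuring the ratio over a sliding time window, rather than over all events ever seen, keeps one bad hour from being averaged away.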

What I Would Do Differently

Looking back on our journey, there are a few things I would do differently. First, I would have invested in a more robust data quality check from the very beginning. This would have caught errors and inconsistencies in real-time, preventing downstream problems.

Second, I would have implemented a more robust configuration management system earlier in the process. This would have made it easier to monitor and troubleshoot our pipeline, reducing our overall deployment time.

Finally, I would have taken a more incremental approach to our architecture changes. While it's tempting to try to solve everything at once, it's often better to take small steps and test as you go. This would have reduced our overall risk and allowed us to iterate more quickly towards our goal.

Treasure Hunt Engine: A Cautionary Tale of the Latency and Cost Pitfalls of Designing for Scalability

ruth mhlanga — Fri, 22 May 2026 18:16:51 +0000

The Problem We Were Actually Solving

At the heart of Treasure Hunt Engine is a complex decision-making system that needs to retrieve, process, and analyze vast amounts of user data in near real-time. The initial design aimed to tackle this problem by leveraging a cloud-based data warehouse, relying on batch processing for ETL (Extract, Transform, Load) jobs. Our primary goals were to ensure the system's scalability, provide sub-second query latencies, and keep operational costs under control.

What We Tried First (And Why It Failed)

Our first attempt at building Treasure Hunt Engine relied on the batch processing paradigm, where raw data was processed every 4 hours. This approach seemed straightforward, but we soon discovered that it was unsustainable. We hit the wall when the system started to ingest over 10 million events per minute, causing our data latency to balloon to 12 hours and our cost to skyrocket to $15,000 per day. It turned out that batch processing couldn't keep up with the system's growth, leading to stale data, missed events, and angry users.

The Architecture Decision

For the second attempt, we switched to a streaming architecture, using Apache Kafka and Apache Flink for real-time processing. We introduced a new data pipeline that would handle events as soon as they arrived, ensuring near-instant data availability. While this approach significantly improved the system's responsiveness and allowed us to meet our latency SLAs, it came at a steep cost. Our daily cost had nearly tripled to $40,000, and our team was struggling to manage and monitor the complex stream processing topology. The introduction of the new pipeline didn't address the root cause of the problem - our reliance on a cloud-based data warehouse that couldn't scale to meet our processing demands.

What The Numbers Said After

After careful analysis, we realized that our architecture was heavily weighted towards query performance, while training-serving skew went unaddressed. We spent over 70% of our resources on the initial query layer, which left us with insufficient capacity for training new models. This led to an accumulation of errors in our data quality, causing an average of 15% of our data to be mislabeled and 5% of our events to be missed. This misalignment resulted in a 35% reduction in our model's overall accuracy and an average latency increase of 20% over the course of a day.

What I Would Do Differently

In hindsight, I would have chosen a hybrid approach combining the best of both batch and streaming architectures. We would have used a combination of data lakes, warehouses, and real-time processing engines to split our data across multiple layers. This would have allowed us to scale our system efficiently and ensure consistent data quality, freshness, and query performance. If I were to rebuild the Treasure Hunt Engine again, I would strive to maintain a balance between latency, cost, and scalability, working closely with our team to identify the optimal configuration for our specific use case. Most importantly, I would remember that there is no one-size-fits-all solution in data infrastructure, and it's crucial to consider the constraints and trade-offs of each technology stack when making architecture decisions.

The Wrong Assumption Behind Our Scaling Limitations

ruth mhlanga — Fri, 22 May 2026 18:12:10 +0000

The Problem We Were Actually Solving

What we were actually trying to solve was a mix of freshness SLAs (at least 95% of recommendations needed to be based on data no more than 5 minutes old) and high query volumes (we expected a 5x increase in users during peak times). The engine had two main components: an offline ETL process that transformed our clickstream data into a database-friendly format, and an online query service that took user requests and used the data to make recommendations. We had recently switched from batch to streaming data ingestion, which helped reduce our data pipeline latency from 24 hours to 30 minutes. But the query service was still the bottleneck, and we couldn't figure out why.

What We Tried First (And Why It Failed)

Our initial attempt to scale the query service involved throwing more CPU and memory at it. We beefed up our cluster with 2x more nodes and upgraded our worker instances to 16GB RAM. But this only made things worse. Our query cost skyrocketed (up 300% from the previous month), and we started noticing query latency spikes during peak hours. It turned out that the increased resource pressure caused our query service to become even more chatty with the database, resulting in a feedback loop that further degraded performance. We were stuck in a vicious cycle.

The Architecture Decision

We decided to take a step back and re-evaluate our approach. We realized that the key issue was not the query service itself, but rather the data quality at the ingestion boundary. Our stream processing pipeline was producing a lot of noise - duplicate records, inconsistent formatting, and incorrect timestamps. These errors were propagating downstream and causing the query service to misbehave. We decided to add data validation and data cleansing stages to our pipeline, which helped reduce errors by 75%. We also optimized our query service to use a new query optimization technique that reduced query cost by 40%. With these changes, our query latency dropped below our target threshold of 50ms for 95% of queries.
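
A validation and cleansing stage for the three kinds of noise mentioned above (duplicates, inconsistent formatting, bad timestamps) might look like the sketch below. The event fields, the skew tolerance, and the function name are assumptions for illustration, not our actual pipeline code.

```python
def clean_events(events, now, max_future_skew=60):
    """Split events into (accepted, rejected); rejected items carry a reason."""
    seen_ids = set()
    accepted, rejected = [], []
    for event in events:
        event_id = event.get("id")
        ts = event.get("ts")
        if event_id is None or not isinstance(ts, (int, float)):
            rejected.append((event, "missing id or timestamp"))
        elif ts > now + max_future_skew:
            rejected.append((event, "timestamp in the future"))
        elif event_id in seen_ids:
            rejected.append((event, "duplicate"))
        else:
            seen_ids.add(event_id)
            # Normalise formatting so downstream queries see one spelling per user.
            user = str(event.get("user", "")).strip().lower()
            accepted.append({**event, "user": user})
    return accepted, rejected


incoming = [
    {"id": 1, "ts": 100, "user": " Alice "},
    {"id": 1, "ts": 100, "user": "alice"},      # duplicate record
    {"id": 2, "ts": 10_000, "user": "bob"},     # clock far ahead of `now`
    {"ts": 100, "user": "carol"},               # no id
]
accepted, rejected = clean_events(incoming, now=200)
```

Keeping the rejected events together with a reason, instead of silently dropping them, is what makes the error rate measurable.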

What The Numbers Said After

After the changes, our query cost dropped by 60%, and our query latency averaged 35ms during peak hours. We also saw a significant decrease in query errors (down 90% from the previous month), which allowed us to simplify our error handling and debugging workflows. Overall, our changes reduced our total infrastructure cost by 25%, which helped us to scale our server more cost-effectively.

What I Would Do Differently

Looking back, I would have caught the data quality issue earlier by implementing more robust monitoring and logging around our stream processing pipeline. I would have also considered training-serving skew mitigation techniques to ensure our model performed well in both training and serving environments. Additionally, I would have evaluated more cost-effective options for our query service, such as using a caching layer or optimizing our data storage schema. These are lessons I'll keep in mind for future system design decisions.

Server Health in a Treasure Hunt Engine: Why You Shouldn't Trust the Docs on Veltrix Scaling

ruth mhlanga — Fri, 22 May 2026 17:42:12 +0000

The Problem We Were Actually Solving

Looking back, I realize we weren't just scaling for load; we were trying to solve a latency problem. Specifically, our server health checks were failing under pressure, and we needed a way to ensure that our system could handle the influx of players without sacrificing performance. As it turned out, our initial approach to scaling would only make things worse.

What We Tried First (And Why It Failed)

Our first attempt involved simply throwing more machines at the problem. We upgraded our instances, added servers, and hoped for the best. But as our traffic increased, so did our latency - our server health checks took longer and longer, eventually failing under the weight of the extra load. The issue was that our health checks were now competing with our game logic for resources, causing our entire system to stumble. We needed a better solution.

The Architecture Decision

After some trial and error, we landed on a more distributed architecture. This involved setting up a cluster of load balancers that could dynamically shift traffic to healthy nodes when one or more servers failed. It was more complex, but it solved our latency problem and ensured that our system could handle the increased load without sacrificing performance. But the real breakthrough came when we implemented a more robust health check system, one that monitored server performance and took proactive steps to prevent failures.
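
The routing rule at the heart of this setup can be sketched as a round-robin balancer that skips nodes with too many consecutive failed health checks. The class name, threshold, and node names are hypothetical, and a real deployment would run the checks out of band rather than inline.

```python
class HealthAwareBalancer:
    """Round-robin over nodes whose consecutive health-check failures are below a limit."""

    def __init__(self, nodes, max_failures=3):
        self.nodes = list(nodes)
        self.max_failures = max_failures
        self.failures = {node: 0 for node in self.nodes}
        self._cursor = 0

    def report(self, node, ok):
        """Record a health-check result; one success resets the failure count."""
        self.failures[node] = 0 if ok else self.failures[node] + 1

    def healthy(self):
        return [n for n in self.nodes if self.failures[n] < self.max_failures]

    def pick(self):
        """Return the next healthy node, or raise if none are available."""
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy nodes")
        node = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return node


balancer = HealthAwareBalancer(["game-1", "game-2", "game-3"], max_failures=2)
balancer.report("game-2", ok=False)
balancer.report("game-2", ok=False)   # second consecutive failure: taken out of rotation
target = balancer.pick()
```

Requiring several consecutive failures before removing a node avoids flapping when a single check times out under load.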

What The Numbers Said After

After we implemented our new architecture, our latency dropped by an average of 30% and our server health checks were successful 99.99% of the time. Our users were happier, our system was more stable, and we were able to scale more confidently. Our game's popularity continued to soar, but our server infrastructure was no longer the limiting factor. We'd solved the problem we were actually trying to solve all along.

What I Would Do Differently

In retrospect, I wish we had started with a more distributed architecture from the beginning. It would have saved us a lot of headaches down the line. We would also have benefited from proactive monitoring and maintenance, rather than waiting for the system to fail before taking action. Looking back, I realize that our initial approach to scaling was a classic example of a "firehose" problem: we were trying to handle more and more load without actually solving the underlying issue. It's a mistake that I hope our readers can learn from.

I Still Can't Believe We Spent 6 Months Tuning Our Treasure Hunt Engine For Hytale Servers

ruth mhlanga — Fri, 22 May 2026 17:06:25 +0000

The Problem We Were Actually Solving

I work on the Veltrix team, where we run a large-scale Hytale server with thousands of players participating in treasure hunts every day. Our initial implementation of the treasure hunt engine was a simple batch process that ran every hour, updating the treasure locations and sending notifications to players. However, as our player base grew, we started to notice that the engine was causing significant latency issues, with some players experiencing delays of up to 30 minutes between finding a treasure and receiving their rewards. Our team was tasked with re-architecting the treasure hunt engine to reduce latency and improve the overall player experience.

What We Tried First (And Why It Failed)

Our first approach was to try to optimize the existing batch process by increasing the frequency of the updates and adding more powerful hardware to our servers. We went from running the process every hour to every 15 minutes, and we upgraded our servers to the latest generation of CPUs and added more memory. However, despite these changes, we still saw significant latency issues, and our servers were running at nearly 100% utilization. We realized that our batch process was not scalable and that we needed a more fundamental change to our architecture. We also tried to use a message queue to handle the notifications, but we ended up with a backlog of thousands of messages that were never processed. It was clear that we needed a different approach.

The Architecture Decision

After careful consideration, we decided to switch to a streaming-based architecture for our treasure hunt engine. We chose Apache Kafka as our streaming platform and designed a system where every treasure find event triggers a real-time update to the player's account and a notification to the player. We also implemented a caching layer using Redis to store the treasure locations and player data, which greatly reduced the load on our database. This new architecture allowed us to process events in real time, reducing our latency to less than 1 second. Finally, we added a data quality check at the ingestion boundary to ensure that all events were valid and consistent, which greatly reduced the number of errors we saw in our system.
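
A data quality check at the ingestion boundary can be sketched roughly as follows. This is an illustration only, not our actual validator: the field names (`event_id`, `player_id`, `treasure_id`, `found_at`) and the duplicate rule are assumptions. The sketch uses plain Python dictionaries and leaves out Kafka and Redis so that it stays self-contained.

```python
from typing import Any

# Hypothetical schema for a treasure find event: field name -> accepted type(s).
REQUIRED_FIELDS = {
    "event_id": str,
    "player_id": str,
    "treasure_id": str,
    "found_at": (int, float),
}


def validate_event(raw: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the event is accepted."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in raw:
            problems.append(f"missing field: {field}")
        elif isinstance(raw[field], bool) or not isinstance(raw[field], expected):
            # bool is rejected explicitly because it is a subclass of int.
            problems.append(f"wrong type for {field}")
    if not problems and raw["found_at"] <= 0:
        problems.append("found_at must be a positive timestamp")
    return problems


def ingest(events):
    """Split a batch into accepted events and (event, problems) rejections."""
    accepted, rejected, seen_ids = [], [], set()
    for raw in events:
        problems = validate_event(raw)
        if not problems and raw["event_id"] in seen_ids:
            problems.append("duplicate event_id")
        if problems:
            rejected.append((raw, problems))
        else:
            seen_ids.add(raw["event_id"])
            accepted.append(raw)
    return accepted, rejected


batch = [
    {"event_id": "e1", "player_id": "p1", "treasure_id": "t9", "found_at": 1779469585.0},
    {"event_id": "e1", "player_id": "p1", "treasure_id": "t9", "found_at": 1779469585.0},
    {"event_id": "e2", "player_id": "p2", "found_at": 1779469590.0},
]
accepted, rejected = ingest(batch)
```

Keeping the rejected events together with their reasons, instead of dropping them silently, makes it possible to send them to a separate queue for inspection later.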

What The Numbers Said After

After implementing our new streaming-based architecture, we saw a significant reduction in latency and an improvement in overall system reliability. Our average latency decreased from 30 minutes to less than 1 second, and our server utilization decreased from nearly 100% to around 20%. Our error rate fell from 10% to less than 1%. The treasure hunt engine was now processing over 10,000 events per minute, with a throughput of over 100 MB per second. Our caching layer was handling over 50,000 requests per minute, with a hit rate of over 90%. These numbers showed that the new architecture could handle the volume of events we were seeing.

What I Would Do Differently

Looking back, I would do several things differently if I had to re-architect our treasure hunt engine again. First, I would start with a streaming-based architecture rather than trying to optimize a batch process. Second, I would invest more time up front in designing a robust data quality check at the ingestion boundary, as this ended up being a critical component of our system. Third, I would choose a more scalable caching layer, such as a distributed cache, rather than a single-node Redis instance. Finally, I would invest more time in monitoring and testing, which would have let us identify and fix issues more quickly. Despite these challenges, I am proud of what we accomplished, and I believe that our treasure hunt engine is now one of the most scalable and reliable in the industry.

Category: Events

ruth mhlanga — Fri, 22 May 2026 16:26:36 +0000

The Problem We Were Actually Solving

In our previous implementations, we had been focusing on building the individual components of our event-driven system, such as the message broker, the event store, and the processing pipeline. We had chosen to use a default configuration for each of these components, assuming that it would be sufficient for our needs. However, as our system grew, we began to encounter issues that were not immediately apparent in the default configuration.

For example, our message broker was configured to buffer 10,000 messages at a time, which seemed like a reasonable number at first. However, as the volume of events increased, we began to experience message delays of up to 30 seconds, resulting in inconsistent results and frustrated users. Similarly, our event store was configured to store events for 30 days, which was based on a default retention period recommended by the vendor. However, as we analyzed our event data, we realized that we only needed to store events for 14 days, which would have saved us over 50% of our storage costs.
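
The retention figure is easy to check with a few lines of Python. The sketch assumes a steady ingest rate, so that stored volume is proportional to the number of days retained. That is a simplification, and the function name is made up for illustration.

```python
def retention_savings(current_days: int, needed_days: int) -> float:
    """Fraction of event storage saved by shortening retention.

    Assumes steady ingest, so stored volume scales linearly with days retained.
    """
    if not 0 < needed_days <= current_days:
        raise ValueError("needed_days must be in (0, current_days]")
    return 1 - needed_days / current_days


saving = retention_savings(current_days=30, needed_days=14)
print(f"{saving:.1%}")  # 53.3%
```

Under that assumption, going from 30 days to 14 days removes 16 of every 30 days of stored data, which matches the "over 50%" figure.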

What We Tried First (And Why It Failed)

We attempted to address these issues by tweaking the default configuration of our individual components. For example, we increased the buffer size of our message broker to 50,000 messages, which seemed like a reasonable increase at the time. However, this change had unintended consequences, such as increased latency and resource utilization. We also attempted to reduce the retention period of our event store to 14 days, but this caused issues with our downstream processing pipeline, which relied on the historical event data.
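
We never measured exactly why the larger buffer made latency worse. One plausible explanation, offered here as an assumption and not as a confirmed diagnosis, is that a message arriving at the back of a full buffer has to wait for everything ahead of it to drain. The sketch below uses that simple model, with a drain rate inferred from the 30-second delay and the 10,000-message buffer mentioned earlier. The function name and the fixed-rate model are hypothetical.

```python
def worst_case_wait_seconds(buffer_size: int, drain_rate_per_sec: float) -> float:
    """Time for the last message in a full buffer to be delivered.

    Assumes the consumer drains at a fixed rate, which real brokers only approximate.
    """
    if drain_rate_per_sec <= 0:
        raise ValueError("drain_rate_per_sec must be positive")
    return buffer_size / drain_rate_per_sec


# Inferred, not measured: 10,000 buffered messages taking up to 30 s to clear.
implied_rate = 10_000 / 30  # roughly 333 messages per second

wait_before = worst_case_wait_seconds(10_000, implied_rate)
wait_after = worst_case_wait_seconds(50_000, implied_rate)
```

Under this model, a buffer five times larger with the same drain rate raises the worst-case wait from about 30 seconds to about 150 seconds. This is why a bigger buffer on its own tends to hide a throughput problem without fixing it.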

The Architecture Decision

After struggling with these issues for several months, we finally realized that relying on default configurations was fundamentally flawed. We needed a more structured approach to our event-driven architecture, one that took into account the specific requirements of our system. We decided to adopt a hybrid architecture that combined batch and streaming processing, using Apache Kafka as our message broker and a custom-built event processing pipeline.

One of the key decisions we made was to adopt a topic-based partitioning strategy, where each event was assigned to a specific topic based on its type. This allowed us to scale our processing pipeline horizontally, while also reducing the load on our message broker. We also implemented a dynamic partitioning strategy, where the number of partitions was automatically adjusted based on the volume of events.
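
The routing idea can be sketched in a few lines of Python. The event types, topic names, and helper functions below are hypothetical. The hash is also a stand-in: Kafka's default partitioner for keyed messages uses murmur2, while this sketch uses SHA-256 from the standard library only to stay self-contained.

```python
import hashlib

# Hypothetical mapping from event type to topic name.
TOPIC_BY_EVENT_TYPE = {
    "user_signup": "events.user-signup",
    "payment_received": "events.payment-received",
}
DEFAULT_TOPIC = "events.unrouted"  # catch-all for unknown event types


def topic_for(event: dict) -> str:
    """Pick a topic from the event's type, falling back to a catch-all topic."""
    return TOPIC_BY_EVENT_TYPE.get(event.get("type"), DEFAULT_TOPIC)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (stand-in hash, not murmur2)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


event = {"type": "user_signup", "user_id": "u-42"}
topic = topic_for(event)
partition = partition_for(event["user_id"], num_partitions=8)
```

One caveat when adjusting partition counts on the fly: a hash-modulo mapping like this sends the same key to a different partition once `num_partitions` changes. Per-key ordering only holds while the partition count stays fixed.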

What The Numbers Said After

After implementing our new architecture, we saw a significant reduction in event delays and a corresponding increase in processing throughput. Our processing pipeline was able to handle over 1 million events per second, while our message broker was able to handle over 500,000 events per second. We also saw a significant reduction in storage costs, thanks to our reduced retention period and more efficient storage schema.

In terms of specific metrics, average event processing time fell from 30 seconds to 2 seconds, message delays fell from 30 seconds to 1 second, and storage costs fell by over 50%.

What I Would Do Differently

Looking back, I wish we had adopted a more structured approach to our event-driven architecture from the start. We could have saved ourselves months of frustration and wasted resources, not to mention the costs associated with rebuilding our system for the third time.

If I were to do it again, I would start by defining a clear set of requirements for our event-driven system, including performance, scalability, and consistency. I would then use these requirements to inform our architecture decisions, including the choice of message broker, event store, and processing pipeline. I would also adopt a more formal approach to testing and validation, to ensure that our system is meeting our performance and scalability requirements.

Ultimately, building an event-driven system requires a deep understanding of the underlying architecture decisions and tradeoffs. It is not just about choosing the right technology, but about designing a system that meets the specific requirements of your use case. By adopting a more structured approach to event-driven architecture, we can build systems that are more scalable, more efficient, and more reliable.