<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: LogDNA</title>
    <description>The latest articles on Forem by LogDNA (@logdna).</description>
    <link>https://forem.com/logdna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2417%2Fd3a9a2b0-bd89-49d9-98d9-3b617548f457.png</url>
      <title>Forem: LogDNA</title>
      <link>https://forem.com/logdna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/logdna"/>
    <language>en</language>
    <item>
      <title>Tailing Logs with Node.js</title>
      <dc:creator>Darin Spivey</dc:creator>
      <pubDate>Thu, 22 Oct 2020 15:02:59 +0000</pubDate>
      <link>https://forem.com/logdna/tailing-logs-with-node-js-4h6p</link>
      <guid>https://forem.com/logdna/tailing-logs-with-node-js-4h6p</guid>
      <description>&lt;p&gt;Most techies will be familiar with the age-old &lt;code&gt;tail -f &amp;lt;filename&amp;gt;&lt;/code&gt; command in Unix-like systems. It's great for shell scripting and CLI commands, but what about being able to tail a file in a language such as Node.js? Sure, you could fork a child process and scrape &lt;code&gt;stdout&lt;/code&gt;, but that's not fun (or reliable) no matter what language you're in. Let's examine what it takes to do it right.&lt;/p&gt;

&lt;h1&gt;Why Do We Need It?&lt;/h1&gt;

&lt;p&gt;The ability to "tail" a file in Node.js has many use cases. It could be for a sysadmin dashboard that looks for certain errors in &lt;code&gt;/var/log/system.log&lt;/code&gt;, where you'd want to examine every line, across log rolls, for a particular pattern. Even if log rolling isn't a concern, if a file needs to be tailed programmatically, something that avoids creating a child process to run the real &lt;code&gt;tail -f&lt;/code&gt; command is less expensive and easier to manage.&lt;/p&gt;

&lt;p&gt;For LogDNA, the tail of a file is the foundation of our Node-based agents: They need to watch (many) files for changes and send those lines up to LogDNA servers for ingestion, so they need a Node-based tailing method. Unfortunately for us, several of the packages available on NPM, although they appropriately use streams, do not properly respect stream backpressure and blindly &lt;code&gt;push&lt;/code&gt; data through the stream regardless of whether or not something is consuming it. That's a big no-no at large scale, since it can lead to data loss in the stream. If a tail read stream is going to feed a logging system, it needs to implement this functionality properly.&lt;/p&gt;

&lt;h1&gt;What Is Stream Backpressure?&lt;/h1&gt;

&lt;p&gt;Backpressure is a condition that happens in both &lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/stream.html#stream_readable_push_chunk_encoding"&gt;readable&lt;/a&gt; and &lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/stream.html#stream_writable_write_chunk_encoding_callback"&gt;writable&lt;/a&gt; streams. Although the word &lt;em&gt;stream&lt;/em&gt; implies a constant flow of data, there is still an internal buffer that acts as a temporary bucket for data to live in while it's being written or read. Think of a busy line at a continental breakfast buffet. There is a steady flow of patrons wanting their morning bagel. As they move past the bagel tray and take one, the employee behind the table must periodically provide fresh bagels to keep the tray full. The same concept applies to streams. The internal buffer (the tray) exists so that data (bagels) can be provided periodically and is always available when it's needed. Streams place data into the buffer by calling a &lt;code&gt;push()&lt;/code&gt; method (for readables) or a &lt;code&gt;write()&lt;/code&gt; method (for writables). The problem is that the buffer size is NOT unlimited and can therefore fill up. When that happens, Node.js calls it &lt;a href="https://nodejs.org/en/docs/guides/backpressuring-in-streams/"&gt;&lt;em&gt;backpressure&lt;/em&gt;&lt;/a&gt;. Whatever is trying to put data into the buffer is told to stop (by &lt;code&gt;push()&lt;/code&gt; or &lt;code&gt;write()&lt;/code&gt; calls returning &lt;code&gt;false&lt;/code&gt;) until Node.js signals that it's ready for more data. Mind you, most of this control flow is internal to Node's various stream classes, but implementers must define functions like &lt;code&gt;_read()&lt;/code&gt;, since Node will call it when backpressure has ended.&lt;/p&gt;

&lt;h1&gt;What Are Some Other Pitfalls?&lt;/h1&gt;

&lt;p&gt;The main difficulty with doing file I/O properly at scale is efficiency. Reading chunks of a file at scale, especially in production, should not be done by reading all of the changes into a buffer. The size of the data that you need to consume may vary widely depending on throughput to the log file. For example, if the log is getting flooded with entries, then a one-second poll could result in thousands of kilobytes (kB) or even megabytes (MB) of log lines that need to be read. Trying to read that into a buffer all at once will, at best, slow down your system; at worst, it will fall over. Just think, then, that a server that does 1000+ requests per second, which is a very reasonable expectation, will have a TON of log entries every second. The sheer scale of that data means backpressure issues are a very real possibility.&lt;/p&gt;

&lt;p&gt;However, creating an efficient tailing package isn’t just dealing with the backpressure problem. Here are some of the challenges that any solution needs to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Since file "watchers" are not reliable across operating systems (even with Node's built-in &lt;code&gt;fs.watch&lt;/code&gt;), we need a polling solution to repeatedly query the file for changes. This requires the code to keep track of the last read position (kind of like remembering where your cursor was when you reopen a document) and whether or not the file has been renamed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consuming the added lines should be done via a stream to avoid reading file chunks into memory all at once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How can we ensure that no lines are lost? If a file is rolled between polls, then the "old" file may contain lines that will not be read on the next poll of the "new" file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Similar to log rolling, if the file is truncated manually or otherwise, the code cannot resume reading from its previous position. It will have to detect this case and start reading from the beginning of the file.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, a tailing solution that accounts for backpressure needs to be able to work with the common problems of log files where data flow is large and the file itself changes constantly and rapidly, whether to be renamed, moved, or truncated, without being overwhelmed by memory concerns.&lt;/p&gt;

&lt;h1&gt;How Did We Do It?&lt;/h1&gt;

&lt;p&gt;For &lt;a href="https://www.npmjs.com/package/@logdna/tail-file"&gt;TailFile&lt;/a&gt;, the open-source package we’ve released, we decided to grapple with the overall problem of file I/O, including the use of streams, the identification of filename changes, and the management of backpressure. As with other packages in the wild, a Node &lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/stream.html#stream_class_stream_readable"&gt;&lt;code&gt;Readable&lt;/code&gt;&lt;/a&gt; stream implementation is the efficient way to read data from a file. That means the main TailFile class in the new package needed to be a &lt;code&gt;Readable&lt;/code&gt; class implementation to consume the tailed bytes. The new TailFile class also uses a stream to read the underlying file resource. This pairing allowed us to use &lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/stream.html#stream_consuming_readable_streams_with_async_iterators"&gt;async/await iterators&lt;/a&gt; to read the file's data rather than use static buffers that would consume much more memory. When that data is read, it is pushed through the main TailFile implementation as if the data came from a single file, despite the possibility of log rolling.&lt;/p&gt;

&lt;p&gt;A differentiator of this code is that it maintains an open filehandle to the log file. This is the key to being able to handle log rolling. When the file changes, the filehandle is still attached to the original file, no matter what the new name is (which isn't possible to know). Although we cannot use &lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/fs.html#fs_fs_createreadstream_path_options"&gt;&lt;code&gt;createReadStream()&lt;/code&gt;&lt;/a&gt; to read from the filehandle, a one-time operation can read the remainder of the file from the last known position. Since we track the "start position", the remainder of the file is just &lt;code&gt;fileSize - startPos&lt;/code&gt;. By reading that chunk, we will get any data added between the previous poll and the rename, and no data will be lost. Successive polls of the new file can use &lt;code&gt;createReadStream()&lt;/code&gt; as normal, and an async/await flow ensures that we read from the file descriptor before streaming data from the newly created file with the same name.&lt;/p&gt;

&lt;p&gt;Another accomplishment of TailFile is its proper implementation of stream backpressure. Backpressure from a stopped consumer can happen if the stream is unpiped after running for a while, or if the consumer, upon starting, does not immediately attach data events or a pipe to put the stream in “&lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/stream.html#stream_two_reading_modes"&gt;flowing mode&lt;/a&gt;.” Following the &lt;code&gt;Readable&lt;/code&gt; implementation rules, if the calls to &lt;a href="https://nodejs.org/dist/latest-v12.x/docs/api/stream.html#stream_readable_push_chunk_encoding"&gt;&lt;code&gt;push()&lt;/code&gt;&lt;/a&gt; return &lt;code&gt;false&lt;/code&gt;, then TailFile pauses until &lt;code&gt;_read()&lt;/code&gt; is called, signifying that there is a consumer reading the data.&lt;/p&gt;

&lt;p&gt;The combination of all of these choices means that TailFile can handle large amounts of data amidst the occasional renaming of the target file without losing any lines.&lt;/p&gt;

&lt;h1&gt;How Can You Help?&lt;/h1&gt;

&lt;p&gt;Do you have a project that needs tail functionality in Node? Please &lt;a href="https://www.npmjs.com/package/@logdna/tail-file"&gt;try our package&lt;/a&gt;! Open &lt;a href="https://github.com/logdna/tail-file-node/issues"&gt;GitHub issues&lt;/a&gt; on the repo to report bugs or even request new features. If you like the project, please give it a "&lt;a href="https://github.com/logdna/tail-file-node"&gt;star&lt;/a&gt;" on GitHub. We are confident that this package can become the best tail package on NPM.&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>logging</category>
      <category>tail</category>
    </item>
    <item>
      <title>Street Smarts: The Testing Pyramid</title>
      <dc:creator>Michael Streeter</dc:creator>
      <pubDate>Mon, 13 Jul 2020 14:26:05 +0000</pubDate>
      <link>https://forem.com/logdna/street-smarts-the-testing-pyramid-ah5</link>
      <guid>https://forem.com/logdna/street-smarts-the-testing-pyramid-ah5</guid>
      <description>&lt;p&gt;At my first developer position, I lost my company a $6 million/yr contract within my first six months. Not for lack of trying: I was burning at least 70-80 hours a week, leaving my office at midnight, working weekends, working through holidays, and pretty much living on coffee.&lt;/p&gt;

&lt;p&gt;Through all of that, I never once wrote an automated test. I didn't even know that was a thing you could do. So much of my time was spent just running the software, clicking on buttons, and waiting... waiting... waiting for results. It wasn't until I left that company, years later, that I discovered the importance of fast, reliable feedback in the form of well-written tests.&lt;/p&gt;

&lt;p&gt;All the time I spent manually testing could have been spent writing tests. To that end, the journey of writing useful tests needs a map. And thus, the Testing Pyramid shall be our guide.&lt;/p&gt;

&lt;h2&gt;The Testing Pyramid&lt;/h2&gt;

&lt;p&gt;The Testing Pyramid is not an original concept: a quick Google search will present the reader with a dozen varieties of it. Some pyramids have more layers, some fewer, and some combine layers together. This is what I see in my head when I consider the matter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KAKkz6d2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vuvvqaafxb53mk3e5aut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KAKkz6d2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vuvvqaafxb53mk3e5aut.png" alt="Streeter's view of the testing pyramid, a flat hierarchy of concepts that is wide at the base and narrow at the pinnacle, with a set of metrics demonstrating flow along the pyramid. The pyramid itself has this order from the pinnacle to the base: manual tests, end to end tests, functional tests, integration tests, and unit tests. The pyramid has four metrics to the left and right, showing how the metrics rank over the different layers of the pyramid. Value (If the test passes, how confident are we in the feature?) is high at the pinnacle of the pyramid and drops to low by the time you reach the base. Depth (If the test breaks, how close to the surface is the error?) is shallow at the base of the pyramid and gets deeper as you move up the pyramid. Speed (How fast is the feedback loop?) is fast at the base and gets slower the higher up on the pyramid. Expense (How difficult is it to create, maintain, and execute a test suite?) is cheap at the base of the pyramid and gets more expensive as you go up the pyramid toward the pinnacle."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The width of the layers is an indication of how many tests should be written (not to scale). Unit tests can range from hundreds to tens of thousands, while end-to-end (e2e) tests should probably top out at 1-5% of your unit test count.&lt;/p&gt;

&lt;p&gt;The height has four different metrics I consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Value:&lt;/em&gt; If a test &lt;strong&gt;passes&lt;/strong&gt;, how confident should we be in deploying to customers? High is better than low... obviously.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Depth:&lt;/em&gt; If a test &lt;strong&gt;fails&lt;/strong&gt;, how close to the "surface" of the tests is the failure likely located? Shallow is easier to debug than deep.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Speed:&lt;/em&gt; How &lt;strong&gt;fast&lt;/strong&gt; do these tests run? More importantly, how quickly do you receive feedback? Faster feedback is better.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Expense:&lt;/em&gt; How expensive are these tests to create and maintain? Cheaper is better.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's break down the layers and discuss them with these four metrics in mind.&lt;/p&gt;

&lt;h3&gt;Unit Tests&lt;/h3&gt;

&lt;p&gt;Unit tests are a fantastic tool and the very foundation of any application. If the code is a garden, then a unit test is a single flower examined under a glass case. While they have the lowest value individually, they are fast, cheap, and by far the easiest to debug when a failure occurs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Value:&lt;/em&gt; Low. A robust test suite could easily have hundreds or thousands of unit tests before a superior level of confidence is reached.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Depth:&lt;/em&gt; Shallow. A failure is usually isolated to a single method or file.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Speed:&lt;/em&gt; Fastest. In some languages, test runners can burn through over 1,000 unit tests a second. One Elixir project I contributed to had 20,000+ unit tests running in just under 8 seconds.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Expense:&lt;/em&gt; Cheap. Minus some boilerplate, a unit test is often a single line.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Integration Tests&lt;/h3&gt;

&lt;p&gt;Integration tests are the next step up the pyramid. Their value is increased by testing the connections between individual elements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Value:&lt;/em&gt; Medium. Knowing that internal elements are correctly wired together is just as important as knowing that the individual elements work in isolation.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Depth:&lt;/em&gt; Medium. A failure should be isolated internally to a project or repo.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Speed:&lt;/em&gt; Fast. The difference in speed usually comes from setup. If you've ever used Java and Spring together, the setup and teardown calls can add an extra second to each test file.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Expense:&lt;/em&gt; Kinda Cheap. Again, the setup for an integration test is usually a little more extensive. Changing a requirement can cause several test files to be refactored.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Functional Tests&lt;/h3&gt;

&lt;p&gt;A functional test should be interacting with your code from the outside. For front-end, that means using a browser and triggering clicks on specific elements. A good functional suite will still control the environment and data while also mocking out third-party services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Value:&lt;/em&gt; High. It's hard to argue with a well-written functional suite. Imitating how an outside force would interact given proper conditions has a high value of confidence.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Depth:&lt;/em&gt; High. Unfortunately, a failure could come from almost anywhere in your app. This makes investigating failures take much longer than with lower-level tests.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Speed:&lt;/em&gt; Slow. These tests often rely on browser interactions, CDNs, and turning caching off to be certain. Add in cross-browser testing, and you can significantly extend the time to completion. Depending on how many functional tests you have, a run could take 30 minutes to several hours, easily reducing the feedback loop to once a day.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Expense:&lt;/em&gt; Medium-High. A functional suite can quickly get out of hand. A small change to a UI element can force a lot of changes. With a poorly formed functional suite (a suite with many false positives or false negatives), teams can lose faith and build a resistance to maintaining or expanding it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;End-to-End (e2e)&lt;/h3&gt;

&lt;p&gt;An e2e test differs slightly from a functional one by adding third-party services into the mix. At this point, nothing is mocked out. Everything is live: no-holds-barred, bare-knuckle boxing at its finest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Value:&lt;/em&gt; Very High. This is as real as it gets in an automated sense. Not only can you trust your application's behavior but also the connections to outside actors.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Depth:&lt;/em&gt; Very High. Failures can be completely outside of your control. This can add a lot of time to investigation, especially if the third-party services are less reliable.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Speed:&lt;/em&gt; Slow - Slowest. When adding third-party services into the mix, there are often constraints on how often or when an e2e test suite can run. This could force a once-a-week or once-a-release run cycle, drastically reducing the feedback loop.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Expense:&lt;/em&gt; High. Same as functional.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Manual&lt;/h3&gt;

&lt;p&gt;Manual testing is a necessary evil. Value is high because, well, you literally tried it out! However, it's very limited. The cost is measured by engineering wages, and the procedure can differ wildly between individuals. Plus, it's completely manual!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Value:&lt;/em&gt; Very High. Seeing is believing.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Depth:&lt;/em&gt; Very High. Same as e2e. Failures could be anywhere.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Speed:&lt;/em&gt; Manual. The feedback loop is relatively fast for a single test, but the tester is locked into the process for the duration.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Expense:&lt;/em&gt; Highest. Monetarily, an engineer costs several times more per hour than an EC2 cluster. Predictability relies heavily on both good procedure documentation and good discipline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Cool! ... So what?&lt;/h2&gt;

&lt;p&gt;The Testing Pyramid is more of a guideline than an actual map. It doesn't dictate the number or ratio of tests, describe how to write them, or do much else. It's a mental model of what a robust suite of tests could look like, with a balance of value, depth, speed, and expense.&lt;/p&gt;

&lt;p&gt;I like to think it can help you answer a few questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I have no tests, what should I do?

&lt;ul&gt;
&lt;li&gt;  Like any good structure, start with a foundation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  All I have are unit tests, what now?

&lt;ul&gt;
&lt;li&gt;  Start building out a few tests with higher value.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  I spend all day manually testing, HALP!?

&lt;ul&gt;
&lt;li&gt;  You're too top-heavy! Automate those manual tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get the picture.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Testing is part of the developer craft I wish I had learned much earlier in my career. Would it have saved me from my first big blunder? Probably not. There were plenty of other factors that contributed to that. It could have drastically changed how I received feedback during development, though, which would have been nice. I hope this explanation helps you start to understand the Testing Pyramid as I see it.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Postmortem of Incident on 08 June 2020</title>
      <dc:creator>Sven Delmas</dc:creator>
      <pubDate>Thu, 09 Jul 2020 23:18:40 +0000</pubDate>
      <link>https://forem.com/logdna/postmortem-of-incident-on-08-june-2020-hnm</link>
      <guid>https://forem.com/logdna/postmortem-of-incident-on-08-june-2020-hnm</guid>
      <description>&lt;p&gt;We encountered four interwoven bugs that caused a degradation of service in one of our production instances. As the bugs have particularly interesting implications and some significant lessons learned, we thought it would be prudent to explain the situation to the community to help the next person who may encounter one of these problems. The majority of the problems we encountered were external in nature, and we’re working toward mitigating the impact of future external problems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All times listed are in UTC.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On 08 June 2020 at 10:35, our on-call SRE received an alert that our production instance needed attention. Investigation began immediately.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What Happened&lt;/h2&gt;

&lt;p&gt;When addressing an unusual problem with a spike in incoming data, we discovered four distinct but interconnected bugs that compounded the effects of the data spike. Our Elasticsearch clusters were exhibiting high thread pool counts that exacerbated the impact of the data spike, and when we attempted to increase our hardware allocation, we ran into three external bugs with our disks that triggered index corruption issues, further compounding the problem. While identifying mitigations and workarounds for two of the external bugs, we figured out a way to adjust our system to handle data spikes better and address the thread pool count in the process.&lt;/p&gt;

&lt;h3&gt;Detailed Times (in UTC)&lt;/h3&gt;

&lt;dl&gt;
&lt;dt&gt;2020-06-08 ~07:00&lt;/dt&gt;
&lt;dd&gt;A customer’s incoming data started spiking. This spike didn’t start impacting other systems until our first alert at 10:35.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-08 10:35&lt;/dt&gt;
&lt;dd&gt;We received an alert, which kicked off our incident management process.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-08 11:31 - 2020-06-09 14:58&lt;/dt&gt;
&lt;dd&gt;We found our first indication of a “noisy neighbor” problem: A customer’s spike in incoming data was so large that other clusters were affected as the load was balanced. Our typical mitigation strategy wasn’t working very well, and we started investigating why. One issue we identified was that our prioritization strategies were not working properly.&lt;br&gt;&lt;br&gt;
We were also getting a number of noisy alerts about pod numbers that briefly sent us down the wrong path.&lt;br&gt;&lt;br&gt;
Elasticsearch high thread pool issues arose that were not like any we had seen before. We continued to mitigate with manual intervention to keep the system running while we tried to identify the underlying issue.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-09 15:30 - 2020-06-10 02:00&lt;/dt&gt;
&lt;dd&gt;We started noticing Elasticsearch pods on various clusters had high thread pools and were constantly restarting. We attempted to fix it by clearing the Elasticsearch cache and adjusting how different components could request CPU resources to free up those resources for Elasticsearch. Neither approach worked, and we continued to manage the environment manually while searching for a cause.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-10 05:00 - 16:45&lt;/dt&gt;
&lt;dd&gt;We provisioned new hardware for our backend systems to scale up and manage the load better. While setting up the new hardware, we continued to manage the environment and kept hitting the threadpool problem.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-10 16:45 - 2020-06-11 04:45&lt;/dt&gt;
&lt;dd&gt;When scaling up the new hardware, we discovered a bug, which we’ll call here a group label bug, in a third-party provider that blocked connections from Elasticsearch to storage volumes provisioned on the new hardware. We engaged with the third-party vendor and began to work around the problem manually.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 04:45 - 09:30&lt;/dt&gt;
&lt;dd&gt;While still working around the group label bug, we observed Elasticsearch pod restarts all around the clusters, seemingly at random, making the Elasticsearch recovery even more difficult. We continued engaging with the third-party vendor while managing the environment as best we could.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 09:30&lt;/dt&gt;
&lt;dd&gt;We identified patterns in the pod restarts. The restarts did not only affect Elasticsearch pods; all disk-backed pods on one hardware node would restart simultaneously, then another random node would proceed through the same restart cycle. This was our first suspicion of a second bug related to the third-party provider.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 09:35&lt;/dt&gt;
&lt;dd&gt;We confirmed that we had encountered a bug with Stork, which our third-party storage provider uses for orchestration. A Stork healthcheck, which ensures the storage driver is running on a node, was killing disk-backed pods on any node that did not seem to be running that driver. Despite these nodes actually having Portworx (our storage driver) running, the healthcheck didn’t identify the running instance and therefore killed all of the pods on the node. We disabled the healthchecks to stop the pods from crashing while the core devs from the Stork project started digging.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 13:00 - 18:15&lt;/dt&gt;
&lt;dd&gt;We continued to manage the environment as it started to stabilize a bit after the healthcheck bug was identified and mitigated, and we continued to work around the group label bug manually to scale up. We still were seeing high thread pool issues, which slowed down Elasticsearch. We decided to lock all Elasticsearch indices from past days and stop shard assignment from all days other than the current day.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 19:15 - 20:00&lt;/dt&gt;
&lt;dd&gt;We disabled thin provisioning on nodes, which is enabled by default, to manage our resources in a more controlled fashion.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 22:30 - 2020-06-12 02:00&lt;/dt&gt;
&lt;dd&gt;We figured out there was a strong correlation between high merge counts and high thread pool queue in our Elasticsearch pods. As we continued to try to identify why we were dealing with high thread pool issues, we continued to manage the environment.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 02:21 - 04:50&lt;/dt&gt;
&lt;dd&gt;While trying to manage the environment, we observed a sudden drop in our deferred backlog and realized that some of the batches in the deferred queue hit our 18-hour retention limit, causing them to get deleted from disk. We immediately worked on the data loss and engaged with our Elasticsearch experts to understand how we might mitigate the data loss.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 05:00 - 08:00&lt;/dt&gt;
&lt;dd&gt;We hit a setback with the group label bug where nodes rebooted and reset back to the state where the bug prevented volume mounting, so we had to re-decommission the nodes. Meanwhile, we worked with our Elasticsearch experts to handle the thread pool issue while continuing to manage the environment and investigate the data loss.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 08:32&lt;/dt&gt;
&lt;dd&gt;We believed we had gotten the situation under control and updated the status page accordingly.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 09:30 - 12:15&lt;/dt&gt;
&lt;dd&gt;While monitoring the environment (the incident remained open until all Elasticsearch clusters had fully recovered), the on-call SRE noticed some Elasticsearch pods living on the same hardware node crash-looping with an Elasticsearch error about index corruption. This set of errors kicked off more investigation and another round of environment management.&lt;br&gt;&lt;br&gt;
We discovered that one of the nodes was missing primaries on a volume that showed corruption errors. As a result, we engaged with our third-party vendor to see if we were facing disk corruption.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 13:00 - 18:00&lt;/dt&gt;
&lt;dd&gt;We thought we had encountered an NVME bug that corrupted the data and started digging to confirm or refute that theory. By 15:37, we were able to confirm that the NVME bug was inapplicable to our environment and therefore a red herring. Meanwhile, we continued to manage the environment due to the initial noisy neighbor hitting again and the continual thread pool issue, and we continued all of the previous investigations with the assistance of our Elasticsearch experts.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 18:30 - 20:15&lt;/dt&gt;
&lt;dd&gt;We found stale primaries for all of the remaining unassigned shards and started allocating them in an attempt to retrieve lost data. The nodes holding the stale primaries had not been touched before the data corruption began, so they were less likely to have corruption.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 02:10&lt;/dt&gt;
&lt;dd&gt;We found the same index corruption error on a different hardware node and its Elasticsearch pods.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 ~02:20 - 03:00&lt;/dt&gt;
&lt;dd&gt;We started provisioning more nodes to handle increased load and allocated more stale primaries on certain clusters to recover data. Some customers on these clusters experienced degraded search service while this action was performed.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 03:15&lt;/dt&gt;
&lt;dd&gt;We discovered one corrupted Elasticsearch pod that had a few primary shards from unreplicated indices, so there was no replica shard available to use to recover lost data.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 03:25 - 04:50&lt;/dt&gt;
&lt;dd&gt;We released a set of hotfixes to handle this new version of the noisy neighbor problem. We immediately began seeing significant improvement in the clusters, which was later attributed to the hotfix.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 05:31 - 05:49&lt;/dt&gt;
&lt;dd&gt;The new nodes were ready and bootstrapping began.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 06:07&lt;/dt&gt;
&lt;dd&gt;We again believed we had the situation under control and updated the status page accordingly.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 06:38&lt;/dt&gt;
&lt;dd&gt;A final hotfix was released to handle the last of the edge cases.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 10:57 - onward&lt;/dt&gt;
&lt;dd&gt;We shifted from actively managing the environment to monitoring the incident, as it appeared the last of the hotfixes had worked. We moved toward a stronger focus on recovery.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 18:18&lt;/dt&gt;
&lt;dd&gt;A final update to the status page was added to mark this incident as resolved.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 21:19&lt;/dt&gt;
&lt;dd&gt;We ran into a similar thread pool issue, but with the new hotfix in place it didn’t seem to have the same effect as before.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 21:56&lt;/dt&gt;
&lt;dd&gt;The oncall SRE observed that the increased speed from the hotfix caused a change in our deferred batch processing. As the problem generally seemed to resolve itself quickly, the bug was considered low severity and was relegated to the development backlog.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-15 15:56 - 17:14&lt;/dt&gt;
&lt;dd&gt;After some further monitoring during a particularly busy period that was similar to the trigger of the initial event, we decided the hotfix was holding and the incident was fully resolved in our incident management process.&lt;/dd&gt;
&lt;br&gt;
&lt;/dl&gt;

&lt;h2&gt;
  
  
  Key Factors
&lt;/h2&gt;

&lt;p&gt;First, a quick general note. Elasticsearch is built on the Apache Lucene platform,&lt;sup id="fnref1"&gt;1&lt;/sup&gt; and much of the following discussion refers to the underlying Lucene processes and how they affected our Elasticsearch clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  The “Noisy Neighbor” Problem
&lt;/h3&gt;

&lt;p&gt;In the past, we handled the situation of a single customer generating a data spike, which forces a shift of resources to absorb it, through a semi-automated system. Normally, such a spike would be a small blip on the radar at most, requiring minimal intervention to offset. While this factor was mainly the trigger that kicked off the incident, it led us to add a permanent, code-based solution to our architecture that reduces the need for manual intervention in similar future cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Merge Threads Problem
&lt;/h3&gt;

&lt;p&gt;Lucene has the concept of an index, which is essentially a specialized data store that holds documents: objects that hold data in a standardized format. These indices are further broken down into immutable segments, or sub-indices, which cannot be changed once created. As more data is added to the overall index and more segments are generated, Lucene merges these segments together to create larger segments.&lt;sup id="fnref2"&gt;2&lt;/sup&gt; Elasticsearch’s shards are essentially Lucene indices, so Elasticsearch shards also have this merge function. Elasticsearch also uses the thread pool design pattern throughout its architecture to manage functions like bulk insertion of data into shards and flushing indices to disk.&lt;/p&gt;

&lt;p&gt;In our case, we were constantly hitting a high thread pool count while data was being written, which caused significant lag times. Generally, a high thread pool count is related to the noisy neighbor problem, with significant spikes of data incoming to the system. Here, the coupling of the noisy neighbor problem with the external factors we were dealing with made the issue appear in an unusual form. The fact that the hotfix reduced the impact of high thread pool counts on our architecture bolsters this argument.&lt;/p&gt;
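&lt;p&gt;As a side note for anyone monitoring for this symptom: Elasticsearch exposes per-node thread pool activity through the &lt;code&gt;_cat/thread_pool&lt;/code&gt; API. Below is a minimal sketch of watching that output for backed-up queues; the node names, numbers, and threshold are hypothetical, not values from our environment.&lt;/p&gt;

```python
# Hedged sketch: flag Elasticsearch nodes whose thread pools are backing up,
# based on the text output of
#   GET _cat/thread_pool/write,force_merge?h=node_name,name,active,queue,rejected
# The sample payload below is hypothetical.

SAMPLE = """\
es-data-0 write 8 240 17
es-data-0 force_merge 1 0 0
es-data-1 write 2 0 0
es-data-1 force_merge 1 3 0
"""

def flag_backed_up(cat_output, queue_threshold=100):
    """Return (node, pool, queue) tuples whose queue depth exceeds the threshold."""
    flagged = []
    for line in cat_output.strip().splitlines():
        node, pool, active, queue, rejected = line.split()
        if int(queue) > queue_threshold:
            flagged.append((node, pool, int(queue)))
    return flagged

print(flag_backed_up(SAMPLE))  # [('es-data-0', 'write', 240)]
```

&lt;p&gt;A queue that stays high across successive polls is the kind of signal that points at the write lag described above.&lt;/p&gt;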

&lt;p&gt;Incidentally, the thread pool spike could also account for the initial impact on search when the data spike occurred. High thread pool counts on merge can be a symptom of a large number of segments being created all at once as large amounts of data are added to the index, and that rapid turnover of memory cache can cause invalidation of search results and therefore degradation of search availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  External Concerns
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Index Corruption and Storage Timeouts
&lt;/h4&gt;

&lt;p&gt;After some investigation, the index corruption issue we saw on our longer-storage systems seemed to stem from a connection timeout to our third-party storage system, and this symptom does not seem to have been isolated to our systems.&lt;sup id="fnref3"&gt;3&lt;/sup&gt; The underlying cause was out of our scope, but the technical recovery process is worth talking about separately.&lt;/p&gt;

&lt;p&gt;Elasticsearch duplicates data across a cluster, creating primary and secondary (replica) shards on different nodes.&lt;sup id="fnref4"&gt;4&lt;/sup&gt; The source of truth for a document in the cluster is the primary shard. Generally, whatever happens on the primary shard is then replicated to the secondary shards, and new secondaries can be created or destroyed at any time. In our case, the secondaries are created over time, leaving a trail of “stale” copies of the data until those secondary shards are eventually assigned and wiped. The newer secondaries had the same data corruption issues as the failed primaries. However, the stale secondaries held the majority of the data from the primary shard without the corruption problem, and they luckily had not yet been wiped. We were able to manually reassign the secondaries as primaries, buffering them to memory and then flushing them to disk. In doing so, we were able to recover nearly all of the missing data from the corrupted indices.&lt;/p&gt;
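&lt;p&gt;For the curious, this manual reassignment maps onto Elasticsearch’s &lt;code&gt;_cluster/reroute&lt;/code&gt; API and its &lt;code&gt;allocate_stale_primary&lt;/code&gt; command. Here is a sketch of building that request body; the index, shard, and node names are hypothetical.&lt;/p&gt;

```python
def allocate_stale_primary(index, shard, node):
    """Build a _cluster/reroute request body that promotes a stale shard
    copy to primary. accept_data_loss must be true because the stale copy
    may be missing recent writes; Elasticsearch refuses the command otherwise."""
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }

# Hypothetical names; POST the result to /_cluster/reroute.
body = allocate_stale_primary("logs-2020.06.12", 3, "es-data-7")
print(body["commands"][0]["allocate_stale_primary"]["node"])  # es-data-7
```

&lt;p&gt;The &lt;code&gt;accept_data_loss&lt;/code&gt; flag is required because Elasticsearch cannot guarantee the stale copy holds every acknowledged write.&lt;/p&gt;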

&lt;h4&gt;
  
  
  Other External Bugs
&lt;/h4&gt;

&lt;p&gt;Two other factors involved external sources where we ran into unique bugs that have since been reported and solved by third parties. As these bugs belong to other sources, we won’t spend too much time explaining them here. One bug involved labeling of groups of volumes in Kubernetes-based clusters.&lt;sup id="fnref5"&gt;5&lt;/sup&gt; The other, as noted, was related to the Stork health check call. The health check failed to identify which storage driver was running, or that the storage driver was still booting, which caused all of the disk-backed pods to be ejected from a node. For more information on this last bug, we refer you to the &lt;a href="https://github.com/libopenstorage/stork/pull/646"&gt;relevant pull request&lt;/a&gt; on the open-source project.&lt;/p&gt;

&lt;p&gt;In a larger sense, one of the hallmarks of a mature platform is the ability to mitigate external factors swiftly with minimal impact on the customer’s experience. That hallmark is why the industry balances loads across datacenters and encourages rolling updates and restarts across microservices architectures, as just two examples. The aim of mitigation strategies is to handle issues gracefully, such as seamlessly transitioning a user’s session to a different datacenter if the first datacenter’s connectivity falters or preserving a known working function if an update to an external dependency breaks part of an application’s functionality. No one expects that a third party’s systems work all of the time. Instead, we all work off of service-level agreements and uptime statistics and then attempt to have worst-case scenarios planned and ready should something we rely on go down. And we find solid third-party partners that react swiftly to reports that something isn’t working properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;We are continuing to monitor and tweak the hotfix we released to production that seems to have addressed our “noisy neighbor” problem once and for all. More optimization will certainly be needed, simply because it was a hotfix released at speed to address a pressing issue in production. Examples of this optimization include the following potential investigations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improving our cluster balance based on the new metrics around data spikes to ensure no single customer can take over all of the processing power available in a cluster when encountering a spike in incoming data.&lt;/li&gt;
&lt;li&gt;Decoupling our clusters by ensuring our worker pools pause processing on problematic clusters while maintaining throughput on all other clusters.&lt;/li&gt;
&lt;li&gt;Revising our auto-resume functionality, which automatically attempts to resume problematic clusters with exponential backoff. In this incident, it actually compounded the issue, as work from the slow clusters leaked into the worker pool and dragged down total worker throughput. For now, the SRE team will determine when to pause and resume clusters, and we may modify our data pipeline to give the SRE team finer throttling tools for spikes in data.&lt;/li&gt;
&lt;/ul&gt;
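&lt;p&gt;For context, the auto-resume functionality mentioned above follows the standard exponential backoff pattern, which can be sketched as follows; the base, cap, and jitter choices here are illustrative rather than our production settings.&lt;/p&gt;

```python
import random

def backoff_delays(base=1.0, cap=300.0, attempts=6):
    """Yield exponentially growing delays, capped, with full jitter.

    Full jitter (a random value between 0 and the capped delay) keeps
    many paused clusters from all retrying at the same instant.
    """
    for attempt in range(attempts):
        capped = min(cap, base * (2 ** attempt))
        yield random.uniform(0, capped)

# Each delay is bounded by min(cap, base * 2**attempt).
delays = list(backoff_delays())
```

&lt;p&gt;Even with jitter, though, backoff alone didn’t prevent slow-cluster work from leaking into the shared worker pool, which is why we moved the pause/resume decision to the SRE team.&lt;/p&gt;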

&lt;p&gt;Regarding the external factors that complicated this incident, we were generally lucky to get quick engagement from our third-party partners. We have some new internal mitigation techniques to standardize in case we run into similar issues in the future:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we are working to ensure that we are running the same version of our storage driver in every environment. Due to code freezes requested by various business partners and a transition to a different, more robust storage driver, our environments drifted out of sync, and we did not have a set way to bring everything up to date in a safe, controlled manner. We are standardizing that catch-up process and adding testing capacity to ensure that whatever process we land on does not introduce further errors. We are also moving everything to the more robust storage driver, which should address some of the errors we have encountered in other incidents.&lt;/li&gt;
&lt;li&gt;We are adding additional testing capacity to our post-upgrade and post-deployment processes, such as canary processes, to reduce the potential for uncaught or unknown bugs to affect our production environments. This addition will help in the case of unidentified kernel bugs, for example, so we can immediately mitigate and roll back as necessary.&lt;/li&gt;
&lt;li&gt;We are adding additional checks and tooling to ensure that no shard is unreplicated in any of our environments to prevent data loss as we experienced here. While we may still run into similar index corruption issues in the future, replication and better backups will mitigate that external factor. We’re also standardizing the manual secondaries assignment to replicate the process in any necessary future incident.&lt;/li&gt;
&lt;li&gt;We are standardizing the procedure generated during the incident for managing the high thread pool issue in Elasticsearch. By writing up and standardizing this playbook, we aim to ensure anyone on the team can replicate the process efficiently and accurately, addressing issues as soon as they arise rather than letting them build up while we relearn how to manage the issue.&lt;/li&gt;
&lt;li&gt;Finally, we are improving our tooling to ensure faster transparency with our customers when we do encounter incidents that affect them. We aim to ensure that customers who are immediately affected receive faster communications and that our customer support and success teams have better information.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap Up
&lt;/h2&gt;

&lt;p&gt;While this incident was largely compounded by external factors, we learned more about how to mitigate those factors and addressed some underlying infrastructure issues that the mitigation attempts uncovered. Kudos go to our partners who stepped in immediately to help understand the problem and work on their respective ends to address the underlying issues we encountered. We also were pleased that our new in-app notification system for customers worked so well in this first test of the initial action items from our last postmortem.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;See &lt;a href="https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up"&gt;https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up&lt;/a&gt; for a still-relevant, solid dive into how Elasticsearch is built on Lucene. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;See &lt;a href="http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html"&gt;http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;See &lt;a href="https://forums.portworx.com/t/failed-to-mount-volumes-in-pod-http-error-404/445"&gt;https://forums.portworx.com/t/failed-to-mount-volumes-in-pod-http-error-404/445&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html"&gt;https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://docs.portworx.com/reference/release-notes/portworx/#2-5-0-2"&gt;https://docs.portworx.com/reference/release-notes/portworx/#2-5-0-2&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>postmortem</category>
    </item>
    <item>
      <title>Serverless Logging Performance, Part 2</title>
      <dc:creator>Laura Santamaria</dc:creator>
      <pubDate>Fri, 26 Jun 2020 16:33:39 +0000</pubDate>
      <link>https://forem.com/logdna/serverless-logging-performance-part-2-laj</link>
      <guid>https://forem.com/logdna/serverless-logging-performance-part-2-laj</guid>
      <description>&lt;p&gt;&lt;em&gt;When thinking about serverless applications, one thing that comes to mind immediately is efficiency. Running code that gets the job done as swiftly and efficiently as possible means you spend less money, which means good coding practices suddenly directly impact your bottom line. How does logging play into this, though? Every logging action your application takes is within the scope of that same performance evaluation. Logging processes can also be optimized just like for any process your code spins up. In this series of posts, let’s dive into how you can think about logging in a serverless world.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In part 1, we examined the concerns around cold starts and message construction. Now, we’ll talk about how logging objects (structured logs) instead of text can affect the cost of your serverless architecture.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Performance, Episode 2
&lt;/h2&gt;

&lt;p&gt;Best practices for logging currently focus on structured log objects rather than text-based messages, mainly because machines, rather than humans, are the primary audience for logs in our world of automation. Serverless logging is no different, and structure is even more important there than in other systems, considering the sheer volume of data a serverless architecture generates over a similar amount of time and the fact that the majority of serverless management tools rely on parsing log data to function. Alerts, monitoring, performance metrics, and even triggers for other actions all need to check logs and work off of them to help you manage your serverless systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured Logs
&lt;/h3&gt;

&lt;p&gt;First, what are structured logs? Structured logs are data-enriched objects that enable machines to parse out relevant information without relying on regular expressions or other text processing. Think of a structured log as a text-based message (I’ve been calling them strings, but I know that the term can mean a lot of different things depending on the programming language) with associated metadata attached. The text-based message is for the human poking at the logs. The metadata, though, is really for the machine, and having that metadata ensures a machine will accurately identify a type of log line 100% of the time, assuming the incoming data is accurate.&lt;/p&gt;

&lt;p&gt;If you use only text-based logging, you run the risk of two messages being so similar that a regex-based parsing solution flags them both as the same type when they are completely different. In addition, the extra processing time needed to parse a text-based message increases the cost associated with a serverless system. We’ve already seen that the time needed to run a serverless call correlates fairly well with the cost of running that call, so we know that time is money here. However, is a structured log really more performant, and therefore more cost-effective, in a serverless world? To answer that, let’s revisit the Python example from the previous article.&lt;/p&gt;
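&lt;p&gt;To make the contrast concrete before we benchmark, here’s a minimal sketch of the same event logged as text and as a structured object, using only Python’s standard library; the field names are illustrative.&lt;/p&gt;

```python
import json
import logging

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

# Text-based: a machine has to regex its way back to the fields.
logger.info("user %s uploaded %d bytes", "u-123", 4096)

# Structured: serialize the fields alongside the human-readable message.
event = {"message": "upload complete", "user": "u-123", "bytes": 4096}
logger.info(json.dumps(event))

# A consumer can now recover the fields losslessly:
parsed = json.loads(json.dumps(event))
print(parsed["bytes"])  # 4096
```

&lt;p&gt;The structured version pays for a serialization call up front but spares every downstream consumer the regex guesswork.&lt;/p&gt;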

&lt;h3&gt;
  
  
  Construction, Revisited
&lt;/h3&gt;

&lt;p&gt;For structured logging with Python, there are a lot of small libraries on PyPI we could explore. I chose to work with the standard logging library and a library called structlog. I ran a benchmark similar to the one from last round, using a few different configurations of structured logging:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K-j3v6tl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vgmbgs45fnksbe3zt8y0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K-j3v6tl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vgmbgs45fnksbe3zt8y0.png" alt="Screenshot of terminal showing the performance run; numbers are duplicated in the article text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;With 1 Variable (microsec)&lt;/th&gt;
&lt;th&gt;With 2 Variables (microsec)&lt;/th&gt;
&lt;th&gt;With Multiple Variables (microsec)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;text, fastest string&lt;/td&gt;
&lt;td&gt;23.2711&lt;/td&gt;
&lt;td&gt;23.6432&lt;/td&gt;
&lt;td&gt;25.7583&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structured, built-in&lt;/td&gt;
&lt;td&gt;31.3718&lt;/td&gt;
&lt;td&gt;33.7994&lt;/td&gt;
&lt;td&gt;49.9785&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structured, structlog&lt;/td&gt;
&lt;td&gt;36.3576&lt;/td&gt;
&lt;td&gt;40.0466&lt;/td&gt;
&lt;td&gt;60.9309&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structured, structlog performance run&lt;/td&gt;
&lt;td&gt;12.8722&lt;/td&gt;
&lt;td&gt;13.1769&lt;/td&gt;
&lt;td&gt;20.8687&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Context&lt;/th&gt;
&lt;th&gt;Time Ran&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;real&lt;/td&gt;
&lt;td&gt;3m23.373s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;user&lt;/td&gt;
&lt;td&gt;2m55.168s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sys&lt;/td&gt;
&lt;td&gt;0m12.598s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this case, the various methods are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;text, fastest string&lt;/code&gt; to provide a sort of control. This method was the %-formatting direct call from last round.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;structured, built-in&lt;/code&gt; is using the built-in standard logging library and the method found in the logging cookbook for generating a structured logging setup&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;sup&gt;,&lt;/sup&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;structured, structlog&lt;/code&gt; is using the structlog library with a default setup.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;structured, structlog performance run&lt;/code&gt; is using the structlog library with a performance tune based on the structlog docs.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m adding in the time to run the full script so you can see how long I ran the benchmark; the timing was generated by running &lt;code&gt;time&lt;/code&gt; at the terminal. This addition is to ensure I’m running the script over a long enough period of time to account for system differences (e.g., silicon and hardware differences, I/O differences in tuning, OS processes).&lt;/p&gt;

&lt;p&gt;You’ll notice that, believe it or not, the performance tune of structlog is significantly faster than even the fastest text-based logging method we had. That in and of itself is remarkable, especially considering the library still calls the standard library in its configuration. The likely explanation for this performance is twofold. First, the structlog library has a configuration option called &lt;code&gt;cache_logger_on_first_use&lt;/code&gt; that reduces build time. This choice is very similar to the performance boost you receive when you reduce your cold start times for the overall application. Second, the JSON serializer in the standard library is not all that fast compared to other libraries out there.&lt;sup id="fnref4"&gt;4&lt;/sup&gt; The structlog docs encourage using a different JSON serializer. As I’m running this test on 3.7.4, I have access to RapidJSON’s port from C++ to Python and therefore switched the serializer to RapidJSON. These two boosts together should explain the significantly faster run time.&lt;/p&gt;
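&lt;p&gt;Both boosts can be approximated with the standard library alone. This is a rough sketch of the pattern, not structlog’s actual implementation: cache the logger so it’s built only once, and make the serializer pluggable so a faster one (like RapidJSON’s &lt;code&gt;dumps&lt;/code&gt;) can be swapped in.&lt;/p&gt;

```python
import json
import logging
from functools import lru_cache

# Pluggable serializer: point this at rapidjson.dumps where available,
# mirroring structlog's JSONRenderer(serializer=...) option.
serialize = json.dumps

@lru_cache(maxsize=None)
def get_logger(name):
    """Build and configure the logger once per name; later calls hit the
    cache, mimicking structlog's cache_logger_on_first_use=True."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    return logger

log = get_logger("app")
log.info(serialize({"event": "request", "status": 200}))
assert get_logger("app") is log  # cached, not rebuilt
```

&lt;p&gt;The caching matters most in a serverless context, where the same warm instance may serve many invocations.&lt;/p&gt;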

&lt;h3&gt;
  
  
  The Meaning of Speed
&lt;/h3&gt;

&lt;p&gt;To go back to why this matters specifically for serverless systems, we need to consider why structured logs are preferred there. Automation triggered by events captured in logs needs the consistency of structured logging. Most folks assume that generating a string is faster than generating an entire object that includes metadata, though, and therefore prefer generating the string. As we’ve seen here, however, generating structured logs in a specifically tuned manner can still give you the utility you need with the cost savings that you want.&lt;/p&gt;

&lt;p&gt;However, throughout the start of this series, we’ve only been considering speed. Another part of the equation that we haven’t considered is networking throughput. When we start talking about the differences in text-based logs and structured logs, we have to consider how serverless providers charge for network traffic. In the next few parts of the series, we’ll examine that angle along with understanding how much memory is required to build logging for your serverless application.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://docs.python.org/3.7/howto/logging-cookbook.html#implementing-structured-logging"&gt;https://docs.python.org/3.7/howto/logging-cookbook.html#implementing-structured-logging&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Note that I used a slightly different call than the one in the most current docs as I’m running 3.7.4 for this series. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://www.structlog.org/en/stable/performance.html"&gt;https://www.structlog.org/en/stable/performance.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html#serialization"&gt;https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html#serialization&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>serverless</category>
      <category>logging</category>
      <category>devops</category>
    </item>
    <item>
      <title>Serverless Logging Performance, Part 1</title>
      <dc:creator>Laura Santamaria</dc:creator>
      <pubDate>Mon, 15 Jun 2020 15:11:28 +0000</pubDate>
      <link>https://forem.com/logdna/serverless-logging-performance-part-1-3ifp</link>
      <guid>https://forem.com/logdna/serverless-logging-performance-part-1-3ifp</guid>
      <description>&lt;p&gt;&lt;em&gt;When thinking about serverless applications, one thing that comes to mind immediately is efficiency. Running code that gets the job done as swiftly and efficiently as possible means you spend less money, which means good coding practices suddenly directly impact your bottom line. How does logging play into this, though? Every logging action your application takes is within the scope of that same performance evaluation. Logging processes can also be optimized just like for any process your code spins up. In this series of posts, let's dive into how you can think about logging in a serverless world.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Performance, Episode 1
&lt;/h2&gt;

&lt;p&gt;I know that one of the benefits of serverless architectures is not managing your own hardware, but you still need to consider the hardware your code requires when thinking about managing cost. Most serverless providers charge by some combination of the amount of memory and the unit of time needed by your function or application. So, in a significantly simplified pricing model covering two offerings from each provider, just for argument's sake, AWS charges per request and byte-second of memory (Lambda) or per vCPU-second and byte-second of memory (Fargate), whereas GCP charges per vCPU-second (Cloud Run) or per invocation (Cloud Functions). There's a lot more involved in pricing structures for these as-a-Service systems, including the cost for network ingress or egress, storage, data stores, and analytics. However, we're just going to examine the actual compute power here for a deep dive on logging.&lt;/p&gt;

&lt;p&gt;Knowing how much compute power you'll need for your code drives the decision for different tiers, and therefore performance equals money. Your choices around how to log data can affect those performance metrics. Let's start with examining the first time your code spins up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold Start Considerations
&lt;/h3&gt;

&lt;p&gt;One of the biggest concerns with serverless performance and efficiency is the cold start, or the first spin-up of an instance when there's no warm (recent) instance available. During the start of a serverless system, the service brings a container online, and then the container starts running your specific system. At the point where the container starts to run your unique instance, the meter starts running. The first thing the container does on a cold start is a bootstrapping process, installing any dependencies you expect to need. If you've ever watched a Docker container image being built, you know that this process can run on the scale of minutes. Then the container's environment is set, and your code begins to run. On a warm start, on the other hand, the environment comes preset, and your code begins running immediately.&lt;/p&gt;

&lt;p&gt;As you might guess, the bootstrapping process in a cold start can be expensive. This need for efficient setup is where dependency management becomes a not-so-secret weapon for lowering the cost of a serverless system. Your logging library of choice is no exception. Since setting up dependencies takes compute time, you have to factor that time into your performance calculations. As a result, the ideal state for logging in a serverless architecture is to use built-in methods wherever possible. That means using Python's built-in logging library, for example, to reduce calls to PyPI. If you can't use a built-in library, such as when you're running a Node.js function and need more than just &lt;code&gt;console.log()&lt;/code&gt;, the best logging library for you is the one that balances the features you want with the fewest dependencies needed to make those features happen.&lt;/p&gt;
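&lt;p&gt;One concrete habit that follows from this: do your setup work, including logger construction, at module scope so warm starts reuse it. Here's a rough sketch with a generic handler; the handler signature just mimics common FaaS conventions and isn't tied to any one provider.&lt;/p&gt;

```python
import logging

# Module scope: runs once per cold start, then is reused by warm invocations.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("fn")

def handler(event, context=None):
    """Per-invocation work only; no setup cost on warm starts."""
    logger.info("invoked with %d keys", len(event))
    return {"ok": True}

print(handler({"a": 1}))  # {'ok': True}
```

&lt;p&gt;On a cold start, the module-level lines run once; on every warm invocation, only the handler body runs.&lt;/p&gt;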

&lt;h3&gt;
  
  
  Construction
&lt;/h3&gt;

&lt;p&gt;You should examine how you are constructing logging messages and objects. If there is upfront processing going on when constructing a message or an object, consider offloading that processing time to see if you can reduce usage. To go back to Python as an example, consider this simple benchmark set&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, run from Python 3.7.4:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;With 1 String (microsec)&lt;/th&gt;
&lt;th&gt;With String and Integer (microsec)&lt;/th&gt;
&lt;th&gt;With Multiple Inputs (microsec)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;%-format&lt;/td&gt;
&lt;td&gt;19.376829&lt;/td&gt;
&lt;td&gt;19.629766&lt;/td&gt;
&lt;td&gt;21.335113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;%-format, direct call&lt;/td&gt;
&lt;td&gt;20.347689&lt;/td&gt;
&lt;td&gt;19.808525&lt;/td&gt;
&lt;td&gt;20.626333&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;str.format()&lt;/td&gt;
&lt;td&gt;20.324141&lt;/td&gt;
&lt;td&gt;20.329610&lt;/td&gt;
&lt;td&gt;21.905827&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;f-string&lt;/td&gt;
&lt;td&gt;19.461603&lt;/td&gt;
&lt;td&gt;19.875766&lt;/td&gt;
&lt;td&gt;20.073513&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;concatenation&lt;/td&gt;
&lt;td&gt;18.930094&lt;/td&gt;
&lt;td&gt;19.837633&lt;/td&gt;
&lt;td&gt;24.505294&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;direct log, as variable&lt;/td&gt;
&lt;td&gt;21.637833&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;direct log, direct insertion&lt;/td&gt;
&lt;td&gt;21.816895&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;direct log, as list&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;22.872546&lt;/td&gt;
&lt;td&gt;23.445436&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These &lt;code&gt;timeit&lt;/code&gt; results came from using different string generation methods and putting them into a logger to drop a human-readable message into the logs, adding complexity as we went. The methods in the benchmark are&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the original %-formatting method generating a string and then passing that to the logger,&lt;/li&gt;
&lt;li&gt;a direct method passing a format string with %-formatting and then the variables to the logger,&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;str.format()&lt;/code&gt; method generating a string and then passing that to the logger,&lt;/li&gt;
&lt;li&gt;the f-string formatting method generating a string and then passing that to the logger,&lt;/li&gt;
&lt;li&gt;a generic string concatenation method generating a string and then passing that to the logger,&lt;/li&gt;
&lt;li&gt;direct pass of a variable pointing to a string to the logger to generate the message internally,&lt;/li&gt;
&lt;li&gt;direct pass of a string to the logger to generate the message internally, and&lt;/li&gt;
&lt;li&gt;direct pass of a list to the logger to add to the message at the end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we add more integers and strings, the complexity of each call rises, and we start to encounter the different costs associated with each string conversion call. The concatenation method, for example, requires a call to the &lt;code&gt;str()&lt;/code&gt; method to convert the integer into a string so it can be concatenated. You’ll notice that, depending on need, different methods are more efficient than others. Under the hood, Python’s logging library uses %-formatting, the oldest method, to ensure backwards compatibility.&lt;/p&gt;
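&lt;p&gt;To make the comparison concrete, here is a minimal sketch of the message-construction styles being benchmarked (the logger name and message text are hypothetical):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bench")

count, name = 42, "worker"

# %-formatting: build the string first, then hand it to the logger
logger.info("processed %d items on %s" % (count, name))

# direct %-style: the logger interpolates only if the record is emitted
logger.info("processed %d items on %s", count, name)

# str.format()
logger.info("processed {} items on {}".format(count, name))

# f-string
logger.info(f"processed {count} items on {name}")

# concatenation: needs an explicit str() call for the integer
logger.info("processed " + str(count) + " items on " + name)
```

&lt;p&gt;Note that the direct %-style form also lets the logger skip interpolation entirely when the log level is disabled.&lt;/p&gt;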

&lt;p&gt;We’ll spend more time in a different series digging into Python’s logging library to build better benchmarks, but this example should give you a fairly good sense of why you need to consider how you construct your logging messages if you’re going text-based. These numbers may not seem like much, but in reality you’re probably making hundreds of calls to your logger in an instance larger than this one. With multiple inputs, the difference between the fastest and slowest methods is on the order of 4.5 microseconds, and this message generation example is extremely simple compared to what you typically need for logging in a serverless application. When milliseconds count, this kind of consideration is a priority.&lt;/p&gt;
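&lt;p&gt;If you want a quick feel for these numbers on your own machine, a &lt;code&gt;timeit&lt;/code&gt; comparison along these lines is enough (the message shape here is hypothetical, not the exact benchmark code):&lt;/p&gt;

```python
import timeit

n = 200_000  # iterations; raise for more stable numbers

# Time two of the message-construction styles from the table above
t_fstring = timeit.timeit(lambda: f"processed {42} items on worker", number=n)
t_concat = timeit.timeit(lambda: "processed " + str(42) + " items on worker", number=n)

print(f"f-string: {t_fstring:.4f}s  concatenation: {t_concat:.4f}s")
```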

&lt;h2&gt;
  
  
  Next time
&lt;/h2&gt;

&lt;p&gt;Once you've considered how to improve the performance of startup and basic text-based messages in your serverless logging setup, you need to start thinking about crafting logging objects to transmit over the network in the smallest possible package as fast as possible. In the next post, we're going to dive into logging objects versus strings, and in subsequent posts we'll start thinking about memory allocation, concurrency and state, and security.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;If you want to test these benchmarks yourself on your own system, you can find the code at &lt;a href="https://github.com/nimbinatus/benchmarking-logs"&gt;https://github.com/nimbinatus/benchmarking-logs&lt;/a&gt;. The repo is a work in progress, so the code may have changed by the time you read this. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>serverless</category>
      <category>logging</category>
      <category>devops</category>
    </item>
    <item>
      <title>Postmortem of Root Certificate Expiration from 30 May 2020</title>
      <dc:creator>Sven Delmas</dc:creator>
      <pubDate>Wed, 10 Jun 2020 22:01:00 +0000</pubDate>
      <link>https://forem.com/logdna/postmortem-of-root-certificate-expiration-from-30-may-2020-5d9b</link>
      <guid>https://forem.com/logdna/postmortem-of-root-certificate-expiration-from-30-may-2020-5d9b</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published to &lt;a href="https://logdna.com/postmortem-of-root-certificate-expiration-from-30-may-2020"&gt;our blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We had a partial production outage the weekend of 30 May 2020, and we missed a few things, outlined in the next few sections. Since others may encounter a similar incident in the future, we thought it would be worthwhile to share our experience and help others learn as much as we have. We’re addressing where we failed through better customer communication, improved planning for future workarounds, accelerated completion of our CI/CD improvements, and stronger endpoint alerting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All times listed are in UTC.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On 30 May 2020 at 13:04, a customer alerted us in our public Slack that their LogDNA agents suddenly stopped shipping data. Our on-call SRE began an investigation immediately.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;The certificate chain for our systems relied on the AddTrust External CA Root. That root certificate expired on 30 May 2020 at 10:48, which caused a certificate expiry error that broke the TLS handshake between some of our customers’ agents and libraries and our systems. Any system that attempted a new TLS handshake with our various endpoints with one of the clients using an older certificate authority certificate store or an older TLS implementation failed to connect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Detailed Times (in UTC)
&lt;/h3&gt;

&lt;dl&gt;
&lt;dt&gt;2020-05-30 10:48&lt;/dt&gt;
&lt;dd&gt;The certificate expired as noted.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 12:59&lt;/dt&gt;
&lt;dd&gt;We received our first customer ticket reporting something wasn’t working correctly.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 13:04&lt;/dt&gt;
&lt;dd&gt;We first received word from a customer in our public Slack that their agents stopped shipping data.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 13:13&lt;/dt&gt;
&lt;dd&gt;The SRE on-call finished verifying the customer report and opened an incident.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 13:52&lt;/dt&gt;
&lt;dd&gt;In analyzing the data coming in for ingestion, we misunderstood the impact of the certificate expiration due to a discrepancy in two different datasets. Our data did not demonstrate a complete loss of ingestion traffic, so while a severe problem, we thought the issue was more isolated to one version of our agent. Based on customer reports, we thought that only the Docker-based image of our v1 agent was affected. As such, we focused on releasing a new v1 agent build and identifying a way for customers running older operating systems to update their certificate lists.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 15:22&lt;/dt&gt;
&lt;dd&gt;We identified the patch for Debian-based systems and attempted to apply the same patch to our v1 agent image. Then, we realized that the problem with the v1 agent was due to how NodeJS manages certificates (described under NodeJS Certificate Management). We focused our efforts on rebuilding the v1 agent with the correct certificate store for both Docker and Debian-based systems.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 19:10&lt;/dt&gt;
&lt;dd&gt;We started validating packages internally before shipping to customers. We also thought we identified the solution to the lack of any drop in ingestion traffic as existing agents did not need to attempt a new TLS handshake.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 19:13&lt;/dt&gt;
&lt;dd&gt;We pushed a new v1 agent build that allowed customers to restart ingestion on most platforms and documented hotfixes for older Debian-based systems in our public Slack. We did not realize our build had failures on AWS due to a hiccup in our CI/CD process.&lt;/dd&gt;
&lt;dt&gt;2020-05-31&lt;/dt&gt;
&lt;dd&gt;We continued to think that the patched image solved the problem and pointed customers to the new 1.6.3 image as needed.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 13:07&lt;/dt&gt;
&lt;dd&gt;We received reports from multiple customers with data that the new 1.6.3 image had issues that would cause CrashLoopBackOff. We discovered a bug in our release chain that prevented us from releasing a new image. We were back at square one.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 15:00&lt;/dt&gt;
&lt;dd&gt;We fixed the bug in our release chain and started generating new packages. An intermittent hiccup in service with our CI/CD provider caused a further delay.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 17:14&lt;/dt&gt;
&lt;dd&gt;We released the new packages and started exploring the option of switching to a new certificate authority. We wanted to remove the need for customer intervention and to avoid issues with how NodeJS vendored OpenSSL (described under NodeJS Certificate Management and Certificate Update Process).&lt;/dd&gt;
&lt;dt&gt;2020-06-01 18:13&lt;/dt&gt;
&lt;dd&gt;We began the process of updating our certificates on testing environments, pushing the changes out to higher environments one by one after testing was complete.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 21:46&lt;/dt&gt;
&lt;dd&gt;We completed a switchover to a new certificate authority and certificate chain. This fix meant all systems except those that would require manual certificate chain installation, such as those using Syslog ingestion, would immediately begin ingestion again without any further customer action. Those systems with manual certificate chain installation were provided with further instructions and pointed to the new chain on the CDN.&lt;/dd&gt;
&lt;/dl&gt;

&lt;h2&gt;
  
  
  Key Factors
&lt;/h2&gt;

&lt;h3&gt;
  
  
  NodeJS Certificate Management
&lt;/h3&gt;

&lt;p&gt;Along with our libraries and direct endpoints, we have two separate agents we’re currently supporting: v1 agent, which is written in NodeJS, and v2 agent, which is written in Rust. The v1 agent has been kept on an older version of NodeJS to provide compatibility with older operating systems. That older version of NodeJS uses a default list of trusted certificates for certificate authorities that did not include the new certificate, and NodeJS overall does not read from the local system’s trusted certificate authority list.&lt;/p&gt;

&lt;p&gt;NodeJS, similarly to Java, ships with a bundled list of trusted certificates to ensure the security of TLS calls. Browsers do the same thing except they manage their own lists; NodeJS uses Mozilla’s list as the core developers deferred to Mozilla's "&lt;a href="https://developer.ibm.com/node/2017/03/02/extend-nodes-built-ca-certificate-store-using-node_extra_ca_certs/"&gt;well-defined policy&lt;/a&gt;" to ensure the list stays current. In the past, there &lt;a href="https://github.com/nodejs/node/issues/1256"&gt;were&lt;/a&gt; a &lt;a href="https://github.com/nodejs/node/issues/3159"&gt;number&lt;/a&gt; of &lt;a href="https://github.com/nodejs/node/issues/4175"&gt;calls&lt;/a&gt; for NodeJS to enable teams to add new certificates or otherwise enable better management of the certificate store. As of version 7.3.0 (coupled with LTS 6.10.0 and LTS 4.8.0), the core developers of NodeJS added the ability to include new certificates in that trusted list. Before that release, end-user developers and ops teams would have to recompile NodeJS with their own certificate additions or patches. Coincidentally, the NodeJS community &lt;a href="https://github.com/nodejs/node/issues/33681"&gt;raised a request&lt;/a&gt; to remove the AddTrust certificate and put the proper certificate in, and that fix &lt;a href="https://github.com/nodejs/node/pull/33682"&gt;landed&lt;/a&gt; on 01 June 2020. We discovered this change in the source code itself during the postmortem phase of this incident.&lt;/p&gt;
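&lt;p&gt;Since that release, the extension mechanism is the &lt;code&gt;NODE_EXTRA_CA_CERTS&lt;/code&gt; environment variable; a sketch of its use looks like this (the PEM path and script name are placeholders):&lt;/p&gt;

```shell
# Append extra trusted roots to Node's bundled CA list without recompiling.
# The bundle path and agent.js are placeholders for your own files.
export NODE_EXTRA_CA_CERTS=/etc/ssl/certs/extra-roots.pem
node agent.js
```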

&lt;p&gt;NodeJS also ran into an issue with how it vendored OpenSSL. Different versions of NodeJS used different versions of OpenSSL, and some older versions of OpenSSL gave up upon finding an invalid certificate in a given path rather than trying alternatives.&lt;/p&gt;

&lt;p&gt;How does that work, exactly? To make sure we’re all on the same page, let’s talk about certificate chains. The basic chain involves three pieces: a root certificate from a certificate authority, an intermediate certificate that is verified by the root certificate, and a leaf certificate that is verified by the intermediate certificate. The root certificates, as noted, are generally coded into operating systems, browsers, and other local certificate stores to ensure no one can (easily) impersonate a certificate authority. Leaf certificates are familiar to most technical teams, as they are the certificates a team receives as the result of a request to a certificate authority. Intermediate certificates, on the other hand, serve a few functions. The most important is that they add a layer of security between the all-powerful root certificate, with its private keys, and the external world. The function most relevant here, however, is bridging between old root certificates and new ones, a practice known as cross-signing: the certificate authority releases two intermediate certificates, one for each root certificate, and both validate the same leaf certificate.&lt;/p&gt;

&lt;p&gt;Older versions of OpenSSL, specifically 1.0.x and older, have issues with this system where they follow the certificate path up the chain once and then fail if that path leads to an expired root certificate. In newer versions, OpenSSL attempts to follow an alternate chain when one is available, such as in this case where there was an additional cross-signed intermediate available that pointed to the new root. This last issue caused problems when we built our systems on different versions of NodeJS that ship with older versions of OpenSSL, and it also caused problems with other systems because of the next point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Certificate Update Process
&lt;/h3&gt;

&lt;p&gt;When we updated certificates in the past, we only updated the leaf certificate (the bottommost part of the certificate chain) rather than including the intermediate certificate. Sectigo has offered intermediate certificates that cross-sign the AddTrust certificate with the new USERTrust RSA certificate to ensure that older systems remain supported. Since we didn’t add the intermediate certificate during our update process, we missed the cross-signed intermediate certificate. As noted, this omission caused errors with OpenSSL and GnuTLS on other older systems, such as older Debian or Fedora builds, not just in our NodeJS build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint Monitoring
&lt;/h3&gt;

&lt;p&gt;Finally, and possibly most importantly, we did not have any external endpoint monitoring to alert us when this certificate chain broke. We ended up relying on customers flagging the incident for us.&lt;/p&gt;
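&lt;p&gt;A minimal external probe of the kind we were missing might look like this in Python (the hostname and alert threshold are placeholders, not our actual monitoring code):&lt;/p&gt;

```python
import datetime
import socket
import ssl

host = "example.com"  # placeholder for a real ingestion endpoint

# Complete a TLS handshake, which also validates the presented chain
# against the local trust store, then read the leaf certificate's expiry.
ctx = ssl.create_default_context()
with socket.create_connection((host, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        cert = tls.getpeercert()

expires = datetime.datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
days_left = (expires - datetime.datetime.utcnow()).days
print(f"{host}: certificate expires in {days_left} days")

if days_left >= 14:
    print("OK")
else:
    raise SystemExit(f"ALERT: {host} certificate expires soon")
```

&lt;p&gt;Run on a schedule against every public endpoint, a check like this alerts on both an already-broken chain (the handshake fails) and an upcoming expiry.&lt;/p&gt;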

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Outage and Customer Communications
&lt;/h3&gt;

&lt;p&gt;We weren't as prompt, nor as expansive, as we should have been in communicating with our customers. That is changing immediately, starting with this public postmortem. We are discussing steps to ensure that we have the right channels of communication for customers based on incident severity, and we will implement them as quickly as possible. Here’s the most up-to-date version of our plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low-severity incidents:&lt;/strong&gt; Our customers can expect timely status page updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium-severity incidents:&lt;/strong&gt; In addition to status page updates, we'll deliver in-app notifications for all impacted customers. Note that we are working on this functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-severity incidents:&lt;/strong&gt; You can expect the above communications, as well as direct email notice to all impacted customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, if there is an incident, we owe it to every impacted customer to notify them as early as possible and to provide consistent status page updates. Our customers can expect effective communication moving forward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workarounds and Solutions
&lt;/h3&gt;

&lt;p&gt;We know customers rely on us every day to help with troubleshooting, security, compliance, and other critical tasks. We were not fast enough in providing potential workarounds for the full problem. We are starting work on documenting temporary workarounds for future use so we can respond faster during an incident.&lt;/p&gt;

&lt;p&gt;We know some customers need a backfill solution for security or compliance. We are aware of the problem and are in the extremely early stages of defining solutions and workarounds for future use. Given the current complexities, we do not recommend attempting to backfill missing logs from this recent incident. Some of the difficulties we need to address for a backfill solution are identifying when the problem actually requires backfill, minimizing impact of any potential option on throughput to livetail, reducing duplicate entries, and ensuring that timestamps are preserved and placed in order. For example, any log management system can see occasional delays when coupled with slower delivery from external systems, and it might appear that a batch of log lines was not ingested when, in fact, they are in the pipeline and would appear on their own without any intervention. As these difficulties could seriously warp or destroy data if not handled properly, we are starting with small, noncritical experiments and ensuring that any process we eventually recommend for any workaround our customers need is thoroughly tested, repeatable, and reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Initiatives
&lt;/h3&gt;

&lt;p&gt;We’re improving our CI/CD systems to speed up releases when we need them. We were actually in the middle of that work during this incident, which caused some of the delay in releasing an initial patch as we discovered issues we needed to fix.&lt;/p&gt;

&lt;p&gt;Our current monitoring solution does not include external endpoint monitoring. To increase our level of proactive alerting and notification, we have already started talking to vendors to supplement our current monitoring so we are notified the moment any of our endpoints or any of our own systems goes down. We are also working to identify the right automated solution for checking certificates, one that validates the entire certificate chain rather than just the tail end of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap Up
&lt;/h2&gt;

&lt;p&gt;While this certificate expiration incident affected multiple software providers over the weekend, we should have known of the issues proactively, we should have been more actively communicating updates during the incident, we should have been faster at identifying and providing workarounds, and ultimately, we should have resolved the issue much faster. Our commitment to you is that we are going to do better, starting with the fixes outlined above.&lt;/p&gt;

</description>
      <category>postmortem</category>
    </item>
    <item>
      <title>Adding Logs to Legacy Applications</title>
      <dc:creator>Laura Santamaria</dc:creator>
      <pubDate>Wed, 27 May 2020 15:02:29 +0000</pubDate>
      <link>https://forem.com/logdna/adding-logs-to-legacy-applications-3e5l</link>
      <guid>https://forem.com/logdna/adding-logs-to-legacy-applications-3e5l</guid>
      <description>&lt;p&gt;As the final interactive in my mini-workshop at DeveloperWeek Austin 2019, I posed the following scenario to the audience:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You have a legacy application that has not been updated in 5 years. The system is running Python 2, which is sunsetting in January 2020. The system recently had its first incident in nearly 4 years, and your team was among the group that had to bring it back up. The logs that you received were not very helpful, and bringing the production instance back up ended up being a lot of trial and error.&lt;/p&gt;

&lt;p&gt;Management has decided all applications must be on Python 3 by the end of code freeze in January 2020. Your team has been tasked with updating the application to use Python 3. It's the ideal time to add proper logging. How would you go about planning and executing that logging update?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The scenario generated a lot of discussion, and people had some very good answers. However, someone on Twitter (thanks for the question, &lt;a href="https://twitter.com/russellyazbeck/status/1192526502073683973?s=20"&gt;@russellyazbeck&lt;/a&gt;!) pointed out to me that I completely forgot to provide my take on the same scenario. Whoops! I did respond in a thread on Twitter, but I'd like to lay it out and expand on it a bit here as it's definitely easier to find.&lt;/p&gt;

&lt;p&gt;First and foremost, the members of the audience who started with bringing everyone together were spot on. While you certainly have the data your immediate team gathered during the incident, reaching out to the rest of the teams that responded and gathering their data is also important. Why? They have different perspectives. It's very easy to get lost in your own context and stop noticing quirks or issues unique to your programming language of choice, to a platform you're somewhat familiar with, or even to a team you've worked with in the past. Ask anyone who has ever successfully taught a concept to someone else, and nine times out of ten they'll mention how surprised they were at where people got lost, or how hard it was to avoid jargon that needed further explanation. So gather data from everyone who was involved in that incident so you can start understanding what other people might need to understand the same issue.&lt;/p&gt;

&lt;p&gt;Once you've gotten everyone together, start talking about the ideal incident response and the ideal data you would have gathered. What should those logs have shown? What data did you actually need? What data was just noise? Was there any data that duplicated information? You can also use this time to discuss which log levels would be useful for each type of data. Log levels help reduce and structure the noise coming from a logging system. A good production system allows tuning the logs based on environment (dev, test/QA, staging, prod, or some combination thereof), and any team doing ops work on that system will thank you if they never have to guess whether a log line is a deprecation warning (WARN), something that isn't acting right but won't take down the whole house of cards (ERROR), or something that took down everything including your databases and networking (CRITICAL). Coming to a consensus on which logging levels are necessary and how to define them is really important, both now and in the future.&lt;/p&gt;
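&lt;p&gt;As a sketch of where that consensus might land, here is a hypothetical per-environment level table using the standard library's levels (the logger name and messages are made up):&lt;/p&gt;

```python
import logging

# Hypothetical mapping from deployment environment to log verbosity
LEVELS = {
    "dev": logging.DEBUG,
    "staging": logging.INFO,
    "prod": logging.WARNING,
}

logging.basicConfig(level=LEVELS["prod"])
logger = logging.getLogger("legacy_app")

logger.debug("raw request payload received")            # dropped in prod
logger.warning("Python 2 date parsing is deprecated")   # WARN: heads-up
logger.error("upstream call failed; retrying")          # ERROR: degraded, still up
logger.critical("database unreachable; shutting down")  # CRITICAL: all down
```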

&lt;p&gt;Now that you have a much better idea of what kind of data you actually needed, you would pick a library or logging structure that could help give you that data. If you had a bunch of other apps running, as is likely when there's a legacy application and multiple teams working on multiple projects, I'd definitely look to a structured logging library like &lt;a href="http://www.structlog.org/en/stable/"&gt;structlog&lt;/a&gt;. While I could roll my own on top of the standard logging library, my guess is the rest of the team (and future team members) would find a library with good docs and standardized usage much easier for maintaining good logs in the long run. An opinionated logging library would likely be best to ensure everyone logs well. Personally, I wouldn't use only text logs for this sort of situation, even if there's only one application that your company owns. Start as you intend to continue, so that it's a lot easier down the line to ensure future systems are easier for others to onboard onto, with similar features, common style, and other familiar elements. Keep in mind, though, a point I made in the workshop: the audience of structured logs isn't really a human or a set of humans, but rather the many machines parsing the data for you.&lt;/p&gt;
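&lt;p&gt;If pulling in a dependency isn't an option right away, even a stdlib-only formatter gets you machine-parseable output; here is a rough sketch (the class and field names are my own, and structlog offers far more than this):&lt;/p&gt;

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line for machines to parse."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("legacy_app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("conversion step complete")
```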

&lt;p&gt;By the way, I want to point something out. I chose the Python 2 to Python 3 conversion scenario because it's one of those moments that's an ideal time to add logs. You're already in the codebase digging around and touching everything. You're getting to know what's there, so you're unlikely to skip anything major (well, assuming you're not using &lt;a href="https://pypi.org/project/six/"&gt;six&lt;/a&gt; or the built-in &lt;a href="https://docs.python.org/3/library/2to3.html"&gt;2to3&lt;/a&gt;). It's also the ideal time, as noted by a few folks in the audience, to add in a deprecation warning for anything that relied on the Python 2 conventions for hitting the application. However, it is a bit of a red herring. You can add these kinds of logs to any system at any time. Legacy systems are often viewed as the most dreaded to work with, hence the scenario, and incidents are one of the ideal times to take a step back and understand what data is flowing through your system. However, you can use this same thought process for a modern application, an application that hasn't had an incident, or even an application that's brand new. Walk through the scenario as if your application just had that moment happen (or knock it over deliberately in dev or staging when those environments are not in use for anything critical), and see what comes out of the brainstorming exercise with the various teams that would theoretically be involved. Then add logs and monitor the outcome.&lt;/p&gt;

&lt;p&gt;How else would you respond to this scenario I laid out here?&lt;/p&gt;

</description>
      <category>logging</category>
      <category>legacy</category>
      <category>devops</category>
    </item>
    <item>
      <title>LogDNA Engineering</title>
      <dc:creator>Laura Santamaria</dc:creator>
      <pubDate>Wed, 20 May 2020 18:27:10 +0000</pubDate>
      <link>https://forem.com/logdna/logdna-engineering-2cp4</link>
      <guid>https://forem.com/logdna/logdna-engineering-2cp4</guid>
      <description>&lt;p&gt;Let’s get this blog started!&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s this blog all about?
&lt;/h2&gt;

&lt;p&gt;We’re going to focus on technical discussions around logging and discuss some of the challenges and failures that we meet along the way. You’ll hear from me (Laura) a lot, but you’ll also hear from members of our engineering teams.&lt;/p&gt;

&lt;p&gt;You won't see product pitches or other vendor-y posts here. This is a technical blog, not a sales blog. We will likely mention LogDNA sometimes—we can't help but talk about the company when we're discussing unique challenges we've faced along the way. However, the focus will always be on the technical side of things.&lt;/p&gt;

&lt;p&gt;If there are any topics around logging you'd love to hear about, let us know!&lt;/p&gt;

</description>
      <category>logging</category>
      <category>devops</category>
      <category>engineering</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
