<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gdcohen</title>
    <description>The latest articles on Forem by gdcohen (@gdcohen).</description>
    <link>https://forem.com/gdcohen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F175939%2F1d8107e9-a64c-4d9e-ada3-cbb6f45f97b7.png</url>
      <title>Forem: gdcohen</title>
      <link>https://forem.com/gdcohen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gdcohen"/>
    <language>en</language>
    <item>
      <title>3 ways ML is a Game Changer for your Incident Management Lifecycle</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Mon, 12 Apr 2021 16:55:26 +0000</pubDate>
      <link>https://forem.com/gdcohen/3-ways-ml-is-a-game-changer-for-your-incident-management-lifecycle-55md</link>
      <guid>https://forem.com/gdcohen/3-ways-ml-is-a-game-changer-for-your-incident-management-lifecycle-55md</guid>
      <description>&lt;p&gt;Any developer, SRE or DevOps engineer responsible for an application with users has felt the pain of responding to a high priority incident. There's the immediate stress of mitigating the issue as quickly as possible, often at odd hours and under severe time pressure. There's the bigger challenge of identifying root cause so a durable fix can be put in place. There's the aftermath of postmortems, reviews of your monitoring and observability solutions, and inevitable updates to alert rules. And there's the typical frustration of wondering what could have been done to avoid the problem in the first place.&lt;/p&gt;

&lt;p&gt;In a modern cloud native environment, the complexity of distributed applications and the pace of change make all of this ever harder. Fortunately, AI and ML technologies can help with these human-driven processes. Here are three specific ways:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Drastically cut incident remediation times
&lt;/h2&gt;

&lt;p&gt;The toughest incidents are the ones where the symptoms are obvious but the root cause is not. In other words, they are easy to detect but hard to root cause -- as seen in &lt;a href="https://www.zebrium.com/blog/lessons-from-slack-gcp-and-snowflake-outages-zebrium"&gt;recent outages at GCP, Slack and Snowflake&lt;/a&gt;. SREs and engineers can spend hours digging through dashboards and traces, and inevitably end up scanning millions of log lines.&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g2z5ZVCX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/manual%2520hunting%2520for%2520root%2520cause.png%3Fwidth%3D234%26name%3Dmanual%2520hunting%2520for%2520root%2520cause.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g2z5ZVCX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/manual%2520hunting%2520for%2520root%2520cause.png%3Fwidth%3D234%26name%3Dmanual%2520hunting%2520for%2520root%2520cause.png" alt="manual hunting for root cause"&gt;&lt;/a&gt; There might be clues that narrow the scope of the problem -- perhaps to a set of services, containers, or hosts -- but ultimately it is a search for the unknown. Is there a new type of error? Any unusual events? A significant deviation from normal event patterns? And when there are many of the above, how do they relate to each other?&lt;/p&gt;

&lt;p&gt;Really experienced engineers develop instincts to help with this hunt for the unknown. But machine learning is very well suited to this problem -- it can keep tracking the evolving (but healthy) event patterns and their correlations, &lt;a href="https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-detection-you-actually-wanted"&gt;quickly surface unusual ones that explain root cause&lt;/a&gt;, and even &lt;a href="https://www.zebrium.com/blog/real-world-examples-of-gpt-3-plain-language-root-cause-summaries-zebrium"&gt;summarize the problem in plain language&lt;/a&gt; by matching the events against known problems in the public domain.&lt;/p&gt;
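&lt;p&gt;The core of this pattern-tracking idea can be sketched in a few lines of Python: learn the set of log-event templates seen during healthy operation, then flag lines whose template has never appeared before. This is a toy illustration of the technique, not Zebrium's actual algorithm; the masking regexes and class names are assumptions.&lt;/p&gt;

```python
import re

def template_of(line):
    """Mask volatile fields (hex ids, numbers) so log lines with the
    same shape map to the same template."""
    masked = re.sub(r"0x[0-9a-f]+", "0xID", line)
    masked = re.sub(r"\d+", "N", masked)
    return masked

class NoveltyDetector:
    def __init__(self):
        self.known = set()

    def train(self, lines):
        # Learn templates from a healthy baseline period.
        for line in lines:
            self.known.add(template_of(line))

    def novel(self, line):
        # A template never seen in the baseline is an anomaly candidate.
        return template_of(line) not in self.known
```

&lt;p&gt;A real system would also track event frequencies and correlations across streams, not just template novelty, but the principle is the same: the ML maintains a model of "normal" so that genuinely new behavior stands out.&lt;/p&gt;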

&lt;h2&gt;
  
  
  2. Eliminate the Alert Rule hamster wheel
&lt;/h2&gt;

&lt;p&gt;The second pain point is the need to continually revise and evolve the alert rules and settings that give you early warning. A purist approach might only monitor a narrow set of user-impacting health metrics and symptoms, but that can make it harder to identify root cause.&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H_UD3AE9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/manual%2520alert%2520rules.png%3Fwidth%3D396%26name%3Dmanual%2520alert%2520rules.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H_UD3AE9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/manual%2520alert%2520rules.png%3Fwidth%3D396%26name%3Dmanual%2520alert%2520rules.png" alt="manual alert rules"&gt;&lt;/a&gt; So in reality most organizations set alerts for a blend of user-facing symptoms as well as underlying health indicators (errors, latencies, reconnects, resource exhaustion, etc.). After a particularly painful incident, it is natural to review and modify alerts -- adding new ones or adjusting thresholds each time a new type of issue is encountered. The challenge is that as long as new types of problems continue to occur, this is a never-ending game of catch-up.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B70022JZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/ML%2520driven%2520alert%2520rules.png%3Fwidth%3D262%26name%3DML%2520driven%2520alert%2520rules.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B70022JZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/ML%2520driven%2520alert%2520rules.png%3Fwidth%3D262%26name%3DML%2520driven%2520alert%2520rules.png" alt="ML driven alert rules"&gt;&lt;/a&gt;&lt;br&gt;
Machine learning can reduce this burden considerably. The simplest approach is to configure a set of "signals" that trigger ML-driven reports. Signals could of course be real incidents, but they could also be symptom alerts. For example, many teams watch the overall error frequency -- if it spikes relative to recent trends, you know something is wrong, but not necessarily what. You can use that same simple alert as a trigger for machine learning to scan the logs and metrics for that deployment around the time of the alert, identifying unusual events/sequences and anomalous metrics that could explain the spike in errors. Even better, machine learning can fingerprint these sequences -- so when a particularly noteworthy root cause is detected, you already have a pre-built alert rule you can simply connect to an alert channel.&lt;/p&gt;
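&lt;p&gt;The spike trigger described above can be sketched as follows: compare the latest window's error count against the average of the preceding windows and fire when it jumps. The function name and threshold are illustrative assumptions, not a real alerting API.&lt;/p&gt;

```python
def error_spike(window_counts, k=3.0):
    """Fire when the newest window's error count exceeds k times the
    average of the preceding windows. A real pipeline would use this
    cheap symptom alert to kick off a deeper ML scan of the logs and
    metrics around the same time window."""
    *history, current = window_counts
    baseline = sum(history) / len(history)
    return current > k * max(baseline, 1.0)
```

&lt;p&gt;For instance, error counts of 4, 5 and 6 per window followed by 40 would trip the trigger, while a drift up to 7 would not.&lt;/p&gt;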

&lt;h2&gt;
  
  
  3. Proactively catch silent bugs and inform developers early in the cycle
&lt;/h2&gt;

&lt;p&gt;In the not too distant past, new releases were tested extensively before deploying to production. This allowed for deliberately constructed test plans, stress tests and an opportunity to catch bugs that might have potentially nasty downstream consequences.&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VotKfmXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/proactive%2520detection%2520of%2520incidents.png%3Fwidth%3D234%26name%3Dproactive%2520detection%2520of%2520incidents.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VotKfmXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/proactive%2520detection%2520of%2520incidents.png%3Fwidth%3D234%26name%3Dproactive%2520detection%2520of%2520incidents.png" alt="proactive detection of incidents"&gt;&lt;/a&gt; Today, deployment cycles are much faster, drastically shrinking the time to do any of the above. There is now a trend towards "test in production". Although many teams do use staging environments and approaches like chaos tools, it's more likely that subtle bugs will only surface in production when they result in user complaints or visible symptoms.&lt;/p&gt;

&lt;p&gt;By surfacing new or unusual errors, event patterns and metric anomalies, machine learning can quickly become a developer's best friend, exposing subtle bugs early, before they impact users. For instance, using our own ML technology, the Zebrium engineering team recently caught a bug related to a malformed middleware SQL query that, under certain conditions, prevented users from completing their intended workflow. Another example involved an exception handled in a try/catch block that emitted an error log message but otherwise silently broke outbound webhook notifications. Our developers have come to appreciate how the proactive detection from our internal Zebrium service catches these kinds of bugs early, before they can do real damage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As more users rely on software applications, the pressure to shrink MTTR and the stress of troubleshooting incidents both grow proportionally. Over the last decade a rich set of observability tools has emerged to help detect problems easily, but troubleshooting has remained very manual, driven by the instincts and experience of the engineer on call. New approaches that apply machine learning to this problem can help by &lt;a href="https://www.zebrium.com/blog/youve-nailed-incident-detection-what-about-incident-resolution"&gt;drastically reducing MTTR&lt;/a&gt;, catching new bugs early, and reducing the manual effort involved in tasks like creating RCA reports and maintaining alert rules.&lt;/p&gt;

&lt;p&gt;If you're interested in using ML as part of your incident management lifecycle, please &lt;a href="https://www.zebrium.com/"&gt;visit Zebrium&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the Author &lt;a href="https://www.zebrium.com/blog/author/ajay-singh"&gt;Ajay Singh&lt;/a&gt; @ Zebrium&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Real World Examples of GPT-3 Plain Language Root Cause Summaries</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Thu, 25 Mar 2021 18:22:15 +0000</pubDate>
      <link>https://forem.com/gdcohen/real-world-examples-of-gpt-3-plain-language-root-cause-summaries-22d8</link>
      <guid>https://forem.com/gdcohen/real-world-examples-of-gpt-3-plain-language-root-cause-summaries-22d8</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MRkJv8RT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Plain%2520language%2520Root%2520Cause%2520process.png%3Fwidth%3D758%26name%3DPlain%2520language%2520Root%2520Cause%2520process.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MRkJv8RT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Plain%2520language%2520Root%2520Cause%2520process.png%3Fwidth%3D758%26name%3DPlain%2520language%2520Root%2520Cause%2520process.png" alt="Plain language Root Cause process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few weeks ago, &lt;a href="http://twitter.com/stochastimus"&gt;Larry&lt;/a&gt;, our CTO, wrote about a new beta feature leveraging the GPT-3 language model - &lt;a href="https://www.zebrium.com/blog/using-gpt-3-with-zebrium-for-plain-language-incident-root-cause-from-logs"&gt;Using GPT-3 for plain language incident root cause from logs&lt;/a&gt;. To recap -- Zebrium's unsupervised ML identifies the root cause of incidents and generates concise reports (typically 5-20 log events) identifying the first event in the sequence (usually the root cause), the worst symptom, other associated events and correlated metrics anomalies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yc3mqWn1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Root%2520cause%2520report%2520summary-1.png%3Fwidth%3D2018%26name%3DRoot%2520cause%2520report%2520summary-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yc3mqWn1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Root%2520cause%2520report%2520summary-1.png%3Fwidth%3D2018%26name%3DRoot%2520cause%2520report%2520summary-1.png" alt="Root cause report summary-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As Larry pointed out, this works well for developers who are familiar with the logs, but can be hard to digest if an SRE or frontline ops engineer isn't familiar with the application internals. The GPT-3 integration lets us take the next step -- distilling these root cause reports into concise natural language summaries by drawing on GPT-3's internet-scale training data, which covers descriptions of many similar incidents, and producing brief "English" descriptions for a user to scan.&lt;/p&gt;
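&lt;p&gt;Mechanically, the distillation step amounts to assembling the report's log events into a prompt and asking the language model for a one-sentence explanation. A minimal sketch, with illustrative prompt wording and a hypothetical client call (the real integration details are not shown in this post):&lt;/p&gt;

```python
def build_prompt(log_events):
    """Turn a root cause report's log events into a summarization
    prompt. The instruction text here is illustrative, not Zebrium's
    actual prompt."""
    header = ("Explain the root cause of the incident described by "
              "these log events in one plain-English sentence:")
    lines = "\n".join("- " + event for event in log_events)
    return header + "\n" + lines

# summary = llm_client.complete(build_prompt(events))  # hypothetical client call
```

&lt;p&gt;The interesting engineering is upstream of this call: because the ML has already reduced an incident to a handful of relevant events, the prompt stays small and focused, which is what makes the summaries usable.&lt;/p&gt;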

&lt;p&gt;After a few weeks of beta testing this feature with a limited group, and examining results from a couple of hundred incidents, we're now ready to share some exciting results and expand access to ALL Zebrium users, even those on free trials.&lt;/p&gt;

&lt;p&gt;In a nutshell -- it works so well and in such a wide range of scenarios that we felt most users would benefit from having access to it. These summaries are both accurate and truly useful -- distilling log events into a description a frontline or experienced engineer can easily understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;First the caveats&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is still an early-stage feature for us, and there are cases where GPT-3 veers into guesswork and suggests summaries that seem related to the core RCA report, but aren't exactly right. To make sure users know this, we tag the summaries with an "EXPERIMENTAL" badge in the UI.&lt;/p&gt;

&lt;p&gt;There are also times the specific RCA report does not generate a particularly illuminating natural language summary beyond recapping the key log event(s). For instance --&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;The first log message indicates that the node was not responding.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;The first log message is a fatal error indicating that the program detected a problem and aborted.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;The client sent a request to the server, but the server did not respond.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;The first log message shows that the **** task took *** ms to complete.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several possible reasons for these suboptimal outcomes. One possibility is that there simply aren't enough examples of that type of issue in the public domain, so GPT-3 is responding with the closest details it can find. Another is that we haven't yet explored all the variants of prompts and options we can use with the GPT-3 model.&lt;/p&gt;

&lt;p&gt;The good news is that even when results are suboptimal, they are mostly not misleading and are easily ignored. More importantly, our ML-generated root cause summaries are the perfect input source for GPT-3, and with more work, the outcomes will only get better from here.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Now for the good part&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The great news is that it works well more often than not, and the results are genuinely useful. Here are some examples where the GPT-3 summary described the event collection accurately and helped the user quickly digest the RCA. &lt;em&gt;Note: we have obfuscated details that might be potentially sensitive, and we're not sharing the raw log events for the same reason, although they would be useful to compare alongside the summaries.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As a first bucket, here are some interesting and useful incident summaries related to memory starvation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The **** daemon was unable to allocate memory for the **** data.&lt;/li&gt;
&lt;li&gt;  The RabbitMQ broker was running out of memory.&lt;/li&gt;
&lt;li&gt;  The **** process was killed by the OOM killer.&lt;/li&gt;
&lt;li&gt;  The application was trying to fork a child process, but the fork call failed because the system was out of memory.&lt;/li&gt;
&lt;li&gt;  The memcached process was killed by the OOM killer because it was consuming too much memory. The OOM killer is a Linux kernel feature that allows the system to kill a process when the system is running out of memory.&lt;/li&gt;
&lt;li&gt;  The memory cgroup was out of memory, so the kernel killed process **** and its child ****.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, here are some other infrastructure related incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The AWS Machine Image (AMI) that the customer used was for the x86_64 architecture, but the customer's instance type was arm64.&lt;/li&gt;
&lt;li&gt;  The file system was corrupted.&lt;/li&gt;
&lt;li&gt;  The Kafka producer was not configured to be a leader for the topic-partition.&lt;/li&gt;
&lt;li&gt;  The root cause of the issue was that the Jenkins master was not able to connect to the vCenter server.&lt;/li&gt;
&lt;li&gt;  The root cause was a "drive failed" error.&lt;/li&gt;
&lt;li&gt;  The cluster was under heavy load, and the scheduler was unable to schedule the pod.&lt;/li&gt;
&lt;li&gt;  The pod was scheduled on a node that had insufficient CPU.&lt;/li&gt;
&lt;li&gt;  The root cause was that the Slack API was rate limited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For variety, here are some database related incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The database was closed, and the query failed.&lt;/li&gt;
&lt;li&gt;  The EC2 instance was running out of connections to the database.&lt;/li&gt;
&lt;li&gt;  The database driver was unable to ping the database server.&lt;/li&gt;
&lt;li&gt;  The first message is a SQL error, which means that the database server was unable to execute the query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, here are some examples of security related incident summaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The LDAP server was not responding to the client.&lt;/li&gt;
&lt;li&gt;  The root cause of the issue was that the certificate chain did not match any of the trust anchors.&lt;/li&gt;
&lt;li&gt;  The root cause of the problem was that the sshd daemon on the server was configured to allow only three authentication attempts per IP address.&lt;/li&gt;
&lt;li&gt;  The server rejected the connection because it has already seen too many invalid authentication attempts for that user.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our focus is to cut troubleshooting time using machine learning to summarize the key event sequences that describe an incident based on logs and associated metrics anomalies. The GPT-3 integration is a big step towards our goals -- enabling quick review of RCA reports by anyone, even personnel who may not be intimately familiar with application internals. As described above -- there are still improvements to be made, but it works so well in real world scenarios that we are now opening it up to all our users.&lt;/p&gt;

&lt;p&gt;Try it for yourself by &lt;a href="https://www.zebrium.com/sign-up"&gt;signing up for a free trial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the author &lt;a href="https://www.zebrium.com/blog/author/ajay-singh"&gt;Ajay Singh&lt;/a&gt; @ Zebrium&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Lessons from Slack, GCP and Snowflake outages</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Thu, 04 Feb 2021 22:51:31 +0000</pubDate>
      <link>https://forem.com/gdcohen/lessons-from-slack-gcp-and-snowflake-outages-1f5b</link>
      <guid>https://forem.com/gdcohen/lessons-from-slack-gcp-and-snowflake-outages-1f5b</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2Foutages%2520at%2520gcp%2520slack%2520snowflake-1.png%3Fwidth%3D589%26name%3Doutages%2520at%2520gcp%2520slack%2520snowflake-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2Foutages%2520at%2520gcp%2520slack%2520snowflake-1.png%3Fwidth%3D589%26name%3Doutages%2520at%2520gcp%2520slack%2520snowflake-1.png" alt="outages at gcp slack snowflake-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An outage at a market leading SaaS company is always noteworthy. Thousands of organizations and millions of users are so reliant on these services that a widespread outage feels as surprising and disruptive as a regional power outage. But the analogy with an electricity provider is unfair. While we expect a utility to be safe, boring and just plain reliable, we expect SaaS services to also innovate relentlessly. As we pointed out a while back, this is the crux of the problem. Although they employ some of the best engineers, sophisticated observability strategies and cutting-edge DevOps practices, SaaS companies also have to deal with ever accelerating change and growing complexity. And that occasionally defeats every effort to identify and resolve a problem quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FLogos%2Fsnowflake.png%3Fwidth%3D252%26name%3Dsnowflake.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FLogos%2Fsnowflake.png%3Fwidth%3D252%26name%3Dsnowflake.png" alt="snowflake"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, &lt;a href="https://community.snowflake.com/s/article/SI-20201216" rel="noopener noreferrer"&gt;Snowflake had a major outage&lt;/a&gt; six weeks ago. Users were unable to log into the UI, or found it unresponsive, for about 4 hours. The problem started because a web server running an older version of the OS ran out of disk space. After the root cause was determined, alert rules were updated. But this is a classic illustration of why alert rules can't possibly keep up with all new failure modes, and why you need ML to root cause such failure modes rapidly. This specific example is one that ML has repeatedly proven to handle easily (see &lt;a href="https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-detection-you-actually-wanted" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://www.zebrium.com/blog/using-autonomous-monitoring-with-litmus-chaos-engine-on-kubernetes" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FLogos%2FGoogle%2520Cloud.png%3Fwidth%3D316%26name%3DGoogle%2520Cloud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FLogos%2FGoogle%2520Cloud.png%3Fwidth%3D316%26name%3DGoogle%2520Cloud.png" alt="Google Cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another recent &lt;a href="https://status.cloud.google.com/incident/zall/20013" rel="noopener noreferrer"&gt;outage impacted users of GCP, BigQuery, GKE and other Google services&lt;/a&gt; for almost an hour. The root cause was a quota bug introduced when Google switched to a new quota management service, combined with the absence of adequate logic to catch this new failure mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FLogos%2Fslack.png%3Fwidth%3D300%26name%3Dslack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FLogos%2Fslack.png%3Fwidth%3D300%26name%3Dslack.png" alt="slack"&gt;&lt;/a&gt;Slack had another example -- a &lt;a href="https://devopsish.com/pdf/Slack-Incident-Jan-04-2021-RCA-Final.pdf" rel="noopener noreferrer"&gt;4 hour outage that started with network disruption&lt;/a&gt; (unusually high packet loss), but was really exacerbated when the provisioning service could not keep up with demand and started adding improperly configured servers to the fleet. To add insult to injury, the observability stack itself was unreachable due to the network disruptions. After RCA a swath of corrective actions were put in place, including new run books, new alert rules for network disruptions, design changes, a way to bypass the observability stack etc. But the crux of the issue remains - the provisioning service failure is another example of an unanticipated (new) failure mode that took a long time to resolve.&lt;/p&gt;

&lt;p&gt;Expecting humans to anticipate the unknown and figure out what happened was hard enough in the old days of simple monolithic architectures, monthly software releases, formal test plans and extensive QA. It is entirely unreasonable in the world of hundreds of intertwined microservices, multiple daily deployments and "test in production". What is most interesting about the examples above is that in each case the vendor quickly knew there was a problem, but it took a long time and a lot of hunting to figure out what caused it (the "root cause"). A growing number of organizations are now realizing that the only way to root cause these problems much faster (or even proactively detect them) is to employ ML to identify new failure modes and their root cause. ML in the observability domain has come a long way from the early attempts at anomaly detection and AIOps (read more about this &lt;a href="https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-detection-you-actually-wanted" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We encourage you to try it for yourself by breaking things and examining the ML-generated RCA reports. Don't have an environment where you can easily do this? Try it with this &lt;a href="https://www.zebrium.com/blog/testing-zebrium-with-a-cloud-native-microservices-demo-app" rel="noopener noreferrer"&gt;simple K8s based demo app&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the author - &lt;a href="https://www.zebrium.com/blog/author/ajay-singh" rel="noopener noreferrer"&gt;Ajay Singh, CEO @ Zebrium&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>machinelearning</category>
      <category>sre</category>
    </item>
    <item>
      <title>Testing ML incident detection using a cloud native microservices app
</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Thu, 17 Dec 2020 00:31:32 +0000</pubDate>
      <link>https://forem.com/gdcohen/testing-ml-incident-detection-using-a-cloud-native-microservices-app-24f0</link>
      <guid>https://forem.com/gdcohen/testing-ml-incident-detection-using-a-cloud-native-microservices-app-24f0</guid>
      <description>&lt;p&gt;There is no better way to try Zebrium machine learning incident detection than with a production application that is experiencing a problem. The machine learning will not only detect the problem, but also show its root cause. But no user wants to induce a problem in their app just to experience the magic of our technology! So, although it's second best, an alternative is to try Zebrium with a sample real-life application, break the app and then see what Zebrium detects. One of our customers kindly introduced us to Google's microservices demo app - &lt;a href="https://github.com/GoogleCloudPlatform/microservices-demo"&gt;Online Boutique&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/GoogleCloudPlatform/microservices-demo"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---9lHq9VX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hubfs/Logos/google%2520online%2520boutique%2520microservices%2520app.svg" alt="google online boutique microservices app"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This blog will show you how to install and break the sample app using a local minikube Kubernetes cluster running on your laptop. The entire process, including installing Istio, Prometheus, Kiali and Online Boutique, plus signing up for a Zebrium account and installing the Zebrium log and metrics collectors, takes 20-30 minutes.&lt;/p&gt;

&lt;p&gt;Important: Before starting, you will need to install &lt;a href="https://minikube.sigs.k8s.io/docs/"&gt;minikube&lt;/a&gt; (instructions for Linux, MacOS and Windows &lt;a href="https://minikube.sigs.k8s.io/docs/start/"&gt;here&lt;/a&gt;). You'll also need to install &lt;a href="https://git-scm.com/"&gt;git&lt;/a&gt;, &lt;a href="https://helm.sh/docs/intro/install/"&gt;helm&lt;/a&gt; and curl (Google "curl" for your platform) if you don't already have them. &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Sign-up for a Zebrium account
&lt;/h3&gt;

&lt;p&gt;Now let's get going with your Zebrium account! You can sign-up for a new account &lt;a href="https://www.zebrium.com/sign-up"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OCRzsUem--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Get%2520started%2520with%2520Zebrium%2520for%2520free.png%3Fwidth%3D400%26name%3DGet%2520started%2520with%2520Zebrium%2520for%2520free.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OCRzsUem--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Get%2520started%2520with%2520Zebrium%2520for%2520free.png%3Fwidth%3D400%26name%3DGet%2520started%2520with%2520Zebrium%2520for%2520free.png" alt="Get started with Zebrium for free"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you've entered your details and set a password, you will see the &lt;strong&gt;Zebrium Setup page&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Because this is a demo environment, adjust some default Zebrium settings
&lt;/h3&gt;

&lt;p&gt;The default settings of the Zebrium platform work well for most production environments. However, for the purpose of this demo, we will compensate for the short run time and small amount of data by changing a few default settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set incident sensitivity to high and enable infrastructure incidents:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bTr9I0s0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Zebrium%2520sensitivity.png%3Fwidth%3D539%26name%3DZebrium%2520sensitivity.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bTr9I0s0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Zebrium%2520sensitivity.png%3Fwidth%3D539%26name%3DZebrium%2520sensitivity.png" alt="Zebrium sensitivity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the top RHS of the Setup page in the Zebrium UI (see picture above), click the gear button (1) and select Incidents (2) from the dropdown. Then click "Create" (3) under Infrastructure Incidents (this allows certain types of K8s infrastructure logs to be included in incident detection) and select "high" (4) under Incident Sensitivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change the refractory period&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you break the demo app (see later), it will generate many log events and patterns similar to ones that occurred during the bring-up of your demo environment. For this reason, we will shorten the default refractory period so that the ML knows it's OK to create an incident even if something similar has happened recently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k2EkXx_O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/advanced%2520settings.png%3Fwidth%3D533%26name%3Dadvanced%2520settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k2EkXx_O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/advanced%2520settings.png%3Fwidth%3D533%26name%3Dadvanced%2520settings.png" alt="advanced settings"&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;See picture above - in the top RHS, click the gear button (1) and select Advanced (2) from the dropdown. Set Refractory Period to 10 minutes (3). Finally click the Ze icon (4) in the top LHS to go back to the setup page.&lt;/p&gt;

&lt;p&gt;It's important to note that the above settings are needed to compensate for the short run time and small amount of data in this demo setup. For normal use, you do not need to change these settings.&lt;/p&gt;
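&lt;p&gt;To make the refractory-period idea concrete: it behaves like simple alert de-duplication, where a repeat of the same incident type is suppressed unless enough time has passed since the last alert. Here is a minimal, illustrative shell sketch of that logic (a toy model only, not how Zebrium implements it):&lt;/p&gt;

```shell
# Toy refractory filter: alert on an event only if at least REFRACTORY
# seconds have passed since the last alert; otherwise suppress it.
# Illustrative only -- not Zebrium's implementation.
REFRACTORY=600   # 10 minutes, the value we set in the demo

refractory_filter() {
  last=-1000000
  while read t; do
    gap=$((t - last))
    if [ "$gap" -ge "$REFRACTORY" ]; then
      echo "ALERT at $t"
      last=$t
    else
      echo "suppressed at $t (${gap}s since last alert)"
    fi
  done
}

# Three occurrences of the same incident type (timestamps in seconds):
printf '%s\n' 1000 1200 1700 | refractory_filter
```

&lt;p&gt;With a longer refractory period, the repeat occurrences caused by breaking the demo app would be suppressed; shortening it to 10 minutes lets them through.&lt;/p&gt;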

&lt;h3&gt;
  
  
  2. Start minikube with enough resources
&lt;/h3&gt;

&lt;p&gt;Note the -p option on all minikube commands: we use a separate minikube profile named "boutique", which makes it easier to clean up when you're done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;minikube start --cpus=4 --memory 4096 --disk-size 32g -p boutique
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
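&lt;p&gt;If minikube fails to start with resource errors, first check that your machine can actually satisfy the --cpus=4 request. A quick sanity check (a sketch that assumes a Linux host where nproc is available):&lt;/p&gt;

```shell
# The cluster requests 4 CPUs and 4 GiB of memory; warn if the host
# has fewer CPUs than requested.
REQUIRED_CPUS=4
AVAILABLE_CPUS=$(nproc)
if [ "$AVAILABLE_CPUS" -lt "$REQUIRED_CPUS" ]; then
  echo "only $AVAILABLE_CPUS CPUs available; minikube start may fail"
else
  echo "ok: $AVAILABLE_CPUS CPUs available"
fi
```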



&lt;p&gt;In order to make the frontend IP address of the Online Boutique app accessible (needed later), you will need to run the "minikube tunnel" command in a separate terminal window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Make sure you run this command in a different window
minikube tunnel -p boutique
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Install the Zebrium log and metrics collectors
&lt;/h3&gt;

&lt;p&gt;Go to the Zebrium Setup page in your browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6-vnkIiI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Zebrium%2520k8s%2520setup%2520page.png%3Fwidth%3D568%26name%3DZebrium%2520k8s%2520setup%2520page.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6-vnkIiI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/Zebrium%2520k8s%2520setup%2520page.png%3Fwidth%3D568%26name%3DZebrium%2520k8s%2520setup%2520page.png" alt="Zebrium k8s setup page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start by clicking on "Kubernetes" under Log Collector Setup. This will produce a popup similar to the picture below. Select the "Helm v3" install method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--txa-eO3V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/zebrium%2520log%2520collector%2520setup.png%3Fwidth%3D481%26name%3Dzebrium%2520log%2520collector%2520setup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--txa-eO3V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/zebrium%2520log%2520collector%2520setup.png%3Fwidth%3D481%26name%3Dzebrium%2520log%2520collector%2520setup.png" alt="zebrium log collector setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now copy and paste the install commands from the Zebrium UI. When installing zlog-collector, set "zebrium.deployment" to a name like "boutique" and delete the part of the line that sets zebrium.timezone. See the example below (make sure you use the token from your own Zebrium UI):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install the Zebrium log collector by copying and pasting commands from the Zebrium UI.
kubectl create namespace zebrium
helm install zlog-collector zlog-collector --namespace zebrium --repo https://raw.githubusercontent.com/zebrium/ze-kubernetes-collector/master/charts --set zebrium.collectorUrl=https://zapi03.zebrium.com,zebrium.authToken=XXXXX,zebrium.deployment=boutique
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now install the Zebrium metrics collector by clicking the Kubernetes button under "Metrics Collector Setup" in the Zebrium Setup UI. Once again, use the Helm v3 method and copy and paste the commands from the UI popup. When executing the install command for zstats-collector, use the same value for "zebrium.deployment" that you used above ("boutique"):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install the Zebrium metrics collector by copying and pasting commands from the Zebrium UI.
helm repo add stable https://charts.helm.sh/stable
helm repo update
helm install node-exporter --namespace zebrium stable/prometheus-node-exporter
helm install zstats-collector zstats --namespace zebrium --repo https://raw.githubusercontent.com/zebrium/ze-stats/master/charts --set zebrium.collectorUrl=https://zapi03.zebrium.com/stats/api/v1/zstats,zebrium.authToken=XXXX,zebrium.deployment=boutique
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Zebrium ML will begin receiving and structuring logs and metrics from your newly created K8s environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Install Istio, Prometheus and Kiali 
&lt;/h3&gt;

&lt;p&gt;More detailed instructions for installing the &lt;a href="https://istio.io/"&gt;Istio&lt;/a&gt; service mesh can be found &lt;a href="https://istio.io/latest/docs/setup/getting-started/"&gt;here&lt;/a&gt;. Istio and Prometheus aren't actually needed for the demo app, but they enable the use of &lt;a href="https://kiali.io/"&gt;Kiali&lt;/a&gt;, which gives you a really nice graphical view of the environment!&lt;/p&gt;

&lt;p&gt;First download Istio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Make a directory for this environment
mkdir onlineboutique
cd onlineboutique

# Get the latest version of Istio
curl -L https://istio.io/downloadIstio | sh -

# Check the name of the Istio directory that was created
ls

# go into Istio directory (name in ls output)
cd istio-1.8.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now install Istio and Prometheus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#install istio. Note: if on a Mac and you get a message about istioctl being from an unidentified developer, see the note above.
kubectl create namespace istio-system
export PATH=$PWD/bin:$PATH
istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled

# Install Prometheus
kubectl apply -f ./samples/addons/prometheus.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you're ready to install and bring up Kiali:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Make sure you are still in the Istio directory from the steps above
kubectl apply -f ./samples/addons/kiali.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important - you might see a bunch of errors saying something like: "&lt;strong&gt;unable to recognize...&lt;/strong&gt;". If so, this is a &lt;a href="https://github.com/istio/istio/issues/27577"&gt;known bug&lt;/a&gt;. To fix this, run the apply command again and you should see a few "...created" messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f ./samples/addons/kiali.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that everything is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify that Istio, Prometheus and Kiali pods are running. You should see something similar to below:
kubectl get pods -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-egressgateway-d84f95b69-zghjf     1/1     Running   0          20m
istio-ingressgateway-75f6d79f48-zcpk2   1/1     Running   0          20m
istiod-c9f6864c4-q68bj                  1/1     Running   0          21m
kiali-7476977cf9-jkz6b                  1/1     Running   0          15m
prometheus-7bfddb8dbf-8sg46             2/2     Running   0          19m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now bring up the &lt;a href="https://kiali.io/"&gt;Kiali&lt;/a&gt; UI. It will appear in a new tab in your browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bring up the Kiali UI (this will open the UI in a new browser tab)
istioctl dashboard kiali &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Time to install and fire up the Online Boutique app
&lt;/h3&gt;

&lt;p&gt;The app has 12 services (adservice, cartservice, checkoutservice, currencyservice, emailservice, frontend, loadgenerator, paymentservice, productcatalogservice, recommendationservice, redis-cart and shippingservice) and will take a few minutes to start up. While starting up, you might see some of the pods enter Error/CrashLoopBackOff states a few times. Make sure you wait until they are all in a Running state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Go back to the directory you created above (onlineboutique)
cd ..

# Clone the Online Boutique repository
git clone https://github.com/GoogleCloudPlatform/microservices-demo.git
cd microservices-demo

# Install the app.
kubectl apply -f ./release/kubernetes-manifests.yaml

# Check to see if everything has started - this takes a few minutes. Keep checking and don't move on until all pods are in a running state
kubectl get pods
adservice-5f6f7c76f5-mnn2v               2/2     Running   0          4m18s
cartservice-675b6659c8-nzrnb             2/2     Running   2          4m19s
checkoutservice-85d4b74f95-jm4z8         2/2     Running   0          4m20s
currencyservice-6d7f8fc9fc-l74nc         2/2     Running   0          4m19s
emailservice-798f4f5575-b72s6            2/2     Running   0          4m20s
frontend-6b64dc9665-g22mp                2/2     Running   0          4m19s
loadgenerator-7747b67b5-8946m            2/2     Running   4          4m19s
paymentservice-98cb47fff-rxqjm           2/2     Running   0          4m19s
productcatalogservice-7f857c47f-kml88    2/2     Running   0          4m19s
recommendationservice-5bf5bcbbdf-9g5l2   2/2     Running   0          4m20s
redis-cart-74594bd569-vbx5h              2/2     Running   0          4m18s
shippingservice-75f7f9dc6c-sfczx         2/2     Running   0          4m18s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
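&lt;p&gt;Rather than eyeballing the list, you can check the STATUS column programmatically. The helper below is a sketch (not part of the official walkthrough): it reads the output of "kubectl get pods --no-headers" on stdin and succeeds only when every pod is Running.&lt;/p&gt;

```shell
# all_running: succeeds only when every pod's STATUS column (field 3 of
# `kubectl get pods --no-headers` output) is Running. Illustrative helper.
all_running() {
  ! awk '{ print $3 }' | grep -qv '^Running$'
}

# Demo with canned output; in practice: kubectl get pods --no-headers | all_running
if printf 'frontend-6b64dc9665-g22mp 2/2 Running 0 4m\nredis-cart-74594bd569-vbx5h 2/2 Running 0 4m\n' | all_running; then
  echo "all pods running"
else
  echo "still waiting"
fi
```

&lt;p&gt;You can then poll in a loop, e.g. "until kubectl get pods --no-headers | all_running; do sleep 5; done".&lt;/p&gt;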



&lt;p&gt;Once all the services are Running, you can bring up the app in your browser. You will need to get the frontend IP address by running the command below (make sure you didn't forget the "minikube tunnel" command from the minikube setup step above, or this won't work).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#get IP address for boutique and then open EXTERNAL-IP in a browser tab (sample output below)
kubectl get service/frontend-external
NAME                TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
frontend-external   LoadBalancer   10.99.208.30   10.99.208.30   80:32326/TCP   6h8m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
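&lt;p&gt;If you prefer not to copy the address by hand, you can extract the EXTERNAL-IP column with a little awk. This is a sketch: the canned output below mirrors the sample above, and in practice you would pipe the real kubectl output in.&lt;/p&gt;

```shell
# get_external_ip: pulls field 4 (EXTERNAL-IP) from the second line of
# `kubectl get service/frontend-external` output.
get_external_ip() {
  awk 'NR == 2 { print $4 }'
}

printf 'NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\nfrontend-external LoadBalancer 10.99.208.30 10.99.208.30 80:32326/TCP 6h8m\n' | get_external_ip
```

&lt;p&gt;kubectl can also do this directly with jsonpath: kubectl get service frontend-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}'&lt;/p&gt;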



&lt;p&gt;Now open the EXTERNAL-IP address in a new browser tab, and you should see the Online Boutique app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2o8WzhGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/online%2520boutique%2520screen%2520shot.png%3Fwidth%3D621%26name%3Donline%2520boutique%2520screen%2520shot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2o8WzhGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/online%2520boutique%2520screen%2520shot.png%3Fwidth%3D621%26name%3Donline%2520boutique%2520screen%2520shot.png" alt="online boutique screen shot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Kiali tab in your browser, click Graph. In the Display dropdown, select "Traffic animation". You should see something like the picture below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EeIBrcE---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/kiali%2520screen%2520shot.png%3Fwidth%3D623%26name%3Dkiali%2520screen%2520shot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EeIBrcE---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/kiali%2520screen%2520shot.png%3Fwidth%3D623%26name%3Dkiali%2520screen%2520shot.png" alt="kiali screen shot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Now go and get a cup of your favorite beverage and come back in 10 minutes. Yes, I'm serious! This will give the Zebrium ML a chance to learn the structures and patterns that occur under normal running conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Break the Online Boutique app
&lt;/h3&gt;

&lt;p&gt;In the Kiali dashboard, you can see that a lot of traffic moves through the "productcatalogservice" (select Traffic Animation under the Display dropdown in the Kiali graph dashboard). So let's kill the productcatalogservice pod!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Kill the productcatalogservice pod by scaling it to zero
kubectl scale deploy productcatalogservice --replicas=0; date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
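&lt;p&gt;Since you'll want to match the break against the incident timestamp in the Zebrium UI later, it can help to capture the time in UTC explicitly. A small convenience sketch (the restore command at the end is not part of the original walkthrough):&lt;/p&gt;

```shell
# Capture the break time in UTC so it is easy to correlate with the
# incident timestamp shown in the Zebrium UI.
BREAK_TIME=$(date -u '+%Y-%m-%d %H:%M:%S UTC')
echo "App broken at: $BREAK_TIME"

# When you are done observing the failure, the service can be restored
# by scaling back up:
#   kubectl scale deploy productcatalogservice --replicas=1
```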



&lt;p&gt;Note down the time from the output of the "date" command in the step above. Go to your browser and you should see that the app no longer works and the Kiali dashboard should show a lot of red:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rynNtboQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/online%2520boutique%2520app%2520broken.png%3Fwidth%3D512%26name%3Donline%2520boutique%2520app%2520broken.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rynNtboQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/online%2520boutique%2520app%2520broken.png%3Fwidth%3D512%26name%3Donline%2520boutique%2520app%2520broken.png" alt="online boutique app broken"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U7e1UVbO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/kiali%2520dashbaord%2520showing%2520failure.png%3Fwidth%3D460%26name%3Dkiali%2520dashbaord%2520showing%2520failure.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U7e1UVbO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/kiali%2520dashbaord%2520showing%2520failure.png%3Fwidth%3D460%26name%3Dkiali%2520dashbaord%2520showing%2520failure.png" alt="kiali dashbaord showing failure"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  7. The results
&lt;/h3&gt;

&lt;p&gt;Since this is a brand-new Zebrium instance, things can take a bit longer than usual, so it could be 10 minutes or more before Zebrium detects the problem. Also, since there are many new/rare patterns in the logs, and because incident sensitivity is set to high, you will likely get a bunch of new incidents even though not all of them reflect real problems.&lt;/p&gt;

&lt;p&gt;You might also notice that the relevant incident is incomplete when it is first created (it might not be as detailed as the example below). Give it some time and the detail of the incident should improve, because the machine learning continues to refine its model over the next few hours.&lt;/p&gt;

&lt;p&gt;When Zebrium does detect the incident, you will get a Slack alert (you should have received an email to join the Zebrium community Slack workspace). You can also click on the &lt;strong&gt;incidents tab&lt;/strong&gt; in the Zebrium UI at any time to see a list of incidents that have been detected. This is what my environment looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IIIkZeZf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/zebrium%2520incidents%2520list.png%3Fwidth%3D2002%26name%3Dzebrium%2520incidents%2520list.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IIIkZeZf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/zebrium%2520incidents%2520list.png%3Fwidth%3D2002%26name%3Dzebrium%2520incidents%2520list.png" alt="zebrium incidents list"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The incident with the red box is the one that we induced. Here's how to understand the incident list (see picture above):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  1 - shows time of the incident&lt;/li&gt;
&lt;li&gt;  2 - shows hosts and logs that the incident spans&lt;/li&gt;
&lt;li&gt;  3 - shows the First event in the incident. This often gives a clue of the root cause.&lt;/li&gt;
&lt;li&gt;  4 - shows the Worst event in the incident. This is usually the event that a human would think of as being the most serious event in the incident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, to see details of what was detected, click on "&lt;strong&gt;INCIDENT REPORT&lt;/strong&gt;" and you should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0t-iCoqN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/incident%2520details.png%3Fwidth%3D2002%26name%3Dincident%2520details.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0t-iCoqN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/incident%2520details.png%3Fwidth%3D2002%26name%3Dincident%2520details.png" alt="incident details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  1 - shows the events that make up this incident. They tell the story of what happened. Note in particular this one which tells us the root cause "Deleted pod: productcatalogservice-7f857c47f-n9cxn":&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4OY0NoOI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/root%2520cause.png%3Fwidth%3D1674%26name%3Droot%2520cause.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4OY0NoOI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/root%2520cause.png%3Fwidth%3D1674%26name%3Droot%2520cause.png" alt="root cause"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  2 - shows related metrics anomalies. You can see that everything suddenly drops at the same time the pod was deleted.&lt;/li&gt;
&lt;li&gt;  3 - shows a timeline of the incident. You can click on any of the dots to go to that particular event (this is very useful if you turn the filter off - see below)&lt;/li&gt;
&lt;li&gt;  4 - is the Show Nearby button. It will bring in additional anomalies and errors that our ML has detected nearby. This often helps to provide more detail on the incident.&lt;/li&gt;
&lt;li&gt;  5 - is the Filter button for the incident. Click the green filter button to turn off incident filtering. This will show you all the log events around your current position.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. Optional clean-up of the minikube K8s instance
&lt;/h3&gt;

&lt;p&gt;When you're done testing the microservices app, you can delete the entire minikube K8s cluster with the following commands. Warning: you can't undo this step!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stop and delete the minikube K8s cluster - WARNING: you can't undo this!
minikube stop -p boutique
minikube delete -p boutique
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The above steps provide an easy way to test Zebrium's machine learning technology - just fire up a demo app, break the app and then see how Zebrium detects the problem and its root cause. But don't get lost in the weeds! The most important thing to remember is that the problem was detected by our machine learning &lt;strong&gt;without any prior understanding of your environment, and with absolutely no human built rules&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Zebrium ML technology works by learning the structures and patterns in your logs and metrics. It then finds incidents by looking for hotspots of abnormally correlated anomalous patterns across your logs and metrics. More detail about how it works can be found &lt;a href="https://www.zebrium.com/product/how-zebrium-works"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We encourage you to continue exploring the Zebrium platform with the demo environment you have built. But really the best way to see the magic of Zebrium is to try it with your real application - you'll be amazed at what it finds!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>sre</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Virtual tracing: A simpler alternative to distributed tracing for troubleshooting</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Thu, 23 Jul 2020 18:48:21 +0000</pubDate>
      <link>https://forem.com/gdcohen/virtual-tracing-a-simpler-alternative-to-distributed-tracing-for-troubleshooting-4en8</link>
      <guid>https://forem.com/gdcohen/virtual-tracing-a-simpler-alternative-to-distributed-tracing-for-troubleshooting-4en8</guid>
      <description>&lt;h2&gt;
  
  
  The promise of tracing
&lt;/h2&gt;

&lt;p&gt;Distributed tracing is commonly used in Application Performance Monitoring (APM) to monitor and manage application performance, giving a view into what parts of a transaction call chain are slowest. It is a powerful tool for monitoring call completion times and examining particular requests and transactions.&lt;/p&gt;

&lt;p&gt;Quite beyond APM, it seems natural to expect tracing to yield a 'troubleshooting tool to rule them all'. The detail and semantic locality of the trace, coupled with the depth of the service call graph, generate this expectation. It's obvious that correlating across services gives diagnostic power. The typical traceview screams the sort of "first this, then that" narrative RCAs are made of.&lt;/p&gt;

&lt;p&gt;In reality, though, users have seen mixed results. According to a prospect:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"We paid six figures for a  contract, to reduce MTTR significantly. We've given up on that. We're writing it off. Engineers had to do extra work to implement it. Operations had to do extra work to set and respond to alerts. The alerts worked well but, in the end, finding root-cause was still slow. It just wasn't worth it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is not isolated feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Tell me where to focus"
&lt;/h2&gt;

&lt;p&gt;There are two issues raised: work required to yield results, and inadequacy of those results. The first issue is easy to understand: depending on the stack, the application, the application's evolution, the deployment mechanisms and so on, it may indeed be a lot of work to generate useful traces. This problem is not surprising and can be surmounted; it is worth taking away that zero-configuration, zero-instrumentation (i.e., &lt;em&gt;autonomous&lt;/em&gt;) solutions are of great value.&lt;/p&gt;

&lt;p&gt;The larger issue is around applicability for root-cause detection. Here, respected Observability author and blogger Cindy Sridharan has some insightful things to say. I'll lift a quote &lt;a href="https://medium.com/@copyconstruct/distributed-tracing-weve-been-doing-it-wrong-39fc92a857df" rel="noopener noreferrer"&gt;from her article&lt;/a&gt; which I highly recommend:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"What's ideally required at the time of debugging is a tool that'll help reduce the search space... Instead of seeing an entire trace, what I really want to be seeing is a portion of the trace where something interesting or unusual is happening... dynamically generated service topology views based on specific attributes like error rate or response time..."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2Fdistributed%2520tracing.png%3Fwidth%3D900%26name%3Ddistributed%2520tracing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2Fdistributed%2520tracing.png%3Fwidth%3D900%26name%3Ddistributed%2520tracing.png" alt="distributed tracing"&gt;&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We need a tool that takes a stab at incident and, if possible, root-cause detection, or highlighting. Stepping back further, I would posit that trace data has become a bit like log data: a treasure trove of untapped information for root-cause detection, in part because there is just too much detail to sift through without help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual tracing alone
&lt;/h2&gt;

&lt;p&gt;Tracing is so promising for troubleshooting because of semantic locality. Tracing works the way it does because span ids are passed around the stack - in headers or log lines, for example. These spans are then associated together into a trace, and so we know all the spans in the trace are at least to some extent "about" the same thing - a transaction or request, for example.&lt;/p&gt;

&lt;p&gt;There are other ways to establish semantic locality. Looking again at a trace, we might note that its spans demonstrate temporal locality, as well. We might suppose that a majority of service-impacting incidents also have temporal locality: if Service A fails, and Service B calls Service A, then Service B is likely also to fail, soon after Service A does. Our job will be to determine spans of telemetric data that share semantic locality.&lt;/p&gt;

&lt;p&gt;Could we suppose that events happening nearby in time are semantically related? Of course not... but, we could look for features in &lt;em&gt;ordinary telemetry&lt;/em&gt; - "rareness" or "badness", for example - and model inter-occurrence intervals of such features across services. Rareness might be indicated by rare log events from a given container, for example, or a multi-hour peak in a metric; badness might be indicated by a flurry of errors from another container, or a host log.&lt;/p&gt;

&lt;p&gt;Machine learning could observe the ordinary behavior of the system to estimate parameters for our model, and then use the model to hypothesize which features ARE semantically related, with high probability. Related features would in this way correspond to a virtual trace; each virtual span would map directly to a timespan of telemetric data capture for a single generator - host, container, or service, for example.&lt;/p&gt;

&lt;h2&gt;
  
  
  A virtual tracing example
&lt;/h2&gt;

&lt;p&gt;A database server is inadvertently shut down. A number of rare events are emitted into its log stream; a rare spike in available memory occurs on the host(s). A few seconds later, a flurry of unusual errors is emitted in the log stream of a service reliant on the database, because it cannot connect to the database.&lt;/p&gt;

&lt;p&gt;Say, rare events in the database log stream happen at random about once per day, on average; rare errors in the consumer's log stream happen about once every hour; comparable spikes in available memory happen twice a day. But now, all three of these rare things happened within 3 seconds.&lt;/p&gt;
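&lt;p&gt;A quick back-of-envelope calculation shows why that coincidence is so striking. Assuming the three event sources are independent and roughly Poisson, the chance that the hourly error flurry and one of the twice-a-day memory spikes both land in the same 3-second window as a given daily rare event is tiny:&lt;/p&gt;

```shell
# Probability that the ~hourly event and a ~twice-daily spike both fall
# within a given 3-second window, assuming independence.
awk 'BEGIN {
  window   = 3                    # seconds
  p_hourly = window / 3600        # hourly event lands in the window
  p_spike  = window * 2 / 86400   # twice-a-day spike lands in the window
  printf "joint probability: %.2e\n", p_hourly * p_spike
}'
```

&lt;p&gt;Roughly a six-in-a-hundred-million chance per window, which is exactly the kind of improbable temporal clustering the model treats as evidence of semantic locality.&lt;/p&gt;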

&lt;p&gt;The ML model decides these all should be part of a virtual trace, as a result; we construct spans of related activity on each generator - contiguous timespans, in fact - and bundle them up. If there was enough badness, we notify; we fingerprint to keep track of separate "trace types". In this way we've achieved the goals of autonomous incident and root-cause detection; we can present only the data and services where attention should be focused.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2Fvifrtual%2520tracing.png%3Fwidth%3D900%26name%3Dvifrtual%2520tracing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2Fvifrtual%2520tracing.png%3Fwidth%3D900%26name%3Dvifrtual%2520tracing.png" alt="vifrtual tracing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual and instrumented tracing combined
&lt;/h2&gt;

&lt;p&gt;Combining these approaches yields real opportunities for improving the user experience. The instrumented trace augments the virtual one. For example, the service graph can be used to exclude virtual spans from a virtual trace based on implausibility of the related generator causing the observed badness; the full trace can give deep performance context to the virtual trace and serve as a jumping-off point to navigate to other services and/or similar traces. Instrumented tracing brings precision, depth, and broad context to a virtual trace.&lt;/p&gt;

&lt;p&gt;Similarly, the virtual trace augments the instrumented one. We can hone or auto-tune our alerting based on badness seen in the virtual trace; we can hone the traceview to just those services and components touching the virtual trace, and then allow the user to expand outward, if need be. Virtual tracing brings autonomy, incident detection, and root-cause indication to an instrumented trace.&lt;/p&gt;

&lt;p&gt;Zebrium has built an implementation of virtual tracing into its autonomous log management and monitoring platform. &lt;a href="https://www.zebrium.com/" rel="noopener noreferrer"&gt;You can read more about it here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the author: &lt;a href="https://www.zebrium.com/blog/author/larry-lancaster" rel="noopener noreferrer"&gt;Larry Lancaster&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Augment a PagerDuty Incident with Root Cause</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Fri, 17 Jul 2020 19:16:08 +0000</pubDate>
      <link>https://forem.com/gdcohen/augment-a-pagerduty-incident-with-root-cause-4n1c</link>
      <guid>https://forem.com/gdcohen/augment-a-pagerduty-incident-with-root-cause-4n1c</guid>
      <description>&lt;p&gt;I wanted to give you an update on my last blog on MTTR by showing you our &lt;a href="https://www.pagerduty.com/integrations/zebrium/"&gt;PagerDuty Integration&lt;/a&gt; in action.&lt;/p&gt;

&lt;p&gt;As I said before, you probably care a lot about Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). You're also no doubt familiar with monitoring, incident response, war rooms and the like. Which of us hasn't been ripped out of bed or torn away from family or friends at the most inopportune times? I know firsthand from running world-class support and SRE organizations that it all boils down to two simple things: 1) all software systems have bugs and 2) it's all about how you respond. While some customers may be sympathetic to #1, without exception, all of them still expect early detection, acknowledgement of the issue and near-immediate resolution. Oh, and it better not ever happen again!&lt;/p&gt;

&lt;h2&gt;
  
  
  Fast MTTD. Check!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pagerduty.com/"&gt;PagerDuty&lt;/a&gt; is clearly a leader in Incident Response and on-call Escalation Management. There are over 300 integrations for PagerDuty to analyze digital signals from virtually any software-enabled system to detect and pinpoint issues across your ecosystem. When an Incident is created through one of these monitoring integrations, say an APM tool, PagerDuty will mobilize the right team in seconds (read: this is when you get the "page" during your daughter's 5th birthday party). Fast MTTD. Check!&lt;/p&gt;

&lt;h2&gt;
  
  
  But what about MTTR? By &lt;em&gt;R&lt;/em&gt;, I mean Resolve
&lt;/h2&gt;

&lt;p&gt;All great incident response tools have workflow and runbook automation mechanisms that can help &lt;strong&gt;restore&lt;/strong&gt; system operation quickly in some typical cases. But this doesn't get to the root cause, so you can't understand and &lt;strong&gt;resolve&lt;/strong&gt; the issue and prevent it from happening again. For that, enter the all-too-common "War Room". The term was first coined in 1901, but it was probably made most famous by Winston Churchill during WWII, when the Cabinet War Room was the epicenter of intelligence gathering, data analysis and communications throughout the war. Back then it was the telegraph, phones, radio signals and maps on the wall. Today it's likely a virtual room using Zoom, Slack, real-time visualizations, cell phones, and most importantly, logs. Millions and millions of logs! But the prevailing attitude of the War Room has remained unchanged - "If you're going through hell, keep going".&lt;/p&gt;

&lt;p&gt;The initial signal that triggered the incident was likely from an alert which detected that a predefined threshold was "out of tolerance", however simple or complex that threshold may be. Or perhaps it came from an alert you defined in a ping tool, or from some home-grown live tail watching for spikes in error counts or similar patterns. Whatever the means, it was based on predefined rule(s) that detect symptoms of a problem. In most cases, the next step is determining the root cause. For example, the symptom that triggered the signal might have been that latency was too high, but this tells you nothing about the root cause.&lt;/p&gt;

&lt;p&gt;Whether you formalize a designated War Room for a particular incident or not, two things are certain: 1) timely, thorough and accurate communication between team members is paramount and 2) You're likely going to search through logs and drill down on various metrics to get to the root cause. And this &lt;a href="https://www.zebrium.com/blog/is-log-management-still-the-best-approach"&gt;brute force searching through logs&lt;/a&gt; and metrics is probably your hell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop going through Hell
&lt;/h2&gt;

&lt;p&gt;Many monitoring tools have started to utilize various machine learning and anomaly detection techniques to raise that first signal to trigger an incident response, often via tools like PagerDuty. However, these techniques still require too much human input to handpick which metrics to monitor, and to choose specific algorithms or tuning parameters. Anomaly detection in monitoring tools is predominantly geared towards time-series data and rarely covers logs. Yet logs are indispensable for root cause analysis. This leaves these tools blind to the root cause of any issue, ultimately requiring time-consuming drill-down and hunting through the logs. Millions and millions of logs!&lt;/p&gt;

&lt;p&gt;By contrast, Zebrium's machine learning detects correlated anomalies and patterns in both logs and metrics and uses them to automatically catch and characterize critical incidents and show you root cause (see &lt;a href="https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-detection-you-actually-wanted"&gt;The Anomaly Detection You Actually Wanted&lt;/a&gt;). This means faster MTTR and no more hunting for root cause! We call this &lt;a href="https://www.zebrium.com/blog/beyond-anomaly-detection-how-incident-recognition-drives-down-mttr"&gt;Incident Recognition&lt;/a&gt; and it's part of our Autonomous Monitoring platform.&lt;/p&gt;

&lt;p&gt;Now, let me show you how you can tie your existing incident management workflow together with automatic root cause identification, regardless of the triggering signal -- for fast incident resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Augmenting detection with Zebrium automatic root cause identification
&lt;/h2&gt;

&lt;p&gt;Zebrium uses unsupervised machine learning to &lt;a href="https://www.zebrium.com/blog/log-metrics-anomaly-detection-as-a-foundation-of-autonomous-monitoring"&gt;automatically detect and correlate anomalies and patterns across both logs and metrics&lt;/a&gt;. These signals form the basis for automated Incident Detection and Root Cause identification.&lt;br&gt;
In addition to autonomous monitoring, we can also consume external signals to inform our Incident Detection. Imagine any one of the monitoring tools in your PagerDuty ecosystem has created an Incident and you get "The Call" (at one of those inopportune moments). What happens? Well, you probably already know the pain that lies ahead. But what if instead you looked at the PagerDuty Incident or your Slack channel and a full report of the anomalous logs and metrics surrounding the incident -- including the root cause -- was already there, at your fingertips?&lt;/p&gt;
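&lt;p&gt;The correlation of an external signal with already-detected incidents can be pictured like this - a simplified sketch with illustrative names; the real pipeline and window policy are Zebrium's own:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def correlate_signal(signal_time, detected_incidents, lookback_min=30):
    """Match an external (e.g. PagerDuty) incident signal against
    incidents already detected in the preceding lookback window.

    detected_incidents: list of (detect_time, incident_id) pairs.
    Returns matching incident ids, most recent first.
    """
    window_start = signal_time - timedelta(minutes=lookback_min)
    matches = [(t, iid) for t, iid in detected_incidents
               if window_start <= t <= signal_time]
    return [iid for t, iid in sorted(matches, reverse=True)]
```

A signal arriving at 09:55 would pick up any incidents detected since 09:25 and surface the freshest first.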

&lt;h3&gt;
  
  
  &lt;strong&gt;Walkthrough: An APM-triggered PagerDuty Incident gets augmented with root cause&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Here's how it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Your AppDynamics APM tool detects a Critical Health Rule Violation: Login Time Exceeds 60 seconds (you can see that in the PagerDuty Incident below) and sets everything in motion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Through an existing integration with PagerDuty, an incident is created and the escalation policy fires (the war room is now open) and you see this in PagerDuty:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AppDynamics Triggered an Incident in PagerDuty&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9dH9-SZA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/pd_01_1200.png%3Fwidth%3D1186%26name%3Dpd_01_1200.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9dH9-SZA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/pd_01_1200.png%3Fwidth%3D1186%26name%3Dpd_01_1200.png" alt="pd_01_1200"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you also have a PagerDuty/Slack integration, you'll see something like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PagerDuty Notifies Slack with your AppDynamics Incident&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t-zIOnpX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/slack_01_700.png%3Fwidth%3D692%26name%3Dslack_01_700.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t-zIOnpX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/slack_01_700.png%3Fwidth%3D692%26name%3Dslack_01_700.png" alt="slack_01_700"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At that same instant, PagerDuty automatically sends a signal to Zebrium with all the incident details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At this point, &lt;strong&gt;Zebrium does three things&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first thing&lt;/strong&gt; Zebrium does is correlate the PagerDuty incident details with its Autonomous Incident Detection and Root Cause by looking across logs and metrics over the past half hour for any Incidents it has already detected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;And in this case, Zebrium has detected what looks to be a relevant incident. The PagerDuty incident is updated with the Zebrium Incident details and likely root cause via the PagerDuty API.  Here's what that PagerDuty update looks like in Slack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zebrium detected an Incident and Root Cause and has updated PagerDuty and Slack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CsKIMpLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/slack_02_700.png%3Fwidth%3D692%26name%3Dslack_02_700.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CsKIMpLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/slack_02_700.png%3Fwidth%3D692%26name%3Dslack_02_700.png" alt="slack_02_700"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you need to drill down further, it's just one click from either the Slack channel or your PagerDuty Incident. Let's take a look...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zebrium Incident Drill Down&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VqEai2a0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/zeb_01_1200.png%3Fwidth%3D1187%26name%3Dzeb_01_1200.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VqEai2a0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/zeb_01_1200.png%3Fwidth%3D1187%26name%3Dzeb_01_1200.png" alt="zeb_01_1200"&gt;&lt;/a&gt;\&lt;br&gt;
Looking at our Zebrium Incident, your attention is immediately drawn to the Hallmark event in red. This is what we believe is the most relevant and important anomalous event in the Incident. When we look closer, we see that Java "Cannot create a session". That seems very closely related to our APM alert "Login Time Exceeds 60 seconds".&lt;/p&gt;

&lt;p&gt;At the top of the incident we also see a correlated metric anomaly in the JVM Pool. In fact, we see it drop a couple of times. We might think that's the issue, but it's not a root cause.&lt;/p&gt;

&lt;p&gt;We would typically see the root cause occurring near the beginning of a Zebrium Incident timeline. So, looking up the list of events, we see the kernel invoking the OOM-killer on a process called oom_test. And this is in fact the root cause: we had started oom_test to keep consuming memory until it was killed.&lt;/p&gt;

&lt;p&gt;The Zebrium Autonomous Monitoring Platform identified this Incident and Root Cause completely unsupervised. There were no predefined alert rules and there was no human intervention whatsoever (other than starting the oom_test program).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;strong&gt;second (really cool) thing&lt;/strong&gt; Zebrium does is create a Synthetic Incident. While it's very likely Zebrium has already detected the Incident and Root Cause automatically, we will additionally take the Signal from the APM Incident and create a new Zebrium Incident with any further anomalous events or metrics around the time of the signal to make sure this information is also easily at hand. This often proves very useful to the person doing the troubleshooting. And indeed, you can see that happened and we've added a note to the PagerDuty Incident and Slack channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zebrium Synthetic Incident Created and Updated in PagerDuty&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VjV_3YcN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/slack_03_700.png%3Fwidth%3D692%26name%3Dslack_03_700.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VjV_3YcN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/slack_03_700.png%3Fwidth%3D692%26name%3Dslack_03_700.png" alt="slack_03_700"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, &lt;strong&gt;the third thing&lt;/strong&gt; we'll do is keep an eye on things for the next thirty minutes and continue to update the PagerDuty Incident with any new Incidents we identify.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So let's recap the timeline from the PagerDuty Incident shown below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;09:55&lt;/strong&gt; - AppDynamics detected a Health Rule Violation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;09:55&lt;/strong&gt; - Seconds later, PagerDuty creates the Incident and signals Zebrium right away&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;09:55&lt;/strong&gt; - Less than a minute later, Zebrium has updated the PagerDuty Incident with a link to the Zebrium Incident and ultimately the Root Cause that had already been identified in the past half hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10:07&lt;/strong&gt; - Zebrium creates a Synthetic Incident with additional details and updates the PagerDuty Incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10:25&lt;/strong&gt; - Zebrium continues to watch for additional incidents for 30 minutes after the signal is received.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BF5Ua79M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/pd_02_1200.png%3Fwidth%3D1187%26name%3Dpd_02_1200.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BF5Ua79M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Blogs/pd_02_1200.png%3Fwidth%3D1187%26name%3Dpd_02_1200.png" alt="pd_02_1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Summary of the Overall Workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eBUBX2aF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/PagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png%3Fwidth%3D818%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eBUBX2aF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/PagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png%3Fwidth%3D818%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png" alt="PagerDuty with Zebrium to augment incidents with root cause"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Zebrium take care of MTTR
&lt;/h2&gt;

&lt;p&gt;Using PagerDuty, Zebrium can now augment existing incidents that have been detected by any 3rd-party tool. In doing so, your incident will be automatically updated with details of root cause without all the hunting, scrambling and adrenaline that is normally associated with a war room!&lt;/p&gt;

&lt;p&gt;You can get started for free by visiting either &lt;a href="https://www.zebrium.com/"&gt;https://www.zebrium.com&lt;/a&gt; or &lt;a href="https://www.pagerduty.com/integrations/zebrium"&gt;https://www.pagerduty.com/integrations/zebrium&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now that's MTTR!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Posted with permission of the author: &lt;a href="https://www.zebrium.com/blog/author/rod-bagg"&gt;Rod Bagg&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>This Slack App Speeds Up Incident Resolution Using ML</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Wed, 08 Jul 2020 18:37:05 +0000</pubDate>
      <link>https://forem.com/gdcohen/this-slack-app-speeds-up-incident-resolution-using-ml-56el</link>
      <guid>https://forem.com/gdcohen/this-slack-app-speeds-up-incident-resolution-using-ml-56el</guid>
      <description>&lt;p&gt;If your team (like many others) uses Slack to collaborate during incident management and triage, this new Slack app will be a big-time saver.&lt;/p&gt;

&lt;p&gt;Let's say a monitoring tool in your environment triggered an incident, and now you're working with colleagues to figure out what went wrong and how to resolve it. You start digging into logs and metrics, looking for anything unusual, with only the incident symptoms and your hunch to guide your searches. Incident response is a big and growing pain point for teams managing modern applications -- &lt;a href="https://www.pagerduty.com/blog/unplanned-work-report-global/" rel="noopener noreferrer"&gt;contributing to lost productivity, stress&lt;/a&gt; and a &lt;a href="https://technology.informa.com/572369/businesses-losing-700-billion-a-year-to-it-downtime-says-ihs" rel="noopener noreferrer"&gt;big economic cost overall&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Zebrium's machine learning platform consumes a feed of raw logs and metrics, and automatically learns data structures, event types and normal patterns. It automatically detects anomalies, and uses a virtual tracing technique to identify correlations between anomalies that characterize real incidents. This lets Zebrium automatically detect incidents. But for the moment, let's continue with the scenario described above -- you already have an incident identified based on alerts from some other tool. Well, Zebrium can also consume an external incident signal to inform and correlate its incident detection and root cause. To do this, just enable the Slack app in your workspace, configure the Zebrium data collectors (one helm command in k8s), and type one command in your Slack channel (or get your bot to do it) to ask Zebrium for help.&lt;/p&gt;

&lt;p&gt;Zebrium will pull together an incident report describing the sequence of anomalous events, related anomalous metrics, the services and nodes participating in the incident, and the worst symptom of the incident -- combined, these form a "virtual trace". Typically, the report will give you enough detail and context to identify root cause. And when needed, it also enables near-instant drill-down to examine the surrounding events and metrics. Because the "virtual trace" has already narrowed down the context and time range, it helps you get to resolution much faster than a hunch and blind searching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's how it works
&lt;/h2&gt;

&lt;p&gt;1 - Setup Zebrium data collectors -- takes 2 minutes and a single helm command for k8s environments, or a couple of steps for other environments.&lt;/p&gt;

&lt;p&gt;2 - Install the Slack app for your workspace. You can do this from Zebrium settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2FZebrium%2520Slack%2520App.png%3Fwidth%3D468%26name%3DZebrium%2520Slack%2520App.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FBlogs%2FZebrium%2520Slack%2520App.png%3Fwidth%3D468%26name%3DZebrium%2520Slack%2520App.png" alt="Zebrium Slack App"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3 - Now let's say you set up a Slack channel for your Virtual War Room. You get the team together, and you're looking at data from the APM tool and see a troubling stat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png%3Fwidth%3D931%26name%3DSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png%3Fwidth%3D931%26name%3DSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png" alt="Slack with Zebrium to augment incidents with root cause"&gt;&lt;/a&gt;4 - Type "/zebrium incident analyze (with the option to specify the incident time)" to call on Zebrium for analysis and root cause.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause%25202.png%3Fwidth%3D994%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause%25202.png" alt="PagerDuty with Zebrium to augment incidents with root cause 2"&gt;
&lt;/h3&gt;

&lt;p&gt;5 - Simply click to see the full incident details.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png%3Fwidth%3D908%26name%3DZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png%3Fwidth%3D908%26name%3DZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png" alt="Zebrium Incident showing root cause and correlated metrics anomalies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Let Zebrium take care of MTTR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Using the Zebrium Slack App, you can now leverage Zebrium ML to augment incidents that have been detected by any tool. In doing so, our virtual tracing will automatically identify the impacted services and nodes, and pull in the sequence of anomalous events and metrics that best describe the incident root cause. And you'll get this without all the hunting, scrambling and adrenaline that is normally associated with a war room. &lt;strong&gt;Now that's MTTR!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can learn more or get started for free by visiting &lt;a href="https://www.zebrium.com/" rel="noopener noreferrer"&gt;https://www.zebrium.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the author: &lt;a href="https://www.zebrium.com/blog/author/ajay-singh" rel="noopener noreferrer"&gt;Ajay Singh&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Zebrium + Grafana = Awesome</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Fri, 19 Jun 2020 22:15:19 +0000</pubDate>
      <link>https://forem.com/gdcohen/zebrium-grafana-awesome-1c50</link>
      <guid>https://forem.com/gdcohen/zebrium-grafana-awesome-1c50</guid>
      <description>&lt;p&gt;Lots of people like to construct dashboards in Grafana, for monitoring and alerting -- it's fast, sleek, and practical. Zebrium is awesome for analytics in-part because &lt;a href="https://www.zebrium.com/blog/structure-is-strategic"&gt;we lay everything down into tables&lt;/a&gt; in a scale-out MPP relational column store at ingest. Each event type gets its own table with typed columns for the parameters; metrics data is also tabled; ditto for anomalies and incidents.&lt;/p&gt;

&lt;p&gt;In our next release, we're rolling out support for Grafana, with Zebrium functioning as a data source. The basic architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ITnWHfQk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ITnWHfQk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana.png" alt="Zebrium and Grafana"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zebrium's strong data discipline means it can be used as a rich and practical data source for monitoring. We can easily group / join / analyze data through functions and SQL views, and expose such views to Grafana.&lt;/p&gt;

&lt;p&gt;As an example, we might create a simple view of CPU utilization, rolled up to the minute and normalized, and call it cpu_basic. It might have these columns, among others:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
select ts,host,util from p03_views.cpu_basic limit 10;

         ts          |      host       |  util
---------------------+-----------------+-------
 2020-06-18 09:43:00 | ip-172-31-55-34 |  2.44
 2020-06-18 19:54:00 | ip-172-31-55-34 |  1.83
 2020-06-16 19:33:00 | ip-172-31-62-10 |   1.8
 2020-06-18 02:25:00 | ip-172-31-55-34 |  6.07
 2020-06-17 18:16:00 | ip-172-31-62-10 |  3.65
 2020-06-18 11:18:00 | ip-172-31-62-10 |  7.92
 2020-06-16 20:04:00 | ip-172-31-62-10 |  0.48
 2020-06-17 07:05:00 | ip-172-31-62-10 |  5.04
 2020-06-17 08:54:00 | ip-172-31-55-34 | 17.12
 2020-06-18 13:43:00 | ip-172-31-62-10 |  6.95
(10 rows)

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
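&lt;p&gt;To make the rollup concrete, here is a self-contained sketch of the kind of per-minute aggregation such a view performs, using sqlite3 as a stand-in. Zebrium's actual store is an MPP column store, and the table and column names here are illustrative:&lt;/p&gt;

```python
import sqlite3

# Load a few raw CPU samples and roll them up to the minute,
# mimicking what a view like cpu_basic could compute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu_raw (ts TEXT, host TEXT, util REAL)")
conn.executemany("INSERT INTO cpu_raw VALUES (?,?,?)", [
    ("2020-06-18 09:43:10", "ip-172-31-55-34", 2.0),
    ("2020-06-18 09:43:40", "ip-172-31-55-34", 2.88),
    ("2020-06-18 09:44:05", "ip-172-31-55-34", 6.0),
])
rows = conn.execute("""
    SELECT strftime('%Y-%m-%d %H:%M:00', ts) AS minute,
           host,
           ROUND(AVG(util), 2) AS util
    FROM   cpu_raw
    GROUP BY minute, host
    ORDER BY minute
""").fetchall()
# The two 09:43 samples average to 2.44, matching the shape of the
# cpu_basic output above; the lone 09:44 sample passes through as 6.0.
```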



&lt;p&gt;Similarly, we might create a simple view in Zebrium to monitor error counts, rolled up to the minute, and call it errors_basic. It might have these columns, among others:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select ts,host,errors from p03_views.errors_basic limit 10;

         ts          |      host       | errors
---------------------+-----------------+--------
 2020-06-18 23:19:00 | ip-172-31-55-34 |      7
 2020-06-18 07:01:00 | ip-172-31-55-34 |      7
 2020-06-18 04:21:00 | ip-172-31-55-34 |      7
 2020-06-15 00:36:00 | ip-172-31-62-10 |     14
 2020-06-17 19:49:00 | ip-172-31-62-10 |     15
 2020-06-15 17:47:00 | ip-172-31-55-34 |      7
 2020-06-16 05:53:00 | ip-172-31-62-10 |     15
 2020-06-18 21:46:00 | ip-172-31-62-10 |     15
 2020-06-16 21:31:00 | ip-172-31-62-10 |     14
 2020-06-17 11:03:00 | ip-172-31-55-34 |      7
(10 rows)

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We could define a Grafana variable, host, with the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT host FROM p03_views.cpu

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and in a Grafana panel definition for cpu_basic, we could then use the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT    $__time(ts),host,util
FROM      p03_views.cpu_basic
WHERE     $__timeFilter(ts)
          AND host IN (${host:sqlstring})
ORDER BY  host,ts

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
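&lt;p&gt;For readers new to Grafana's SQL data sources: &lt;code&gt;$__time(col)&lt;/code&gt; typically expands to &lt;code&gt;col AS "time"&lt;/code&gt;, &lt;code&gt;$__timeFilter(col)&lt;/code&gt; to a BETWEEN over the dashboard's time range, and &lt;code&gt;${host:sqlstring}&lt;/code&gt; to a quoted list of the selected variable values -- the exact output varies by plugin and version. A rough simulation of the expansion:&lt;/p&gt;

```python
def expand_grafana_macros(sql, ts_col, t_from, t_to, hosts):
    """Approximate the macro expansion Grafana's SQL data sources
    apply before sending the query to the backend (illustrative only;
    real expansion is done by the data source plugin)."""
    return (sql
            .replace(f"$__time({ts_col})", f'{ts_col} AS "time"')
            .replace(f"$__timeFilter({ts_col})",
                     f"{ts_col} BETWEEN '{t_from}' AND '{t_to}'")
            .replace("${host:sqlstring}",
                     ",".join("'" + h + "'" for h in hosts)))

query = """SELECT $__time(ts),host,util
FROM p03_views.cpu_basic
WHERE $__timeFilter(ts) AND host IN (${host:sqlstring})"""
expanded = expand_grafana_macros(query, "ts",
                                 "2020-06-16 00:00:00",
                                 "2020-06-18 23:59:59",
                                 ["ip-172-31-55-34"])
```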



&lt;p&gt;Like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tpj83Tbc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana%2520chart%25201.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana%2520chart%25201.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tpj83Tbc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana%2520chart%25201.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana%2520chart%25201.png" alt="Zebrium and Grafana chart 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Doing the same with our errors_basic view and placing both panels on a dashboard along with a variable multi-selector, we see the beautiful Grafana magic come to life:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AqyGMneg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana%2520chart%25202.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana%2520chart%25202.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AqyGMneg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana%2520chart%25202.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana%2520chart%25202.png" alt="Zebrium and Grafana chart 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where we can zoom in to look at a particular host at a particular point in time, everything functioning the way it usually would in Grafana:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wrv5LWHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana%2520chart%25203.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana%2520chart%25203.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wrv5LWHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/Zebrium%2520and%2520Grafana%2520chart%25203.png%3Fwidth%3D800%26name%3DZebrium%2520and%2520Grafana%2520chart%25203.png" alt="Zebrium and Grafana chart 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a follow-up blog, we'll take a closer look at some more complex example views, as well as some more advanced analytics.&lt;/p&gt;

&lt;p&gt;Posted with permission of the author: &lt;br&gt;
&lt;a href="https://www.zebrium.com/blog/author/larry-lancaster"&gt;Larry Lancaster&lt;/a&gt; @ Zebrium&lt;/p&gt;

</description>
      <category>devops</category>
    </item>
    <item>
      <title>You've Nailed Incident detection, what about Incident Resolution?</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Fri, 05 Jun 2020 01:35:16 +0000</pubDate>
      <link>https://forem.com/gdcohen/you-ve-nailed-incident-detection-what-about-incident-resolution-1737</link>
      <guid>https://forem.com/gdcohen/you-ve-nailed-incident-detection-what-about-incident-resolution-1737</guid>
      <description>&lt;p&gt;If you're reading this, you probably care a lot about Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). You're also no doubt familiar with monitoring, incident response, war rooms and the like. Who of us hasn't been ripped out of bed or torn away from family or friends at the most inopportune times? I know firsthand from running world-class support and SRE organizations that it all boils down to two simple things: 1) all software systems have bugs and 2) It's all about how you respond. While some customers may be sympathetic to #1, without exception, all of them still expect early detection, acknowledgement of the issue and near-immediate resolution. Oh, and it better not ever happen again!&lt;/p&gt;

&lt;h2&gt;
  
  
  Fast MTTD. Check!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pagerduty.com/" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt; is clearly a leader in Incident Response and on-call Escalation Management. There are over 300 integrations for PagerDuty to analyze digital signals from virtually any software-enabled system to detect and pinpoint issues across your ecosystem. When an Incident is created through one of these monitoring integrations, say an APM tool, PagerDuty will mobilize the right team in seconds (read: this is when you get the "page" during your daughter's 5th birthday party). Fast MTTD. Check!&lt;/p&gt;

&lt;h2&gt;
  
  
  But what about MTTR? By R, I mean Resolve
&lt;/h2&gt;

&lt;p&gt;All great incident response tools have workflow and runbook automation mechanisms that can help &lt;strong&gt;restore&lt;/strong&gt; system operation quickly in some typical cases. But this doesn't get to the root cause, which you need in order to understand and &lt;strong&gt;resolve&lt;/strong&gt; the issue and keep it from happening again. For that, enter the all-too-common "War Room". The term was coined in 1901, but was probably made most famous by Winston Churchill during WWII, when the Cabinet War Room was the epicenter of intelligence gathering, data analysis and communications throughout the war. Back then it was the telegraph, phones, radio signals and maps on the wall. Today it's likely a virtual room using Zoom, Slack, real-time visualizations, cell phones and, most importantly, logs. Millions and millions of logs! But the prevailing attitude of the War Room has remained unchanged - "If you're going through hell, keep going".&lt;/p&gt;

&lt;p&gt;The initial signal that triggered the incident was likely an alert that detected a predefined threshold was "out of tolerance", however simple or complex that threshold may be. Or perhaps it came from an alert you defined in a ping tool, or from some home-grown live tail watching for spikes in error counts or similar patterns. Whatever the means, it was based on predefined rule(s) that detect symptoms of a problem. In most cases, the next step is determining the root cause. For example, the symptom that triggered the signal might have been that latency was too high, but this tells you nothing about the root cause.&lt;/p&gt;

&lt;p&gt;Whether you formalize a designated War Room for a particular incident or not, two things are certain: 1) timely, thorough and accurate communication between team members is paramount and 2) You're likely going to search through logs and drill down on various metrics to get to the root cause. And this &lt;a href="https://www.zebrium.com/blog/is-log-management-still-the-best-approach" rel="noopener noreferrer"&gt;brute force searching through logs&lt;/a&gt; and metrics is probably your hell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop going through Hell
&lt;/h2&gt;

&lt;p&gt;Many monitoring tools have started to use machine learning and anomaly detection techniques to raise that first signal and trigger an incident response, often via tools like PagerDuty. However, these techniques still require too much human input: someone has to handpick which metrics to monitor and choose specific algorithms or tuning parameters. Anomaly detection in monitoring tools is predominantly geared towards time-series data and rarely covers logs. Yet logs are indispensable for root cause analysis. This leaves these tools blind to the root cause of any issue, ultimately requiring time-consuming drill-down and hunting through the logs. Millions and millions of logs!&lt;/p&gt;

&lt;p&gt;By contrast, Zebrium's machine learning detects correlated anomalies and patterns in both logs and metrics and uses them to automatically catch and characterize critical incidents and show you root cause (see &lt;a href="https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-detection-you-actually-wanted" rel="noopener noreferrer"&gt;The Anomaly Detection You Actually Wanted&lt;/a&gt;). This means faster MTTR and no more hunting for root cause! We call this &lt;a href="https://www.zebrium.com/blog/beyond-anomaly-detection-how-incident-recognition-drives-down-mttr" rel="noopener noreferrer"&gt;Incident Recognition&lt;/a&gt; and it's part of our Autonomous Monitoring platform.&lt;/p&gt;

&lt;p&gt;Now, let me show you how you can tie your existing incident management workflow together with automatic root cause identification, regardless of the triggering signal -- for fast incident resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Augmenting detection with Zebrium automatic root cause identification
&lt;/h2&gt;

&lt;p&gt;Zebrium uses unsupervised machine learning to &lt;a href="https://www.zebrium.com/blog/log-metrics-anomaly-detection-as-a-foundation-of-autonomous-monitoring" rel="noopener noreferrer"&gt;automatically detect and correlate anomalies and patterns across both logs and metrics&lt;/a&gt;. These signals form the basis for automated Incident Detection and Root Cause identification.&lt;/p&gt;
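&lt;p&gt;As a toy illustration of the correlation idea (not Zebrium's actual algorithm), anomalies from logs and metrics that cluster in the same time window can be grouped into candidate incidents. The tuple shape below is made up for the sketch:&lt;/p&gt;

```python
from collections import defaultdict

def correlate(anomalies, bucket_secs=60):
    """Toy sketch: group log and metric anomalies that land in the same
    time bucket, the core idea behind correlated incident detection."""
    buckets = defaultdict(list)
    for ts, source, detail in anomalies:  # (epoch seconds, "log"/"metric", text)
        buckets[int(ts // bucket_secs)].append((source, detail))
    # A bucket holding anomalies from more than one source type is a
    # candidate incident worth surfacing
    return [group for group in buckets.values()
            if len({src for src, _ in group}) > 1]
```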

&lt;p&gt;In addition to autonomous monitoring, we can also consume external signals to inform our Incident Detection. Imagine one of the monitoring tools in your PagerDuty ecosystem has created an Incident and you get "The Call" (at one of those inopportune moments). What happens? You probably already know the pain that lies ahead. But what if, instead, you opened the PagerDuty Incident and a full report of the anomalous logs and metrics surrounding the incident -- including the root cause -- was already there, at your fingertips?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Part 1 - If you use PagerDuty&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Your monitoring tool raises an alarm&lt;/li&gt;
&lt;li&gt; Through an existing integration with PagerDuty, an incident is created and the escalation policy fires (the war room is now open)&lt;/li&gt;
&lt;li&gt; At that same instant, PagerDuty automatically calls an outbound webhook to Zebrium with all the incident details it has.&lt;/li&gt;
&lt;li&gt; Zebrium correlates those incident details with its Autonomous Incident Detection and Root Cause by looking across logs and metrics&lt;/li&gt;
&lt;li&gt; The PagerDuty incident is updated with Zebrium Incident details and likely root cause via the PagerDuty API&lt;/li&gt;
&lt;li&gt; If you need to drill down further, it's just one click from your PagerDuty Incident.&lt;/li&gt;
&lt;/ol&gt;
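&lt;p&gt;To make steps 3-5 concrete, here is a minimal sketch of the webhook round trip in Python. The payload shape assumes PagerDuty's v2 webhook format, and the token and email are placeholders; treat this as illustrative rather than a drop-in integration:&lt;/p&gt;

```python
import json
import urllib.request

PD_API = "https://api.pagerduty.com"
PD_TOKEN = "YOUR_API_TOKEN"   # placeholder REST API key
PD_FROM = "ops@example.com"   # PagerDuty requires a From header when adding notes

def extract_incident_id(webhook_body):
    """Pull the incident id out of a PagerDuty v2 webhook payload
    (assumed shape: a "messages" list, each entry carrying an incident)."""
    return webhook_body["messages"][0]["incident"]["id"]

def build_note(root_cause_summary):
    """Request body for POST /incidents/{id}/notes on the PagerDuty REST API."""
    return {"note": {"content": "Zebrium root cause report:\n" + root_cause_summary}}

def post_note(incident_id, root_cause_summary):
    """Attach the root cause report to the incident (network call)."""
    req = urllib.request.Request(
        PD_API + "/incidents/" + incident_id + "/notes",
        data=json.dumps(build_note(root_cause_summary)).encode(),
        headers={
            "Authorization": "Token token=" + PD_TOKEN,
            "Content-Type": "application/json",
            "From": PD_FROM,
        },
        method="POST",
    )
    return urllib.request.urlopen(req)
```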

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png%3Fwidth%3D818%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png%3Fwidth%3D818%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause-1.png" alt="PagerDuty with Zebrium to augment incidents with root cause"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Part 2 - If you use Slack as your incident management workspace&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Zebrium's Autonomous Monitoring platform can consume any external signal to inform and correlate incident detection and root cause. Let's say, for example, you're working with colleagues in a Slack channel that has Zebrium's integration enabled in the workspace. Just type a command (or get your bot to do it) to ask Zebrium for help, and we'll pull together all the relevant anomalies, logs and metrics and provide near-instant drill-down capabilities to get you to resolution fast! All from Slack.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; You set up a Slack channel for your Virtual War Room. You get the team together, and while looking at data from the APM tool you see a troubling stat&lt;/li&gt;
&lt;li&gt; You type "/&lt;em&gt;zebrium incident analyze&lt;/em&gt;" to call on Zebrium for analysis and root cause&lt;/li&gt;
&lt;li&gt; Here it is... no more pain!&lt;/li&gt;
&lt;/ol&gt;
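&lt;p&gt;Under the hood, a slash command is just an HTTP POST from Slack with form-encoded fields, answered with a small JSON payload. A hedged sketch of the receiving side (the command arguments and the analysis helper are hypothetical; the "response_type"/"text" reply shape is Slack's standard slash-command format):&lt;/p&gt;

```python
def handle_slash_command(form):
    """Handle a Slack slash-command POST. Slack sends form-encoded fields
    such as "command", "text", "channel_id" and "response_url"."""
    args = form.get("text", "").split()
    if args[:2] == ["incident", "analyze"]:
        # Hypothetical helper standing in for a call to the Zebrium API
        report = run_zebrium_analysis(form.get("channel_id"))
        return {"response_type": "in_channel",  # visible to the whole channel
                "text": "Zebrium analysis:\n" + report}
    return {"response_type": "ephemeral",       # only the caller sees this
            "text": "Usage: /zebrium incident analyze"}

def run_zebrium_analysis(channel_id):
    # Placeholder: a real integration would fetch the correlated anomaly
    # and root cause report for the incident discussed in this channel
    return "correlated anomaly report for channel " + str(channel_id)
```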

&lt;h3&gt;
  
  
  &lt;strong&gt;1 - The War Room&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png%3Fwidth%3D931%26name%3DSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png%3Fwidth%3D931%26name%3DSlack%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause.png" alt="Slack with Zebrium to augment incidents with root cause"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2 - Zebrium slash command in action&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause%25202.png%3Fwidth%3D994%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause%25202.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause%25202.png%3Fwidth%3D994%26name%3DPagerDuty%2520with%2520Zebrium%2520to%2520augment%2520incidents%2520with%2520root%2520cause%25202.png" alt="PagerDuty with Zebrium to augment incidents with root cause 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3 - Zebrium Incident UI showing root cause and correlated metrics anomalies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png%3Fwidth%3D908%26name%3DZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2FZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png%3Fwidth%3D908%26name%3DZebrium%2520Incident%2520-%2520root%2520cause%2520and%2520correlated%2520metrics%2520anomalies.png" alt="Zebrium Incident UI showing root cause and correlated metrics anomalies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Zebrium take care of MTTR
&lt;/h2&gt;

&lt;p&gt;Using the above integrations, you can now use Zebrium to augment existing incidents that have been detected by any tool. In doing so, your incident is automatically enriched with details of the root cause, without all the hunting, scrambling and adrenaline normally associated with a war room.&lt;/p&gt;

&lt;p&gt;You can get started for free by visiting &lt;a href="https://www.zebrium.com/" rel="noopener noreferrer"&gt;https://www.zebrium.com&lt;/a&gt;. Further details of PagerDuty and Slack incident augmentation will be coming soon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now that's MTTR!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Posted with permission of the original author &lt;a href="https://www.zebrium.com/blog/author/rod-bagg" rel="noopener noreferrer"&gt;Rod Bagg @ Zebrium&lt;/a&gt; &lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Is Log Management Still the Best Approach?</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Tue, 02 Jun 2020 22:08:25 +0000</pubDate>
      <link>https://forem.com/gdcohen/is-log-management-still-the-best-approach-3keh</link>
      <guid>https://forem.com/gdcohen/is-log-management-still-the-best-approach-3keh</guid>
      <description>&lt;p&gt;Disclosure -- I work for &lt;a href="https://www.zebrium.com/" rel="noopener noreferrer"&gt;Zebrium&lt;/a&gt;. Part of our product does what most log managers do: aggregates logs, makes them searchable, allows filtering, provides easy navigation and lets you build alert rules. So why write this blog? Because in today's cloud native world (microservices, Kubernetes, distributed apps, rapid deployment, testing in production, etc.) while useful, log managers can be a time sink when it comes to detecting and tracking down the root cause of software incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick history
&lt;/h2&gt;

&lt;p&gt;When troubleshooting software problems, engineers have forever hunted through logs to find root cause.&lt;/p&gt;

&lt;p&gt;This used to be done manually by writing Perl scripts, or using native tools like vi, grep, sed and awk. That changed in the early 2000s, when log managers appeared on the scene, making the process smoother, faster and more scalable. Splunk, the first commercial product to arrive, was aptly termed "Google for log files". &lt;/p&gt;

&lt;p&gt;Since then, log management tools have proliferated, with a mix of open source tools like the Elastic Stack (often called the ELK Stack, short for Elasticsearch, Logstash and Kibana) and commercial products like Sumo Logic. Today they vie for leadership on dimensions such as cost, scalability and speed of search, but they are all still based on the fundamental idea of &lt;em&gt;making logs easier to search&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is &lt;strong&gt;&lt;em&gt;search&lt;/em&gt;&lt;/strong&gt; still the right paradigm?
&lt;/h2&gt;

&lt;p&gt;Troubleshooting and finding root cause require patience, skill, experience and intuition. With enough time, a skilled operator can usually determine root cause. The process typically starts with a search for keywords like error, critical, abort, panic, etc. And then continues with a journey of iterative searches that narrow things down to root cause.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"I don't know what I'm looking for, but I'll know when I find it!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But some problems don't show themselves through obvious error-severity log events, and some are elusive because they're intermittent and hard to characterize. Where do you even start when all you know is that a user said, "my screen just froze"?&lt;/p&gt;

&lt;h2&gt;
  
  
  Complexity and scale make things worse
&lt;/h2&gt;

&lt;p&gt;When Splunk was founded in 2003, things were a lot simpler: Apps were monolithic, mostly deployed on-prem and log volumes were relatively small (fewer log files with fewer lines per file). When something went wrong, it would typically impact one customer at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2Fcomplexity%2520impacts%2520mttr.png%3Fwidth%3D545%26name%3Dcomplexity%2520impacts%2520mttr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2Fcomplexity%2520impacts%2520mttr.png%3Fwidth%3D545%26name%3Dcomplexity%2520impacts%2520mttr.png" alt="complexity impacts mttr"&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But with today's distributed SaaS applications and extensive use of microservices, things are very different. It's not uncommon to see apps that have hundreds of microservices and produce billions of log lines a day. The process of iteratively searching through so many logs across so many services, with a vast and ever-growing set of failure modes can be very daunting, especially while multiple customers are waiting for an issue to be resolved. How do you know what to search for and how do you sift through the huge volumes of data that your searches might return?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2Fhumans%2520do%2520not%2520scale%2520with%2520complexity.png%3Fwidth%3D489%26name%3Dhumans%2520do%2520not%2520scale%2520with%2520complexity.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2Fhumans%2520do%2520not%2520scale%2520with%2520complexity.png%3Fwidth%3D489%26name%3Dhumans%2520do%2520not%2520scale%2520with%2520complexity.png" alt="humans do not scale with complexity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, human-driven troubleshooting doesn't scale with complexity, and this is hurting Mean-Time-To-Resolution (MTTR), which today is measured in hours and days, driven by the long tail of obscure and previously unknown issues that keep cropping up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parsing to get at the payload
&lt;/h2&gt;

&lt;p&gt;Most logs are messy and unstructured (more accurately, there is usually clear structure for each individual type of log event, but within a log there might be thousands of different event types each with a different structure).&lt;/p&gt;

&lt;p&gt;This is even true for logs that conform to standards like syslog &lt;a href="https://tools.ietf.org/html/rfc5424" rel="noopener noreferrer"&gt;RFC 5424&lt;/a&gt;. The RFC specifies that a log line should have structured components like a timestamp, severity, process ID, etc. But what is often the most important part of a log line -- its payload -- has no structure at all (from the RFC, "The MSG part contains a free-form message that provides information about the event"). Even the recent trend of structuring logs as JSON or XML is often plagued by the same issue: buried in a cleanly structured blob sits the same old payload, in a single unstructured field!&lt;/p&gt;
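&lt;p&gt;A quick illustration of the point: even when a log record parses cleanly as JSON, the interesting part is usually still one free-form string. The record below is made up for illustration:&lt;/p&gt;

```python
import json

# A "structured" JSON log record: the envelope parses cleanly, but the
# payload is still a single free-form string that needs its own parsing
record = json.loads(
    '{"ts": "2020-05-27T11:56:11Z", "severity": "error",'
    ' "proc_id": 24112, "message": "Service instance X completed with status timeout"}'
)

envelope = {k: v for k, v in record.items() if k != "message"}  # cleanly structured
payload = record["message"]  # the same old unstructured payload
```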

&lt;h3&gt;
  
  
  &lt;strong&gt;Enter the parser...&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's say it's really important to get at specific information inside log lines to troubleshoot a problem. For example, you want to compare the number of times status "timeout" occurred across all the different instances of a service, for an event like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
May 27 11:56:11 host-47 myservice[24112]: Service instance X completed with status Y

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would need to parse out fields X and Y. In the past, you would have written a script and crafted &lt;a href="https://en.wikipedia.org/wiki/Regular_expression" rel="noopener noreferrer"&gt;regular expressions&lt;/a&gt; (regex) using grep, sed or awk. Today, each log management tool provides its own mechanism to parse out specific fields -- for example, the open source Logstash &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html" rel="noopener noreferrer"&gt;grok filter plugin&lt;/a&gt;. Unfortunately, it takes a lot of time and effort to build and maintain these fragile parsing expressions, especially as log events change across different software versions.&lt;/p&gt;
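&lt;p&gt;As a rough sketch of what those parsing expressions buy you, here is the same extraction done directly in Python. The regex mirrors the example event above; a grok filter would express an equivalent pattern declaratively:&lt;/p&gt;

```python
import re
from collections import Counter

# Pattern for the example event shown above; the groups capture the
# instance (X) and status (Y) out of the free-form payload
PAT = re.compile(r"Service instance (\S+) completed with status (\S+)")

def count_timeouts(lines):
    """Count how many times status "timeout" occurred, per service instance."""
    counts = Counter()
    for line in lines:
        m = PAT.search(line)
        if m and m.group(2) == "timeout":
            counts[m.group(1)] += 1
    return counts
```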

&lt;h2&gt;
  
  
  The quest for automation: do log managers help?
&lt;/h2&gt;

&lt;p&gt;Automation is what sets apart the best DevOps teams. Google dedicates an entire chapter to automation in their famous &lt;a href="https://landing.google.com/sre/sre-book/chapters/automation-at-google/" rel="noopener noreferrer"&gt;SRE book&lt;/a&gt;. One of the headings in that chapter famously reads:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2Fautomate%2520yourself%2520out%2520of%2520a%2520job-1.png%3Fwidth%3D785%26name%3Dautomate%2520yourself%2520out%2520of%2520a%2520job-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.zebrium.com%2Fhs-fs%2Fhubfs%2FAssets%2Fblog%2520images%2Fautomate%2520yourself%2520out%2520of%2520a%2520job-1.png%3Fwidth%3D785%26name%3Dautomate%2520yourself%2520out%2520of%2520a%2520job-1.png" alt="automate yourself out of a job-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log managers are useful for automating the detection and root cause identification of some software incidents. This is achieved by building &lt;strong&gt;alert rules&lt;/strong&gt; that monitor logs for particular events, values within events and/or sequences of log events. As long as you know what the events/values are that characterize the problem, this approach can work well.&lt;/p&gt;
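&lt;p&gt;A typical rule of this kind boils down to counting matching events inside a sliding time window. A minimal sketch of the mechanism (illustrative only; each log manager has its own rule syntax):&lt;/p&gt;

```python
import time
from collections import deque

class ThresholdAlertRule:
    """Fire when more than `limit` matching log events arrive within
    `window` seconds -- the shape of a classic log-manager alert rule."""

    def __init__(self, match, limit, window):
        self.match, self.limit, self.window = match, limit, window
        self.hits = deque()  # timestamps of recent matching events

    def observe(self, line, now=None):
        """Feed one log line; returns True when the rule should alert."""
        now = time.time() if now is None else now
        if self.match in line:
            self.hits.append(now)
        # Drop events that have aged out of the window
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) > self.limit
```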

&lt;h3&gt;
  
  
  &lt;strong&gt;You can't automate the unknown unknowns&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Unfortunately, you can only build alert rules for symptoms or causes that you understand and that can be well defined through log events. This leaves a big gap for incidents where you don't know the cause, the symptoms or both. In these cases, the troubleshooting process relies entirely on the skill of the operator and the searches and techniques used to uncover root cause -- which (as discussed above) is hard to scale as complexity increases. For example, you might be able to build an alert that tells you when latency is too high (a known symptom), but you would still need to manually search for one of the thousands of possible root causes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is there a better way?
&lt;/h2&gt;

&lt;p&gt;The idea of building a tool that aggregates and makes your logs searchable was revolutionary and log managers have proven to be useful and better than the alternative of writing scripts and building pipelines to collect and manage logs. But what worked well almost two decades ago ("Google for log files") hasn't kept up with the complexity of today. The manual process of searching for an obscure, previously unknown failure among hundreds of log streams can be extremely costly.&lt;/p&gt;

&lt;p&gt;We believe that a fundamentally different approach is needed -- one that is based on machine learning rather than human-driven search. It might sound far-fetched, but despite the complexity of today's apps, software still breaks in fundamental ways that are visible as pattern changes in logs. In the same way that skilled DevOps engineers are adept at finding these patterns, machine learning models can be trained to do the same thing. And machine learning, unlike humans, can scale with complexity. If you're interested in learning more, please read our blog: &lt;a href="https://www.zebrium.com/blog/the-future-of-monitoring-is-autonomous?utm_source=blog&amp;amp;utm_content=pwlm" rel="noopener noreferrer"&gt;The Future of Monitoring is Autonomous&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Written by: &lt;a href="https://www.zebrium.com/blog/author/gavin-cohen" rel="noopener noreferrer"&gt;Gavin Cohen&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Beyond Anomaly Detection: How Incident Recognition Drives down MTTR</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Wed, 20 May 2020 17:40:46 +0000</pubDate>
      <link>https://forem.com/gdcohen/beyond-anomaly-detection-how-incident-recognition-drives-down-mttr-479n</link>
      <guid>https://forem.com/gdcohen/beyond-anomaly-detection-how-incident-recognition-drives-down-mttr-479n</guid>
      <description>&lt;h2&gt;
  
  
  The State of Monitoring
&lt;/h2&gt;

&lt;p&gt;Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and spotting outliers in dashboards. While this traditional approach can typically catch failure modes with obvious service impacting symptoms, it has two limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  There is a long tail of problems that are not service impacting in the aggregate, but impact some aspect of the user experience (e.g. software bugs that give a user an unexpected error, or incorrect results). Traditional &lt;a href="https://www.zebrium.com/blog/using-machine-learning-to-shine-a-light-inside-the-monitoring-black-box"&gt;(black box)&lt;/a&gt; monitoring approaches do not attempt to catch all of these due to the complexity of the task (setting up and maintaining rules for each unique problem).&lt;/li&gt;
&lt;li&gt;  The root cause is not identified -- many failure modes have similar symptoms, so it takes time-consuming investigation to identify the root cause. And shrinking time to resolution is one of the biggest challenges of managing a modern service -- high Mean-Time-To-Resolution (MTTR) hurts both customer loyalty and team productivity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TXkFzEai--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/mttr%2520vs%2520service%2520complexity.png%3Fwidth%3D300%26name%3Dmttr%2520vs%2520service%2520complexity.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TXkFzEai--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/mttr%2520vs%2520service%2520complexity.png%3Fwidth%3D300%26name%3Dmttr%2520vs%2520service%2520complexity.png" alt="mttr vs service complexity"&gt;&lt;/a&gt;In either case, the burden is on the human -- to spot outliers in dashboards, drill down progressively, and inevitably - know what to search for in logs.&lt;/p&gt;

&lt;p&gt;However, this doesn't scale as application complexity grows and failure modes multiply. There are simply too many unknown failure modes, too much data to scan, and too many variants of log messages to search for. Some new entrants to the space focus on improving the speed, cost or scalability of search. While welcome, this does nothing to address the bottleneck -- which is the human brain's ability to know what to search for in a given situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Machine Learning and Anomaly Detection To the Rescue -- Sort Of&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many tools have started to offer add-on machine learning features to augment human effort. While a welcome addition, it still leaves too much work for the human. Users are required to choose specific time series to track, choose the right anomaly detection technique for each one, and then be presented with visualizations of outliers to further analyze. This can be useful in &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hIuW-vQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/machine%2520learning%2520anomaly%2520detection.png%3Fwidth%3D300%26name%3Dmachine%2520learning%2520anomaly%2520detection.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hIuW-vQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/machine%2520learning%2520anomaly%2520detection.png%3Fwidth%3D300%26name%3Dmachine%2520learning%2520anomaly%2520detection.png" alt="machine learning anomaly detection"&gt;&lt;/a&gt;scenarios where a user knows exactly what types of deviations they want to analyze (and for which metrics). But it is not useful for catching the wide range of unknown unknowns that might impact the application. And traditional tools only offer this toolset for time-series data (e.g. metrics) -- it is rare to see a &lt;a href="https://www.zebrium.com/blog/log-anomaly-detection-as-a-foundation-of-autonomous-monitoring"&gt;meaningful anomaly detection solution for logs&lt;/a&gt;, let alone the ability to correlate log and metric anomalies. This makes these tools blind to the root cause of any issue, and requires further time-consuming drill-down such as log searches.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Evolution of Machine Learning -- Autonomous Incident Recognition&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;By contrast, Zebrium took a very different approach, building machine learning (ML) based anomaly detection as its foundation. Our software automatically detects anomalies in ALL log events and ALL metrics. We do all the work for the user -- there's no need to handpick which log events or metrics to track, nor to choose algorithms, corrections and other controls. We automatically apply the best technique for each type of data and adapt the behavior as the application changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--12HOgISB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/the%2520observability%2520hierarchy.png%3Fwidth%3D1319%26name%3Dthe%2520observability%2520hierarchy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--12HOgISB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/the%2520observability%2520hierarchy.png%3Fwidth%3D1319%26name%3Dthe%2520observability%2520hierarchy.png" alt="the observability hierarchy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this is not the special part. What's unique about this approach is that it automatically &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rniBRqTS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/incident%2520recognition%2520with%2520root%2520cause.png%3Fwidth%3D333%26name%3Dincident%2520recognition%2520with%2520root%2520cause.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rniBRqTS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/incident%2520recognition%2520with%2520root%2520cause.png%3Fwidth%3D333%26name%3Dincident%2520recognition%2520with%2520root%2520cause.png" alt="incident recognition with root cause"&gt;&lt;/a&gt;identifies correlated patterns of anomalies that define service incidents. It doesn't just show pretty charts and make the user figure out how this might correlate to other issues in the application. Instead it automatically groups together all related log and metric anomalies, identifies the possible root cause, the hotspots (nodes, container/log types etc.), and creates a fully defined incident summary for the user. You can see some real life examples &lt;a href="https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-detection-you-actually-wanted"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This has two huge benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It is capable of detecting the long tail of incidents that don't necessarily trigger the symptom alerts at an aggregate service level. This includes issues such as software bugs, latent infrastructure degradation, problems in inter-service interaction, database issues, container orchestration issues and even security issues.&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UAWvG3tP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/zebrium%2520improves%2520mttr%2520vs%2520service%2520complexity.png%3Fwidth%3D300%26name%3Dzebrium%2520improves%2520mttr%2520vs%2520service%2520complexity.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UAWvG3tP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/zebrium%2520improves%2520mttr%2520vs%2520service%2520complexity.png%3Fwidth%3D300%26name%3Dzebrium%2520improves%2520mttr%2520vs%2520service%2520complexity.png" alt="zebrium improves mttr vs service complexity"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It doesn't just detect an incident: by automatically creating a summary of all the anomalous events and metric patterns surrounding the incident, it slashes time to root cause and resolution. And it does this not by correlating meta-data, notes or tags from a library of previously known incidents; rather, it detects brand-new incidents with a high signal-to-noise ratio. This includes all the unknown unknowns that crop up regularly in a modern cloud native application, without peppering users with lots of useless false positives.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MaOkw1m7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/Examples%2520of%2520incidents%2520caught%2520with%2520Zebrium%2520Autonomous%2520Monitoring.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MaOkw1m7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/Blog%2520-%2520beyond%2520anomaly%2520detection/Examples%2520of%2520incidents%2520caught%2520with%2520Zebrium%2520Autonomous%2520Monitoring.png" alt="Examples of incidents caught with Zebrium Autonomous Monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why we believe Autonomous Monitoring, which includes autonomous Incident Recognition, is the future of monitoring. You can try it for yourself. &lt;a href="https://www.zebrium.com/sign-up"&gt;Getting started is free and takes less than two minutes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the author &lt;a href="https://www.zebrium.com/blog/author/ajay-singh"&gt;Ajay Singh&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Busting the Browser's Cache</title>
      <dc:creator>gdcohen</dc:creator>
      <pubDate>Tue, 12 May 2020 21:56:42 +0000</pubDate>
      <link>https://forem.com/gdcohen/busting-the-browser-s-cache-4h3c</link>
      <guid>https://forem.com/gdcohen/busting-the-browser-s-cache-4h3c</guid>
      <description>&lt;p&gt;A new release of your web service has just rolled out with some awesome new features and countless bug fixes. A few days later and you get a call: Why am I not seeing my what-ch-ma-call-it on my thing-a-ma-gig? After setting up that zoom call it is clear that the browser has cached old code, so you ask the person to hard reload the page with Ctrl-F5. Unless its a Mac in which case you need Command-Shift-R. And with IE you have to click on Refresh with Shift. You need to do this on the other page as well. Meet the browser cache, the bane of web service developers!&lt;/p&gt;

&lt;p&gt;In this blog we share how we struggled and finally busted the browser cache for new releases of the Zebrium web service, including design and implementation details. Buckle up, it's a bumpy ride!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Didn't Work?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At Zebrium we build our front-end using React. We find React to be extremely flexible, making it easy to write and maintain a variety of components from simple deployment drop-down menus to complex log and metric visualizations, all with a distinctive Zebrium dark-mode style.&lt;/p&gt;

&lt;p&gt;Our build-test-deploy strategy is based on the create-react-app framework. Like React itself, that framework has served us well, but, like many who adopted it in the last few years, we suffered from one pretty big gotcha: aggressive browser caching of application resources. So aggressive, in fact, that our users were missing out on key feature updates and bug fixes because the UI code in their browser cache was outdated. For a start-up that needs to iterate quickly on customer feedback, this was a real pain point.&lt;/p&gt;

&lt;p&gt;Our customer-service team identified the issue first, and the pattern of the problem was elusive. Many users would see the upgrades automatically, but some would not. Zebrium has always been lucky to have dedicated and enthusiastic users who understand our value proposition, never more so than at moments like this. So, while we worked through the issue, customer-service helped affected users clear their caches manually whenever we deployed a new version. But this was painful for us and for our customers.&lt;/p&gt;

&lt;p&gt;Before the UI team understood the root of the problem, we stepped through the usual remedies. We had our web server deliver headers with ever stricter cache-control settings. We reduced max-age from weeks to days and so on.  That wasn't ideal because theoretically it meant users would be pulling down code versions their browser had already cached. We were surprised to see that approach did not solve the problem either. And we even threw pragma: no-cache at it, a Hail-Mary that unfortunately had no effect. &lt;/p&gt;
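&lt;p&gt;For context, the progression of headers we experimented with looked roughly like this (the values here are illustrative, not our exact configuration):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache-Control: max-age=604800                    (weeks)
Cache-Control: max-age=86400, must-revalidate    (days, stricter)
Cache-Control: no-cache
Pragma: no-cache                                 (the Hail-Mary)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;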

&lt;p&gt;So, we began our investigation into create-react-app to discover why these tried-and-true HTTP client/server mechanisms were failing. After a lot of work, we finally isolated the issue to this: our version of create-react-app employed a service worker to cache content. That explained why some users encountered the problem while others did not. Users who were in the habit of closing their browser often did not see the problem. Users who kept their browser up for days and kept our app open in one or more tabs never saw our updates because the service worker was holding on to an old version of our UI code in cache. Here's a good discussion on create-react-app's Github page that lays out the issue and possible solutions ( &lt;a href="https://github.com/facebook/create-react-app/issues/5316"&gt;https://github.com/facebook/create-react-app/issues/5316&lt;/a&gt; ).  At the time of our investigation, we weren't in a position to take and test a new version of the create-react-app framework or to test some of the workarounds mentioned in that discussion. So, we decided to go old school, exposing versioning in our app path. It has worked very well.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary of what we did&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In every UI build, we set the software version as a custom environment variable in the .env file, prefixed with REACT_APP_. We can then access the currently running version by referencing process.env.REACT_APP_MY_SOFTWARE_VERSION. The software version the browser is running is also embedded in the URL and persisted across all UI route paths. &lt;/p&gt;
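&lt;p&gt;Concretely, the build script writes something like the following into .env before the production build (the version value here is illustrative):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .env -- generated by the build script
REACT_APP_MY_SOFTWARE_VERSION=20200506071506
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;create-react-app inlines any REACT_APP_-prefixed variable at build time, so process.env.REACT_APP_MY_SOFTWARE_VERSION resolves to this value in the shipped bundle.&lt;/p&gt;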

&lt;p&gt;Whenever an API call is invoked from any page, it returns the software version currently running on the server. If the server and UI are in sync, the versions will match and there is no more work to be done. If, however, the version returned by the API differs from process.env.REACT_APP_MY_SOFTWARE_VERSION, we display a popup dialog saying a newer version has been detected, with a button the user can click to reload the page with content from the new software version. The newly loaded software version is then reflected in the URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n2wsp5oQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/New%2520UI%2520version%2520detected.png%3Fwidth%3D514%26name%3DNew%2520UI%2520version%2520detected.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n2wsp5oQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.zebrium.com/hs-fs/hubfs/Assets/blog%2520images/New%2520UI%2520version%2520detected.png%3Fwidth%3D514%26name%3DNew%2520UI%2520version%2520detected.png" alt="New UI version detected"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's run through this in more detail...&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Routing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once we decided to take the version in the URL approach, everything was simple, right? Sort of. Our web pages are served from the same Go application that serves the API. We had the build script generate a bit of Go code to compile the release version into the binary and altered the routing to put the release version into the path for serving the static content of the UI. This handler function takes a http.FileSystem that is initialized to the root UI directory and a string with the release version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func FileServerNotFoundRedirect(fs http.FileSystem, redirect string) http.Handler {
    fsh := http.FileServer(fs)
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if strings.HasPrefix(r.URL.Path, redirect) {
            r.URL.Path = r.URL.Path[len(redirect):]
            fd, err := fs.Open(path.Clean(r.URL.Path))
            if os.IsNotExist(err) {
                r.URL.Path = "/"
            }
            if err == nil {
                fd.Close()
            }
            fsh.ServeHTTP(w, r)
        } else {
            uri := r.RequestURI
            comps := strings.Split(uri, "/")
            if len(comps) &amp;gt; 1 {
                uri = uri[1+len(comps[1]):]
            }
            RedirectHTTPS(w, r, redirect+uri)
        }
    })
}

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The first condition of the IF statement is fairly straightforward: when the path starts with the current release name, remove it and serve the request. When the requested file is not found, we serve up the root (index.html), which is required for routing within the UI. But what if the request comes in with an old release number? In that case we compose a new URL, replacing the old version with the new one, and redirect the browser to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
func RedirectHTTPS(w http.ResponseWriter, r *http.Request, redirect string) {
    url := fmt.Sprintf("%s://%s:%s%s",
        os.Getenv("ZWSD_PROTOCOL"),
        strings.Split(os.Getenv("ZWSD_DOMAINS"), ",")[0],
        os.Getenv("ZWSD_ORIGIN_PORT"),
        redirect)
    http.Redirect(w, r, url, http.StatusMovedPermanently)
}

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It is important to note that we need the browser's full view of the URL, beginning with the protocol (HTTP or HTTPS) and the endpoint it is connecting to -- the same server name that terminates the HTTPS connection, which might be a proxy or load-balancer. Then we use the built-in "http" library to form a redirect response. This gets the new version into the browser's URL.&lt;/p&gt;

&lt;p&gt;The last bit of work in the Go server was to return the version string on almost every API request. We had already decided to encapsulate every response, so adding the version just meant adding a new tag at the top level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
    "data": [ array of data returned from the API ],
    "error": {
        "code": 200,
        "message": ""
    },
    "op": "read",
    "softwareRelease": "20200506071506"
}

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Well, that's it! It was a long journey for us, but since making this change, we haven't been bitten by the browser cache again. And, as further proof that it's been working well, we've been delighted by how many more of our customers have started commenting on the great new what-ch-ma-call-it on my thing-a-ma-gig features we've been releasing 😀 We only wish we had done this sooner.&lt;/p&gt;

&lt;p&gt;If you want to see it in action -- take our product for a free test run by visiting &lt;a href="https://www.zebrium.com/"&gt;www.zebrium.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Posted with permission of the authors:&lt;br&gt;
&lt;a href="https://www.zebrium.com/blog/author/alan-jones-roy-selig-and-ko-wang"&gt;Alan Jones, Roy Selig and Ko Wang @ Zebrium&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>uiweekly</category>
      <category>react</category>
      <category>reactnative</category>
    </item>
  </channel>
</rss>
