Forem: OverOps

The Slack Outage is a Wake Up Call For All IT Orgs

OverOps — Wed, 07 Aug 2019 15:35:54 +0000

Slack took quite a beating during and after the service outage that happened last Monday morning. Here’s a small sample of the headlines (before they were updated) that come up on a simple news search of Slack:

Slack is Experiencing Worldwide Outage, Degraded Performance
Yes, Slack is Down.
Happy Monday, Slack is Down
How Microsoft Teams May Have Caused Today’s Slack Outage

The last one stands out from the others - it was the only one that I actually clicked on. What do they mean Microsoft caused the outage? What’s the connection? Upon opening the article, I realized they weren’t talking about Microsoft teams, they were talking about Microsoft Teams.

Here’s the timeline they lay out to explain what they’re talking about:

July 11 - Microsoft announces they hit 13 million active daily users in June (Slack reported 10 million users in January)
July 22 - Slack relaunches their desktop app to load faster and use less memory
July 29 - Slack goes down

Their theory, then, is that at some point Slack started to work on a major upgrade to their backend system. Feeling the pressure after Microsoft’s announcement, Slack pushed the new update to production before it was ready and all hell broke loose. Seems to make sense more or less.

Of course, engineers have probably been working on this update for several months at least, and it’s hard to say whether that announcement would be enough to interrupt the product roadmap. Still, it’s definitely possible that engineering managers started applying additional pressure on teams to build, test and deploy faster than originally expected. It’s hard to say.

One thing that is clear is that this outage was caused by changes made to the application's code. Similar outages reported a few days prior, on the 26th, were determined to be a result of changes made to the code the night before. The engineering team rolled back those changes and started to deploy intermittent fixes.

The latest outage on Monday morning follows the same story. The engineering team reported that on July 29th, they “made a change that inadvertently caused some performance issues, including messages failing to send.” Roughly an hour after the service went down, Slack announced that users once again had the “all clear” to use their Slack channels without issue.

What caused the recent “Slack-out”?

In the space of a week, Slack’s engineering team deployed a massive update to their entire desktop application plus at least 2 other (presumably) smaller code changes. And they aren’t the only ones pushing code at a breakneck pace.

Over the last half decade or so, the average release cycle in enterprise IT organizations has shrunk from around 12 months to just 3 weeks. Many organizations, like Slack, are deploying new code to production weekly or even daily in an effort to out-innovate competitors and to please customers that are jonesing for new features.

Unfortunately, in many cases the increased pressure and the overall sentiment that “fast isn’t fast enough” comes at the expense of code quality and reliability. New automated testing frameworks and additional tooling have helped to limit the impact, but there’s still no way to account for every possible scenario.

How does this impact the company?

The most obvious way that application failures affect the business is through negative customer experiences. Unlike a less critical error that affects only a handful of users, an application outage has the potential to unite the public against the company, as evidenced by this headline from CNN regarding the Slack outage:
Breaking: Slack Is Down, Twitter Goes Berserk

#SlackOutage, #SlackDown and others were trending last Monday, similar to recent outages from Facebook, Google and Twitter ironically (#TwitterDown was trending once the system was back up). Not only does this help to sway public opinion, it can begin to form a sort of herd mentality against the company. Just think about public opinion of major US airlines… We won’t name names.

In the short term, negative customer experiences on such a large scale hurt the brand’s reputation. In the long term, the brand tarnishment that comes from such events hurts the company’s bottom line.

Poor customer experience isn’t the only way these issues impact a company’s bottom line. Debugging and troubleshooting time means shifting developer and operations resources away from product innovation (which was our original goal…). Contractual SLAs may be breached, which could lead to additional financial repercussions.

Plus, errors in general can contribute to higher log ingestion and storage costs and infrastructure overhead. Here’s a calculator you can use to find out how much your error volume is costing your company each year: https://calculator.overops.com/

How can you stop this from happening to you?

Companies like Slack are facing a paradoxical need to build and deploy faster than before while simultaneously improving the quality and reliability of their applications. With less time to write and test the code, improving--or even maintaining--code quality is no easy task.

To succeed, it’s important to track metrics that signal risk to the application and to create automated quality gates based on those metrics. To ensure a new release won’t impact customers once deployed to production, here are 4 crucial metrics to track:

New errors
Increasing errors
Resurfaced errors
Slowdowns

In addition to data and metrics, accountability for engineers is incredibly important for ensuring code quality and application health. You can find more information about building a culture of accountability here.
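
The four metrics listed above could feed an automated quality gate along the following lines. This is only a minimal sketch: the class name, the method signature and both thresholds (a 1.5x increase in error count, a 1.2x increase in latency) are assumptions made for illustration, not part of any particular tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical release gate; every name and threshold here is illustrative.
final class ReleaseGate {
    static final double INCREASE_FACTOR = 1.5; // assumed threshold
    static final double SLOWDOWN_FACTOR = 1.2; // assumed threshold

    /** Returns the reasons to block a release; an empty list means the gate passes. */
    static List<String> evaluate(Map<String, Integer> baselineCounts,
                                 Map<String, Integer> candidateCounts,
                                 Set<String> resolvedErrors,
                                 double baselineLatencyMs,
                                 double candidateLatencyMs) {
        List<String> reasons = new ArrayList<>();
        // Sort by error name so the report is stable between runs.
        Map<String, Integer> sorted = new TreeMap<>(candidateCounts);
        for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
            String error = entry.getKey();
            int count = entry.getValue();
            if (resolvedErrors.contains(error)) {
                // Marked as fixed in the past, but seen again in this release.
                reasons.add("resurfaced error: " + error);
            } else if (!baselineCounts.containsKey(error)) {
                // Never seen in the previous release.
                reasons.add("new error: " + error);
            } else if (count > baselineCounts.get(error) * INCREASE_FACTOR) {
                // Known error that now happens noticeably more often.
                reasons.add("increasing error: " + error);
            }
        }
        if (candidateLatencyMs > baselineLatencyMs * SLOWDOWN_FACTOR) {
            reasons.add("slowdown: " + candidateLatencyMs + " ms vs " + baselineLatencyMs + " ms");
        }
        return reasons;
    }
}
```

A CI job could call evaluate with counts collected from a staging run and fail the build whenever the returned list is not empty.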

JVM Architecture 101: Get to Know Your Virtual Machine

OverOps — Sun, 06 May 2018 08:50:36 +0000

Java applications are all around us: they’re on our phones, on our tablets, and on our computers. In many programming languages, this means compiling the code multiple times in order for it to run on different OSes. For us as developers, maybe the coolest thing about Java is that it’s designed to be platform-independent (as the old saying goes, “Write once, run anywhere”), so we only need to write and compile our code once.

How is this possible? Let’s dig into the Java Virtual Machine (JVM) to find out.

The JVM Architecture

It may sound surprising, but the JVM itself knows nothing about the Java programming language. Instead, it knows how to execute its own instruction set, called Java bytecode, which is organized in binary class files. Java code is compiled by the javac command into Java bytecode, which in turn gets translated into machine instructions by the JVM at runtime.
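
To make "binary class files" a little more concrete, here is a small sketch that parses the first eight bytes of a class file. Every class file starts with the magic number 0xCAFEBABE, followed by a minor and a major version number. The class name ClassFileHeader and its parse method are made up for this illustration.

```java
import java.nio.ByteBuffer;

// Illustrative helper, not part of the JDK.
final class ClassFileHeader {
    static final int MAGIC = 0xCAFEBABE;

    final int minorVersion;
    final int majorVersion;

    private ClassFileHeader(int minorVersion, int majorVersion) {
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
    }

    /** Parses the magic number and version from the leading bytes of a class file. */
    static ClassFileHeader parse(byte[] classFileBytes) {
        if (classFileBytes.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        // ByteBuffer is big-endian by default, which matches the class file format.
        ByteBuffer buffer = ByteBuffer.wrap(classFileBytes);
        int magic = buffer.getInt();
        if (magic != MAGIC) {
            throw new IllegalArgumentException(
                    "not a class file, magic was 0x" + Integer.toHexString(magic));
        }
        int minor = buffer.getShort() & 0xFFFF; // unsigned 16-bit value
        int major = buffer.getShort() & 0xFFFF; // unsigned 16-bit value
        return new ClassFileHeader(minor, major);
    }
}
```

Feeding it the bytes of any compiled class, for example via Files.readAllBytes, reports the class file version the compiler targeted; a major version of 52 corresponds to Java 8.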

Threads

Java is designed to be concurrent, which means that different calculations can be performed at the same time by running several threads within the same process. When a new JVM process starts, a new thread (called the main thread) is created within the JVM. From this main thread, the code starts to run and other threads can be spawned. Real applications can have thousands of running threads that serve different purposes. Some serve user requests, others execute asynchronous backend tasks, etc.
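
As a rough illustration of spawning threads from the main thread, the sketch below splits a sum across a few worker threads and waits for them to finish. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class ThreadsDemo {
    /** Splits the range 1..n across workerCount threads and returns the combined sum. */
    static long parallelSum(int n, int workerCount) {
        AtomicLong total = new AtomicLong();
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            final int offset = w;
            // Each worker sums every workerCount-th number, starting at 1 + offset.
            Thread worker = new Thread(() -> {
                long partial = 0;
                for (int i = 1 + offset; i <= n; i += workerCount) {
                    partial += i;
                }
                total.addAndGet(partial);
            }, "worker-" + w);
            workers.add(worker);
            worker.start();
        }
        // The calling thread blocks here until every worker has finished.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println("started on thread: " + Thread.currentThread().getName());
        System.out.println("sum of 1..100 = " + parallelSum(100, 4));
    }
}
```

When launched with java ThreadsDemo, the first line reports the thread that the JVM created to run main; the workers are spawned from it.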

Stack and Frames

Each Java thread is created along with a frame stack designed to hold method frames and to control method invocation and return. A method frame is used to store data and partial calculations of the method to which it belongs. When the method returns, its frame is discarded. Then, its return value is passed back to the invoker frame that can now use it to complete its own calculation.
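
One way to observe frames being pushed and discarded is to count the entries in the current stack trace. In the sketch below (the names are illustrative), calling inner from outer adds exactly one frame, and the value inner returns is handed back to outer, which uses it to finish its own calculation.

```java
final class FrameDepthDemo {
    /** Number of frames on the calling thread's stack, including this method's own frame. */
    static int currentDepth() {
        return new Throwable().getStackTrace().length;
    }

    /** Returns how many frames deeper inner() runs compared with outer(). */
    static int outer() {
        int depthInOuter = currentDepth();
        int depthInInner = inner(); // a new frame for inner is pushed, then discarded on return
        return depthInInner - depthInOuter;
    }

    static int inner() {
        return currentDepth();
    }
}
```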

The JVM playground for executing a method is the method frame. The frame consists of two main parts:

Local Variables Array – where the method’s parameters and local variables are stored
Operand Stack – where the method’s computations are performed

How It Works

Let’s go over a simple example to understand how the different elements play together to run our program. Assume we have this simple program that calculates the value of 2+3 and prints the result:

class SimpleExample {
    public static void main(String[] args) {
        int result = add(2,3);
        System.out.println(result);
    }

    public static int add(int a, int b) {
        return a+b;
    }
}

To compile this class we run javac SimpleExample.java, which results in the compiled file SimpleExample.class. We already know this is a binary file that contains bytecode. So how can we inspect the class bytecode? Using javap.

javap is a command line tool that comes with the JDK and can disassemble class files. Calling javap -c -p prints out the disassembled bytecode (-c) of the class, including private (-p) members and methods:

Compiled from "SimpleExample.java"
class SimpleExample {
  SimpleExample();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: iconst_2
       1: iconst_3
       2: invokestatic  #2                  // Method add:(II)I
       5: istore_1
       6: getstatic     #3                  // Field java/lang/System.out:Ljava/io/PrintStream;
       9: iload_1
      10: invokevirtual #4                  // Method java/io/PrintStream.println:(I)V
      13: return

  public static int add(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: iadd
       3: ireturn
}

Now what happens inside the JVM at runtime? Running java SimpleExample starts a new JVM process, and the main thread is created. A new frame is created for the main method and pushed onto the thread’s stack.

public static void main(java.lang.String[]);
  Code:
     0: iconst_2
     1: iconst_3
     2: invokestatic  #2                  // Method add:(II)I
     5: istore_1
     6: getstatic     #3                  // Field java/lang/System.out:Ljava/io/PrintStream;
     9: iload_1
    10: invokevirtual #4                  // Method java/io/PrintStream.println:(I)V
    13: return

The main method has two variables: args and result. Both reside in the local variable array. The first two bytecode commands of main, iconst_2 and iconst_3, push the constant values 2 and 3 (respectively) onto the operand stack. The next command, invokestatic, invokes the static method add. Since this method expects two integers as arguments, invokestatic pops two elements from the operand stack and passes them to the new frame created by the JVM for add. main’s operand stack is empty at this point.

public static int add(int, int);
  Code:
     0: iload_0
     1: iload_1
     2: iadd
     3: ireturn

In the add frame, these arguments are stored in the local variable array. The first two bytecode commands, iload_0 and iload_1, push the 0th and the 1st local variables onto the operand stack. Next, iadd pops the top two elements from the operand stack, sums them up, and pushes the result back onto the stack. Finally, ireturn pops the top element and passes it to the calling frame as the return value of the method, and the frame is discarded.
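
The steps taken inside the add frame can be mimicked with a toy model. This is only a sketch for intuition: ToyFrame is an invented class that understands just the four instructions used by add, and a real JVM is far more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a method frame: a local variable array plus an operand stack.
final class ToyFrame {
    private final int[] localVariables;
    private final Deque<Integer> operandStack = new ArrayDeque<>();

    /** The method arguments become the first local variables, as in a static method. */
    ToyFrame(int... arguments) {
        this.localVariables = arguments.clone();
    }

    /** Executes a tiny subset of bytecode and returns the value handed to ireturn. */
    int execute(String... instructions) {
        for (String instruction : instructions) {
            switch (instruction) {
                case "iload_0":
                    operandStack.push(localVariables[0]);
                    break;
                case "iload_1":
                    operandStack.push(localVariables[1]);
                    break;
                case "iadd": {
                    // Pop the top two values and push their sum.
                    int right = operandStack.pop();
                    int left = operandStack.pop();
                    operandStack.push(left + right);
                    break;
                }
                case "ireturn":
                    // The top of the stack is the return value; the frame is done.
                    return operandStack.pop();
                default:
                    throw new IllegalArgumentException("unsupported instruction: " + instruction);
            }
        }
        throw new IllegalStateException("method ended without ireturn");
    }
}
```

Calling new ToyFrame(2, 3).execute("iload_0", "iload_1", "iadd", "ireturn") walks through the same four steps as the add method and returns 5.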

public static void main(java.lang.String[]);
  Code:
     0: iconst_2
     1: iconst_3
     2: invokestatic  #2                  // Method add:(II)I
     5: istore_1
     6: getstatic     #3                  // Field java/lang/System.out:Ljava/io/PrintStream;
     9: iload_1
    10: invokevirtual #4                  // Method java/io/PrintStream.println:(I)V
    13: return

main’s stack now holds the return value of add. istore_1 pops it and sets it as the value of the variable at index 1, which is result. getstatic pushes the static field java/lang/System.out of type java/io/PrintStream onto the stack. iload_1 pushes the variable at index 1, which is the value of result that now equals 5, onto the stack.

So at this point the stack holds 2 values: the ‘out’ field and the value 5. Now invokevirtual is about to invoke the PrintStream.println method. It takes two elements from the stack: a reference to the object on which println is going to be invoked (pushed first) and, on top of it, the integer argument passed to println, which expects a single argument. This is where the main method prints the result of add. Finally, the return command finishes the method. The main frame is discarded, and the JVM process ends.

“Write Once, Run Anywhere”

So what makes Java platform-independent? It all lies in the bytecode.

As we saw, any Java program compiles into standard Java bytecode. The JVM then translates it into the specific machine instructions at runtime. We no longer need to make sure our code is machine-compatible. Instead, our application can run on any device equipped with a JVM, and the JVM will do it for us. It’s the job of the JVM’s maintainers to provide different versions of JVMs to support different machines and operating systems.
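
A small way to see this in practice: the class below, once compiled, can be copied unchanged to machines with different operating systems and CPU architectures, and each JVM reports its own platform. The class name is made up; the system properties it reads are standard ones.

```java
final class WhereAmI {
    /** Describes the platform that the current JVM is running on. */
    static String describePlatform() {
        return System.getProperty("os.name")
                + " / " + System.getProperty("os.arch")
                + " / " + System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        // Same bytecode everywhere; only the JVM underneath differs.
        System.out.println(describePlatform());
    }
}
```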

This architecture enables any Java program to run on any device having a JVM installed on it. And so the magic happens.

Final Thoughts

Java developers can write great applications without understanding how the JVM works.

However, digging into the JVM architecture, learning its structure, and realizing how it interprets your code will help you become a better developer. It will also help you tackle really complex problems from time to time 🙂

PS. If you’re looking for a deeper dive into the JVM and how all of this relates to Java exceptions, look no further! (It’s all right here.)

Written by Tzofia Shiftan. First published on The OverOps Blog.

Why I Deleted My IDE; and How It Changed My Life For the Better

OverOps — Wed, 18 Apr 2018 09:16:08 +0000

About 3 years ago, I made a big change in the way that I write code. It occurred to me that in a lot of cases, my IDE was slowing me down more than it was helping me work. So, I made the drastic decision to delete it completely.

Of course, many people are shocked when they hear about this, and it’s definitely not recommended for every developer. In this post, I’ll share the motivations behind deleting the IDE in the first place, how I moved forward without it and who else may want to consider this as an option.

Ready?

5 Reasons Why I Deleted My IDE

1. System Performance

I work with many programming languages at the same time: Java for our servers, C++ for some of our clients, JavaScript+CSS+HTML for our frontend, along with Groovy+Bash for some automated tasks. Running multiple IDEs for each language required too many resources (i.e. CPU/RAM) which led to issues running the actual program I was developing. Some of the IDEs froze from time to time while others just crashed. With all of this, it took too long to compile, link and publish the code.

Plus, my computer took a long time to warm up every time I rebooted it because it was trying to run so many IDEs. To avoid the long wait, I started to put my computer in sleep mode for months instead of shutting it down, which caused the OS to become slower over time.

2. System Reliability – “Voodoo”

As mentioned above, when you work with an IDE, you have to deal with it just crashing for no reason. It happened to me especially with Eclipse and Xcode but also with other IDEs.

Other times, exceptions in your program cause bugs in the IDE or unexpected behavior. Most of the time either clean+build or clean+publish solves it, other times you have to close the IDE, clean its metadata and configure it from scratch.

For example, there were times I just couldn’t stop the server, probably because of a bug in my code. Still, it’s reasonable to expect the IDE to understand that I’m in the middle of development and allow me to kill the process if bugs occur.

3. Multiple Environments

My day-to-day work requires me to switch between building features, code reviews and bug fixing. I want the action of switching between tasks to be cheap (in time) and to be able to work on many things in parallel. With an IDE this is much harder to achieve, for two reasons. First, it uses a lot of CPU, RAM, etc. Second, multiple instances of the same IDE might interfere with each other.

On top of that, when switching between branches during work, which happens a lot because I work on features and bugs at the same time, the IDE took forever to re-index the code and get ready for work.

4. Working on Remote Servers

For my work, I sometimes have to run multiple servers on multiple remote machines and run tests against them. I don’t like to build a package and install it just for the testing. I prefer to use these machines as if they were mine, and working without an IDE makes this task much easier.

Still, it’s more comfortable to edit the code using my favorite text editor (Sublime) instead of using vim or other command line-based editors. I overcome this issue by using a combination of rsync, git push/pull, sftp or even just copy/paste into vim.

5. Accessibility

Even when the IDE is fully configured with custom shortcuts, there are actions that require the use of the mouse. The problem with this is that even just moving your hand from the keyboard to the mouse for ordering, resizing views or moving the keyboard focus to a certain part of the IDE slows you down. This is a matter of personal preference, but I prefer to avoid this as much as possible.

Another important thing regarding the IDE is the amount of space it takes from the screen: one row for the window title, another one for menus and another for the toolbar. And that’s even before the tab switcher between open files starts. That causes the code itself to be limited to only part of the screen and uncomfortable to work on.

So, You Deleted Your IDE — What’s Next?

1. Learning About IDE Functions

First, you’ll have to learn about what the IDE does for you, how it compiles the code, how it publishes it to the servers, and how the files are organized in the file system. Luckily for me, OverOps uses standard build tools like Gradle, CMake, Bash and Groovy which can be run directly from the command line.

So, I run these tools from the command line and use a combination of Bash commands in order to publish and run OverOps. Basic file operation commands like cp, mv and rm manipulate the files, while less, nohup, kill, ps, htop, etc. can then be used to monitor and control the program, just as it is done in the production environment.

2. Building Your Own Tooling

After a while, you will start to realize which commands are used the most and you can start to create some aliases for them.

For my own work, I put those aliases in a ~/.bashrc file which is included automatically in every Bash session. Once the file got bigger and included more valuable aliases, I decided it was the right time to back it up. I created a new private git repository (used bitbucket for it) and put the file there. I cloned the repository into a local directory and used the source command to include it from the ~/.bashrc file.

At this point, I started to really enjoy this way of working, without an IDE. So, I decided it was time to put some extra effort into modularizing this file like I would for normal code. I created one main file to contain smaller files. Each file has a job, like git.sh contains all the aliases and functions related to working with git, while docker.sh contains the basic commands to work with docker. There are also some common files which contain aliases and functions unrelated to any specific program, and some families of files related to specific components in projects I work on, like moduleXXX.sh.

3. Discovering What You Can Do That Your IDE Couldn’t

One interesting thing about this work style is that you have the power to create scripts that do complicated tasks that you could never imagine having as a part of the IDE. You can create functions which automate the process of building, running, calling some internal code, waiting for the process to close and checking the exit code at the end. All of this with one simple command. Of course, this can also be achieved from within the IDE by creating a plugin, but writing a plugin takes much more effort than just writing some Bash function.
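
The author does this with Bash functions; the same build, run, wait and check-the-exit-code idea can be sketched in Java as well. Everything here is illustrative: the Pipeline class, its method names and the exit codes returned on failure to launch (127) or on interruption (130) are assumptions that loosely mimic shell conventions.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical helper: runs steps in order and stops at the first failure.
final class Pipeline {
    /** Runs the steps in order; returns 0 on success or the first non-zero exit code. */
    static int run(List<IntSupplier> steps) {
        for (IntSupplier step : steps) {
            int exitCode = step.getAsInt();
            if (exitCode != 0) {
                return exitCode; // later steps are skipped
            }
        }
        return 0;
    }

    /** Wraps an external command as a step that waits for it and yields its exit code. */
    static IntSupplier command(String... commandLine) {
        return () -> {
            try {
                Process process = new ProcessBuilder(commandLine).inheritIO().start();
                return process.waitFor();
            } catch (IOException e) {
                return 127; // assumption: treat "could not start" like a missing command
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 130; // assumption: treat interruption like Ctrl+C in a shell
            }
        };
    }
}
```

A build-then-run sequence would then be a list of Pipeline.command(...) steps, for example a Gradle build followed by a start script, with the actual command lines depending on the project.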

A good example of this is one of the main projects I led in my company. We needed to do a port of our agent to the AIX platform, but at first we couldn’t find an easy way to run X Server with a GUI on it. Without the IDE, though, this wasn’t a problem. I just installed Bash on that machine and worked on it like it was my own development machine.

At that time, I realized the real power of working without an IDE was with the remote machines. I could run benchmarks on our product across many machines and test network issues like load balancing, firewalls etc. If an error comes up, no problem, I just add a few debug prints and start it over. The cycle of rebuild is reduced to less than a minute instead of rebuilding a debug version using the build machines.

Who Should Consider Working Without an IDE?

Working without an IDE is not an easy task. It requires a deep knowledge of the technology you are working with, plus you will need to be familiar with the shell environment of the OS you use.

The learning curve of working without an IDE is steeper than working with one, so this isn’t recommended for beginners. For a beginner, the benefits of working without an IDE also won’t be worth the effort, since there is usually no need to work on remote machines or jump between tasks.

This work style is more suitable for technical leads who jump between tasks frequently; the ones who need to run a Proof of Concept (POC) for some new feature, fix a bug in production and review a newly developed feature all within a couple of hours. In these cases, the context switching between tasks is much cheaper if working in multiple environments.

It can also work well for developers who do Ops tasks like benchmarking, bug investigation in production, networking, or any tasks that need to run on remote machines.

Debugging in Development Without the IDE

Debugging is another story. Personally, I never was a big fan of IDE debuggers since debugging skills are very specific to the language, IDE and operating system. That’s why even when I worked with an IDE I rarely used the debugger. Instead I prefer to use debug prints. They work for every language, on every platform. The code should be modeled in a way that helps you to run parts of it easily without needing to run the whole server. Two things that help with this are avoiding shared state and decoupling components.

Luckily, deciding to work without an IDE won’t affect the way you debug in production or pre-production (staging). This will only change your backend development workflow, so all of your current monitoring tools will work as usual (that means you can still use OverOps! 😉 ).

Final Thoughts

(Edited) Concerning refactoring, I will say that code refactoring is definitely easier with an IDE since IDEs have built-in support for it. Without the IDE, I use regexp or some other small dedicated scripts, depending on the specific task. However, refactoring with regexps will make you an expert with them, and they are quite a handy tool to master. Sublime, which is the primary text editor I use, can index thousands of files very quickly and allows you to find/replace in less than a second. (/Edited)

I won’t try to convince you that every developer should immediately delete their IDE and start working the way I do. Aside from the effort of configuring the environment for the first time (which may take weeks), from time to time you still need to stop for maintenance, add some aliases, remove others, add support for a new deployment mechanism, etc.

For any developer that works comfortably with the IDE, this isn’t recommended. But, for anyone that is often irritated by the performance and functionality of the IDE, you should know that there is a way forward without it.

Written by David Levanon. Originally published on The OverOps Blog

5 Ways Developers Waste More Than 20% of Their Work Week

OverOps — Tue, 06 Feb 2018 13:23:22 +0000

We looked into the most time-consuming tasks for developers. It turns out, more than 25% of their time, on average, is spent troubleshooting.

More than a full day out of a developer’s work week can be spent troubleshooting production errors. In many cases, it’s even more than that. We hear all the time from engineering teams that their developers are spending at least 25% of their time, on average, solving (or trying to solve) production issues. That means they’re dedicating more than a full day of their work week to troubleshooting.

Where does all of this time come from? And how does it add up so quickly?

1. Identifying there was an error

The first step in solving any problem is admitting you have one in the first place. Then, you should probably figure out what the problem is, if you don’t already know. Surprisingly (or not), this is actually one of the parts in the debugging process that gives developers the most trouble.

How do you first figure out that something isn’t working right?

From the people that we spoke with, we learned that, on average, well over half of production errors are reported by the end users. Many of those companies rely on users to provide feedback for up to 80-90% of errors.

Right away, we know that it isn’t good practice to rely on your users to tell you when you have a problem or that something isn’t working right. Even worse, you don’t want to have problems that AREN’T being reported by users, because that will relate directly to low customer satisfaction and customer churn.

The problem with relying on end users to report errors is that more often than not, their reports are missing critical information about the error. The dialogue, if you’re lucky enough to be able to have one, involves a lot of back and forth between engineering, support, and QA, and retrieving the information needed for troubleshooting sucks up tremendous amounts of time.

2. Determining the severity of the error

Whether your information is coming from user feedback or is showing up in the logs, determining the error rate is essential to recognizing if you’re dealing with a critical issue or not. If the error happens infrequently and doesn’t have a large impact on users, dedicating time to reproducing and resolving the problem may not be worth the cost when there are much more critical errors happening.

When relying on logs and end users, your means for understanding error rates and severity are limited. When looking at exceptions, logged warnings or errors, the error rate is a key metric for triaging the system’s health as a whole. To identify the most critical issues that require our developers’ attention, we use OverOps to sort through new errors and their frequency, before zooming in on their root cause analysis. Some of our most successful users are implementing an “Inbox Zero” approach to exceptions, as if they were emails that require handling. Here’s how they do it.
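
As a rough sketch of triaging by error rate (not a description of how OverOps works internally), the example below computes failures per invocation for each error and keeps only those at or above a threshold, highest rate first. The class names, fields and the threshold are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative triage helper; the data would come from logs or a monitoring tool.
final class ErrorTriage {
    static final class ErrorStat {
        final String name;
        final long failures;
        final long invocations;

        ErrorStat(String name, long failures, long invocations) {
            this.name = name;
            this.failures = failures;
            this.invocations = invocations;
        }

        /** Fraction of invocations that ended in this error. */
        double rate() {
            return invocations == 0 ? 0.0 : (double) failures / invocations;
        }
    }

    /** Returns the errors whose rate is at or above the threshold, highest rate first. */
    static List<ErrorStat> critical(List<ErrorStat> stats, double threshold) {
        List<ErrorStat> result = new ArrayList<>();
        for (ErrorStat stat : stats) {
            if (stat.rate() >= threshold) {
                result.add(stat);
            }
        }
        result.sort((a, b) -> Double.compare(b.rate(), a.rate()));
        return result;
    }
}
```

With a threshold of 0.05, an error that fails 5 times in 1,000 calls is left for later, while one that fails 50 times in 100 calls goes to the top of the list.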

3. Locating the affected code (or, Sifting through log files)

Once the error has been identified, it’s time to find its actual location in the code and make sure it’s assigned to the right developer. There are a couple of ways to do this, but the most common practice is to spend hours, or even days, sifting through log files and looking for clues. Aside from this being comparable to looking for a needle in a haystack, the developer who is tasked with resolving the issue may not have a clear idea of what he or she is looking for. Plus, the information they need may not have been written to the logs at all.

Using log management tools, like Splunk or ELK, can help cut through the noise, but they can’t help when the piece of information that you’re looking for was never written to the logs to begin with. In that case, the only way to get the information you need is to add logging verbosity and hope that when (if!) the error happens again, you can see what’s going on.
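
Adding logging verbosity usually means writing the relevant variable state next to the failure. The sketch below shows one hedged way to do that with java.util.logging; the parseQuantity method, its parameters and the log line format are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

final class VerboseLogging {
    private static final Logger LOG = Logger.getLogger(VerboseLogging.class.getName());

    /** Builds one log line holding the operation, the failure and the variable state. */
    static String describe(String operation, Map<String, Object> state, Throwable failure) {
        StringBuilder line = new StringBuilder(operation).append(" failed: ").append(failure);
        for (Map.Entry<String, Object> entry : state.entrySet()) {
            line.append(' ').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return line.toString();
    }

    /** Hypothetical business method that logs its inputs when parsing fails. */
    static int parseQuantity(String rawQuantity, String orderId) {
        try {
            return Integer.parseInt(rawQuantity.trim());
        } catch (NumberFormatException e) {
            // Record the values that led to the failure before rethrowing.
            Map<String, Object> state = new LinkedHashMap<>();
            state.put("orderId", orderId);
            state.put("rawQuantity", rawQuantity);
            LOG.log(Level.WARNING, describe("parseQuantity", state, e), e);
            throw e;
        }
    }
}
```

The catch is the one the text describes: this only helps the next time the error occurs, and only if the logged variables turn out to be the ones that matter.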

4. Reproducing the error

If your logs don’t give you a clear answer to why the error happened, and this is highly likely, trying to reproduce it is the next step. Plus, reproducing an error before you attempt to fix it is good practice regardless. After all, if you don’t first reproduce the problem then how can you be sure that you’ve fixed it when you’re done debugging?

Unfortunately, with vague reports coming from users and unclear logging statements surrounding the error, finding the exact flow of events that caused it takes a lot of time and may even be impossible.

If you’re lucky, the problem might have occurred in a part of the code that you recently worked on and you can easily deduce the steps that led to it. Otherwise, the most common way of finding the event flow is to try different things and observe the results, send the ticket to QA for more testing, look through the logs or maybe add more logging statements. More often than not, failure in this step is what causes tickets to be closed with the infamous… “could not reproduce”.

5. Entering “war room mode”

It’s hard to dispute the benefits of working in a “war room-like” set-up. It can bring with it project clarity, stronger communication and increased productivity. So, maybe you’re thinking about rearranging the furniture in your office to boost innovation and progress, or maybe your application crashed for some unknown reason and you need to figure out why RIGHT NOW.

If you’re lucky, you might only have to do a war room once or twice a year when something catastrophic happens. For others, war room situations occur on a weekly, or even daily, basis. With so much time being dedicated to resolve issues, there’s hardly any time left to advance the product roadmap.

Some of the developers that we’ve spoken with recently described gathering for a war room situation and still not having the information to move forward with a solution. Some described war room situations that lasted for 5 or 6 days. That’s bad. Not only do these situations take time away from the rest of your work tasks, they can also hurt the reputation and revenue of the company.

The Cure

The most effective way that we’ve found to cut down on the time that your team spends debugging is to automate the error resolution process. That means automating not only the identification of errors but, more importantly, the root cause analysis, with access to the complete source code and variable state across the entire call stack of every error.

OverOps created a tool that does just that. Not only does it provide you with the source and full call stack of any exception or error, it reveals the exact variable state at the time of error so teams no longer need to spend endless hours trying to reproduce it. Check out how it works here.

Final Thoughts

Production debugging is absolutely necessary, but letting it take up so much time ISN’T! That 25% adds up quickly. On a team of four developers, a quarter of each person’s week adds up to one full-time salaried developer spent on troubleshooting for every 3 working on new features and projects. That’s crazy!

It’s time to start automating the production debugging process so that you can get out of the war room and back to your backlog.

Written by Tali Soroker, originally published on the OverOps Blog.