Forem: Shannon Winter

Behind the scenes of our security incident management process

Shannon Winter — Wed, 13 Feb 2019 21:06:16 +0000

On the security team, we don’t manage any Atlassian products like other Atlassian teams do. Our main product is trust, and that’s a job that’s never finished.

To me, security is more of a mindset; one of constant diligence, continuous improvement, and seeking out ways to innovate.

Sometimes security teams can act like more of a blocker than a facilitator, holding up products with burdensome gate checks. But my team strives to assist in the creation and management of secure software, making sure that teams can collaborate and get their work done in the safest way possible. When great products get shipped in a safe and secure way, we know we’ve done our job. To that end, there are three principles that guide our work and inform our action plans and responses to security incidents:

Our guiding principles

1. Be open and available

No one benefits from a security team that works in the shadows or doesn’t share information. We know that if we want to be involved in the development of safe software we have to be approachable and available to everyone: engineering, support, partners, and customers. And we encourage anyone with a concern to report it to us, no matter the severity of the issue or their role.

2. Be consistent in word and action

Being available to people wouldn’t mean much if we weren’t also consistent in how we approached our work. In order to preserve and enforce policies and procedures, we all need to be on the same page and act in predictable ways. We document how we work and then we share this information across our internal collaboration tools and with our broader customer and partner community. We also publish security stats, data, policies, and procedures on our public trust website.

3. Always seek better ways

Consistency doesn’t benefit anyone if we’re consistently, well, wrong. So we’re always striving to improve our monitoring, our tooling, our procedures, and to keep ahead of potential risks. We have a team whose main purpose is to study vulnerabilities in the wild and then come up with ways to fortify our systems against them so we’re not losing ground by constantly being in reaction mode.

Meet our team

Security is built into all of our products and processes, which are shared across the company and the community. There are three main groups at Atlassian that work actively on preventing and responding to security incidents:

Security Engineering. These are the people that review code and look into the security of our products, making sure that they uncover vulnerabilities and address them in a timely fashion, preferably before they ship! We want all of our products to be secure from the ground up. Often members of a product’s security engineering team are involved in the response to a security incident so they have the context they need when it comes to remediating the vulnerability.

Security Intelligence. This team actively looks for suspicious things happening, or things that could happen, on the entire network. They respond to reports from customers and partners, as well as the internal teams they work with. This is the group that actively protects our products and systems against vulnerabilities and directly responds to incidents.

Policies and Trust. This team is responsible for the communications of our security rules and policies and they publish to the trust website I mentioned above. The information they publish is meant to be useful to the broader security and development community, not just Atlassian customers. Again, referring back to our three guiding principles, we want to make information available to everyone and break down the barriers that often surround security discussion.

How we build our stack

People often ask us what we’re using in our own stack, especially as it pertains to security. We use a mix of our own products plus integrate with a few other best-of-breed tools. Being able to share information across the organization – and with partners and customers – is our first priority, so we naturally make sure that our stack is harmonized for this purpose.

Here’s a list of products we use and how we use them:

Comprehensive detection and analysis: We use Splunk to query our systems, with heuristic analysis and anomaly detection based on policies our Security Intelligence team writes. When it comes to risk we want to cover all of our bases, so we write policies based on both historical and theoretical incidents.

Agile communications loop: If we find something that just requires a quick fix we take care of it in the moment and then log what happened. But for something more involved we first record the incident in Jira, and then also in Slack, for broader communication.

Connected conversations: Sometimes alerts come in from customers or partners via Jira Service Desk and then go to Opsgenie to alert the right people. These alerts are also sent to Jira, and then Slack, so transparency is broad and everyone has visibility.

Knowledge capture and transfer: A lot of our playbooks are stored in Confluence, and if we need to use any of them as a guide to a response we’ll reference them in the Jira ticket. From there, if a conversation also takes place in email, that information gets logged in Jira, too. And if the incident gets updated in Jira we’ll see that in Slack. This setup lets people give input through whichever tool makes the most sense for them, so we don’t have to worry about anyone being left in a communication silo. Everyone can have access to everything.
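To make the flow between tools concrete, here is a minimal sketch of the routing described above. It only models the order in which tools see an incident; it does not call any real Jira, Opsgenie, or Slack API, and the names `Incident` and `route` and the step labels are assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; the fields are illustrative only."""
    summary: str
    source: str              # e.g. "service_desk" for customer or partner reports
    quick_fix: bool = False  # True when it can be fixed in the moment


def route(incident: Incident) -> list[str]:
    """Return the ordered steps an incident passes through in this sketch."""
    if incident.quick_fix:
        # Quick fixes are handled right away, then logged.
        return ["fix", "log"]
    steps: list[str] = []
    if incident.source == "service_desk":
        # Customer and partner reports arrive through the service desk
        # and go to the alerting tool to reach the right people.
        steps += ["jira_service_desk", "opsgenie"]
    # Anything more involved is recorded in Jira first, then shared in Slack.
    steps += ["jira", "slack"]
    return steps


print(route(Incident("customer reports odd login emails", "service_desk")))
print(route(Incident("stale firewall rule", "internal", quick_fix=True)))
```

The point of the sketch is that every non-trivial incident ends up in the same two places (the ticket and the chat channel), no matter where it came in.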

How we respond

We respond to incidents in a predefined way that follows industry best practices, and the Security Intelligence team spends a lot of time detailing these processes and training people to follow them. We do this so we don’t react too quickly and make the wrong fix, but instead take the time to investigate an incident and agree on the approach to remediation.

Detect and analyze. As I said before, this is something that the Security Intelligence team spends a lot of time on. We write queries, look for certain vulnerable services, and measure the severity of any issues so we can determine what our response will be.
Investigate. Once an issue has been detected and we have an idea of its nature, we conduct an investigation to determine its severity and urgency. For this, we use the security classification system from the VERIS community. This helps us make sure we’re using our resources effectively and not over- or under-reacting to incidents.
Contain and eradicate. In fact, up until something is determined to be a threat, we don’t call it an incident. But once we make that determination, and we’ve classified it, we call on those planned responses that the Security Intelligence team spends so much time creating. We figure out who’s vulnerable, build the fix, and work with the product teams to get the patch ready to go.
Communicate. Once we have the fix ready to deploy we work with marketing, and sometimes legal, to make sure our communications to customers are timely and clear, and that everyone understands how to install the code fixes. Much of this work is done in Confluence, where we can review the draft, make comments and edits, and ensure the announcement is rock solid.
Post-incident review. After we’re confident the problem is fixed and everyone has what they need, we do a post-incident review, called a PIR, and we track this in Jira. This is usually a collection of tasks we assign ourselves to take care of any follow-up actions, like tweaks to our response process or new training for people who need it, and we assign deadlines to these tasks. We do this within the first week of the incident, when everything is still sharp in our minds. After deploying the fix, this is one of the best ways to make sure our products and systems are safe.
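The five steps above can be read as a simple linear workflow with one early exit: if the investigation finds no threat, the issue is never called an incident and the remaining steps do not run. The sketch below models that under those assumptions; the `Stage` enum and `next_stage` function are hypothetical names for this example, not part of any Atlassian tool.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECT_AND_ANALYZE = 1
    INVESTIGATE = 2
    CONTAIN_AND_ERADICATE = 3
    COMMUNICATE = 4
    POST_INCIDENT_REVIEW = 5


def next_stage(current: Stage, is_threat: bool = True) -> Optional[Stage]:
    """Return the stage that follows `current`, or None when the flow ends."""
    if current is Stage.INVESTIGATE and not is_threat:
        # Not a threat, so it is never treated as an incident.
        return None
    if current is Stage.POST_INCIDENT_REVIEW:
        # The review is the last step.
        return None
    return Stage(current.value + 1)


# Walk a confirmed threat through the whole process.
stage: Optional[Stage] = Stage.DETECT_AND_ANALYZE
while stage is not None:
    print(stage.name)
    stage = next_stage(stage)
```

Keeping the order explicit like this mirrors the reason for having a predefined process in the first place: nobody skips straight from detection to a fix.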

We’ve published more detailed information about how our team responds to incidents here in the Trust section of the website.

As you can see, everyone at Atlassian takes security pretty seriously and we devote a lot of effort to making sure we have dedicated teams that build safe software and maintain our systems to keep them as secure as possible. And we think that focusing on the three guiding principles (being open and available, being consistent, and always seeking better ways) is a great approach for building trust with our community.

The post Behind the scenes of our security incident management process appeared first on Atlassian Blog.

194 years of downtime: looking back on incident data from 2018

Shannon Winter — Thu, 13 Dec 2018 16:55:11 +0000

Statuspage customers logged more than 194 years of collective incidents in 2018. That’s a whopping 87% increase from the 104 years logged in 2017, and we aren’t even through December yet.
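For readers who want to check the headline number, the year-over-year change follows from the two totals quoted above. This is just a worked calculation; the helper name `percent_increase` is made up for the example.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from `old` to `new`, expressed as a percentage."""
    return (new - old) / old * 100


# 104 years of incidents logged in 2017 versus 194 years in 2018.
increase = percent_increase(104, 194)
print(round(increase))  # 87
```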

Open incident communication is becoming more and more important to companies and their customers. This is underlined by the big names who have set up a public Statuspage this year, like GitHub, LinkedIn, and Yelp. With more focus on incident communication comes more focus on incident management in general. Companies are spending more time and resources preparing for downtime, as we learned from a handful of customers we profiled on how they prepare for high traffic days.

We dug deeper into our 2018 data to get a better idea of when and how our customers communicated around downtime this year. The data represents all reported incidents – from small blips in service to large-scale outages – plus any planned downtime logged through scheduled maintenance.

What the numbers mean

Sure, the sharp increase in hours of incidents logged from 2017 to 2018 can in part be attributed to an increase in total number of Statuspage customers, but we also believe it reflects the increasingly cloud-first mentality of companies relying on SaaS products. Companies are choosing to communicate around these incidents, and customers have come to expect this type of transparency.

In addition to the jump in number of incidents logged this year, we also saw the average number of updates per incident nearly double. With an average of 4.4 updates per incident this year, we believe that companies are prioritizing frequent, transparent communication with their customers.

We were also surprised to see that nearly half of our customers (45% to be exact) have opted into some form of page automation by integrating with an alerting or monitoring tool. While we advocate for always keeping a human element in your incident comms process, setting up some level of automation can definitely save time when it matters most. Many customers take this hybrid manual/automated approach to save time without risking a poor customer experience.

While incidents logged and updates posted are rising, there are still very few postmortems written – only 3% of incidents logged in Statuspage over 2018 had a postmortem attached. This isn’t too surprising, as not every incident requires a postmortem (and some companies write postmortems on a company blog instead), but we imagine this percentage rising over 2019 as customers come to expect this type of follow-up.

Stand-out incidents

There are some days when downtime is more likely for certain companies or industries. Cyber Monday is one example – a day where e-commerce companies see an exponential increase in traffic to their websites or apps. For Amazon, Prime Day (their biggest sale of the year) is that day – rivaling even the craziest Black Friday and Cyber Monday traffic. Though the retail giant still achieved a record year in sales, shoppers had trouble connecting to Amazon.com for over an hour, causing a lot of customer frustration and an estimated revenue loss of up to 100 million dollars. The silver lining was a flood of cute dog pictures on Twitter, showcasing the power of a great error page.

For Epic Games, their “prime” traffic days came as players flocked to play their very popular video game, Fortnite. They experienced periods where over 3 million gamers were playing concurrently, resulting in some big service interruptions. During an incident in June, players from all over the world headed to Epic Games’ status page to see what was going on, resulting in a peak of about 15,000 requests per second (our most highly trafficked incident to date). Major kudos to Epic Games for writing very thorough postmortems to close the loop on big incidents.

Some form of downtime is inevitable – especially with an extreme load like the one Fortnite experiences. Epic Games shows us that it’s how you handle that downtime and communicate with your customers that really matters.

And we can’t forget the IRS, which had an unusually stressful 2018 Tax Day when their website crashed on April 17th, the tax filing deadline. This was highly problematic as approximately 10 million Americans wait to submit their taxes on the last day. They ended up extending the deadline to April 18th, but communication in the meantime wasn’t exactly ideal. The original IRS error message reported a planned downtime event from April 17th, 2018 to Dec 31st, 9999 – yikes.

Downtime happens to the best of us, but accurate and frequent updates go a long way. We wrote an open letter to the IRS offering some advice and a free Statuspage – offer valid until Tax Day 2019. We’re still waiting for them to take us up on it.

#HugOps for 2019

While there may have been more hours of downtime this year, there was also a lot more love and appreciation (#HugOps) shown to the companies who were open about the bad times – more than 7,000 tweets and retweets mentioning HugOps, in fact. We started sending actual HugOps posters to people who retweeted our digital HugOps posters, and have sent more than 70 this year. That means about 1% of all HugOps tweeters are now proudly displaying a Statuspage HugOps poster in their office – hooray!

The latest in Atlassian for incident management

While incident communication is a large part of incident management, it’s only one piece of a bigger puzzle. At Atlassian we’ve doubled down on our investment in incident management tools and practices. Check out what we’ve been up to:

Postmortems for Jira Ops: One of the most important parts of the incident management process is the postmortem. This is where incident response teams can learn, improve, and collect all the returns on the time and effort invested in resolving the incident. Unfortunately, the postmortem process is often neglected because it’s too time consuming and difficult to manage. A key time-saver with Jira Ops postmortems is the incident timeline, which gathers all the key events from the incident in chronological order. Teams can analyze what happened, identify root causes, and create Jira Software issues directly from the postmortem to ensure actions are taken to improve from every incident. Learn more.
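The idea behind an incident timeline is simple to illustrate: collect events from wherever they were recorded and put them in chronological order. The sketch below is not how Jira Ops builds its timeline; the `Event` type, the `build_timeline` function, and the sample events are all invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical timeline entry: when something happened and what it was."""
    at: datetime
    description: str


def build_timeline(events: list[Event]) -> list[Event]:
    """Return the events sorted from earliest to latest."""
    return sorted(events, key=lambda event: event.at)


# Invented sample events, deliberately out of order.
sample = [
    Event(datetime(2018, 6, 1, 10, 45), "fix deployed"),
    Event(datetime(2018, 6, 1, 10, 5), "alert fired"),
    Event(datetime(2018, 6, 1, 10, 20), "status page updated"),
]

for event in build_timeline(sample):
    print(event.at.strftime("%H:%M"), event.description)
```

With the events in order, it becomes much easier to see how long each phase took and where follow-up actions are needed.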

Automation Actions for Opsgenie: Incident responders often take predictable, repetitive actions in response to an alert. These actions might include gathering more info about a particular system, running network diagnostics, increasing cloud resources, or restarting a service. Automation Actions enable you to run automated scripts and playbooks via third-party platforms. Opsgenie now offers support for two automation integration methods: AWS Systems Manager and Generic REST Endpoint. Teams can integrate with these platforms to trigger the automated tasks right from the Opsgenie console or mobile app. This saves responders time, reduces the number of applications they need to use during incident response, and can positively impact MTTR. Learn more.

Tweet this report, get a poster

Anyone who tweets will receive a free HugOps poster to display as a reminder that your team is supported when downtime strikes in 2019…

This article appeared first on Atlassian Blog.