<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Public Cloud Group</title>
    <description>The latest articles on Forem by Public Cloud Group (@pcg).</description>
    <link>https://forem.com/pcg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F450560%2Fc054cdad-fccf-4fee-9ef6-64a96bdf5520.png</url>
      <title>Forem: Public Cloud Group</title>
      <link>https://forem.com/pcg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pcg"/>
    <language>en</language>
    <item>
      <title>Incidents and Operational Resiliency - Why it Matters and What to Consider</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Fri, 10 May 2024 13:47:44 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/incidents-and-operational-resiliency-why-it-matters-and-what-to-consider-3jm6</link>
      <guid>https://forem.com/pcg_public_cloud_group/incidents-and-operational-resiliency-why-it-matters-and-what-to-consider-3jm6</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Thomas Hoffmann&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utilizing technology and new work methods to save money&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Are you prepared for an IT incident? If a core component of your infrastructure suddenly fails or data is lost, who would you contact? About two thirds of German mid-sized companies have no answer to these questions [1] - even though being prepared could save enormous costs.&lt;/p&gt;

&lt;h2&gt;Facts and Data&lt;/h2&gt;

&lt;p&gt;According to a recent study, an average IT outage in a mid-sized company (200-5,000 employees) costs about 25,000 Euro per hour. On average, German companies experience up to four such outages per year, each lasting about 3.8 hours - causing annual economic damage of over 380,000 Euro! [1]&lt;/p&gt;
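For transparency, the annual figure follows directly from the per-hour cost, average outage duration, and outage count cited above:

```python
# Reproducing the study's arithmetic: cost per hour x hours per outage
# x outages per year (figures as cited in [1]).
cost_per_hour = 25_000      # Euro
hours_per_outage = 3.8
outages_per_year = 4

annual_damage = cost_per_hour * hours_per_outage * outages_per_year
print(annual_damage)  # 380000.0 Euro per year
```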

&lt;p&gt;There are many reasons for the high level of damage: even though only one third of registered outages have any impact on customer operations [2], internal disruptions alone can lead to widespread loss of productivity.&lt;/p&gt;

&lt;h2&gt;Causes and Mitigation&lt;/h2&gt;

&lt;p&gt;A process review can offer the greatest remedy: a full 20% of outages can be traced to poor process adherence [2]. Here, the root cause must be examined carefully: the reason for the process deviation is an important clue as to what can be improved. In times of hybrid and remote work environments, the requirements for processes change as well. "People before process" is a helpful mantra to keep in mind, keeping the focus on adjusting processes to the needs of your staff. This does not mean that "sensitivities" should dictate workflows, but that processes should support employees in their work as much as possible rather than hinder them.&lt;/p&gt;

&lt;p&gt;Technology can help as well: hyperscalers such as AWS in particular offer a plethora of ways to react to outages and errors - be it through smart monitoring and alerting or even automatic error resolution, for example by restarting a certain service or machine.&lt;/p&gt;
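As a minimal sketch of such automatic error resolution, a restart decision based on consecutive failed health checks might look like this (the threshold and function names are hypothetical illustrations, not an AWS API):

```python
# Restart a service once it fails a fixed number of consecutive health
# checks (threshold chosen for illustration only).
FAILURE_THRESHOLD = 3

def should_restart(health_history):
    """True when the most recent checks are all failures."""
    recent = health_history[-FAILURE_THRESHOLD:]
    return len(recent) == FAILURE_THRESHOLD and not any(recent)

# True = healthy check, False = failed check
print(should_restart([True, False, False, False]))  # True: restart warranted
print(should_restart([False, True, False, False]))  # False: not yet
```

On AWS, managed variants of this pattern exist, for example CloudWatch alarms with recovery actions or Auto Scaling health checks.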

&lt;p&gt;The choice of your cloud provider is therefore the first factor in a resilient infrastructure: AWS is one of the few cloud providers that has guaranteed the physical separation of its availability zones since its early years, thereby establishing geophysical redundancy. Microsoft Azure only introduced mandatory geophysical separation in 2018 [3], and Google Cloud Platform still does not guarantee significant physical separation of its zones - although a widely noticed incident in 2023 demonstrated why such separation makes sense. [4]&lt;/p&gt;

&lt;p&gt;The services and technologies used are also a key factor to achieving resiliency: smart monitoring and logging as well as properly configured autoscaling and established error management already go a long way.&lt;/p&gt;

&lt;p&gt;Finally, a missing or unknown disaster recovery concept is another reason for long-lasting outages. While a prepared company can ideally restore basic operation at the push of a button, unprepared ones often have to take inventory first to determine what actually needs to be done to restore operations.&lt;/p&gt;

&lt;p&gt;A prominent scenario in which you directly profit from being prepared is a ransomware attack. It not only affects your applications, but also renders all company data inaccessible. A well-architected cloud infrastructure with protected (tamper-proof) backups can save significant amounts of time and money in this case: affected applications can be quickly terminated, and well-rehearsed data recovery operations can reduce any data loss to an acceptable level.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Incidents and outages are expensive - even more so with a growing number of employees and/or customers. Dealing with an outage ad hoc is error-prone and time-consuming. Preparing by drawing on current data from industry and research and by establishing change, incident, and recovery procedures helps avoid incidents, or at least keep them short. In preparing, all parts of your production chain - processes, infrastructure, and threat models - should be considered.&lt;/p&gt;

&lt;h2&gt;Make our Expertise Your Own!&lt;/h2&gt;

&lt;p&gt;As an AWS strategic partner, kreuzwerker can draw on many years of experience to support you on your journey to resiliency: from best-practice and process reviews and optimization to expert knowledge on various AWS technologies, on observability and Elasticsearch, as well as on orchestrating your microservice deployments on Kubernetes - we are happy to support you to the best of our ability.&lt;/p&gt;

&lt;p&gt;Don't hesitate to get in touch if you wish to review your infrastructure and processes with regard to resilience - we look forward to working with you and supporting you on your resiliency journey!&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://digitalisationworld.com/news/27800/hp-studie-it-systemausf-auml-lle-kosten-deutsche-mittelst-auml-ndler-im-durchschnitt-fast-400000-euro-pro-jahr" rel="noopener noreferrer"&gt;https://digitalisationworld.com/news/27800/hp-studie-it-systemausf-auml-lle-kosten-deutsche-mittelst-auml-ndler-im-durchschnitt-fast-400000-euro-pro-jahr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Uptime Institute, Annual Outage Analysis 2023&lt;/p&gt;

&lt;p&gt;[3] &lt;a href="https://azure.microsoft.com/en-us/blog/azure-availability-zones-now-available-for-the-most-comprehensive-resiliency-strategy/" rel="noopener noreferrer"&gt;https://azure.microsoft.com/en-us/blog/azure-availability-zones-now-available-for-the-most-comprehensive-resiliency-strategy/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] &lt;a href="https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPfkY" rel="noopener noreferrer"&gt;https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPfkY&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>Enhancing Customer Workloads</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Fri, 26 Apr 2024 12:17:31 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/enhancing-customer-workloads-52oh</link>
      <guid>https://forem.com/pcg_public_cloud_group/enhancing-customer-workloads-52oh</guid>
      <description>&lt;p&gt;&lt;strong&gt;A Dive into Our Managed Services Program&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Fabian Duft&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;In today’s rapidly evolving digital landscape, businesses are continually seeking ways to optimize their operations while ensuring security and reliability. Managed services have emerged as a key solution, offering businesses the opportunity to offload their IT burdens and focus on core competencies. In this blog post, we’ll delve into a recent project in which we took over customer workloads for the sports and finance products of one of the most visited websites in Germany, outlining our journey from initial meetings to customer onboarding and ongoing improvements.&lt;/p&gt;

&lt;h2&gt;1. Initial Meetings: Understanding Customer Needs&lt;/h2&gt;

&lt;p&gt;Our journey began with a series of initial meetings with the client to thoroughly understand their needs and workloads. We engaged in open dialogue to grasp the details of their existing infrastructure and operations. These discussions were crucial in laying the foundations for a successful partnership, allowing us to tailor our solutions to meet the specific requirements of the client.&lt;/p&gt;

&lt;h2&gt;2. Signing of the Contract: Strengthening the Partnership&lt;/h2&gt;

&lt;p&gt;Following discussions and negotiations, we reached a mutual agreement and signed the managed services contract. This enhanced our partnership and signaled kreuzwerker’s commitment to delivering exceptional managed services to our client.&lt;/p&gt;

&lt;h2&gt;3. Customer Onboarding: Setting the Stage for Success&lt;/h2&gt;

&lt;p&gt;Customer onboarding is a critical phase in any managed services engagement. It sets the stage for a seamless transition and ensures that both parties are aligned in terms of expectations and deliverables. Our onboarding process consisted of several key steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Assessments:&lt;/strong&gt; We conducted comprehensive assessments of the client's infrastructure, spanning non-AWS infrastructure (external data providers for sports/finance data), AWS accounts, architecture diagrams, disaster recovery plans as well as access to relevant systems. These assessments provided valuable insights into the existing operations setup, allowing us to identify areas for optimization and improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Checks and Improvements:&lt;/strong&gt; Security is of great importance in today's digital landscape. kreuzwerker therefore conducted a thorough security analysis to identify vulnerabilities and implemented robust measures to improve the overall security of our customers' workloads. This included introducing best practices, implementing security protocols and reducing attack vectors against potential threats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer Support Channels:&lt;/strong&gt; We worked closely with the client to establish clear channels for customer support, including maintenance windows, service level agreements (SLAs), and communication via our service desk. &lt;br&gt;
Additionally, we set up a dedicated 24/7 on-call support system to ensure prompt resolution of critical issues in the production workloads. For this, we analyzed the occurrence and impact of previous incidents, both independently and with the help of the former operations team. This analysis helped us fine-tune existing monitors and set up new ones in order to reduce false positives and focus on relevant alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improving Existing Workloads:&lt;/strong&gt; In addition to the general onboarding tasks, we focused on improving the performance and security of the client's existing workloads. This involved&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Updating software to the latest security releases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Implementing best practices based on AWS Security Hub findings, such as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Disabling public access to S3 buckets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Closing inbound ports and using SSM port forwarding instead&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using OIDC in CI/CD pipelines instead of static IAM credentials&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimizing AWS costs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
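One of the items above, replacing static IAM credentials with OIDC, could look roughly like this in a GitHub Actions workflow (the role ARN, region, and job layout are hypothetical placeholders, not the client's actual pipeline):

```yaml
# Hypothetical workflow excerpt: the job requests a short-lived OIDC token
# and exchanges it for temporary AWS credentials, so no static access keys
# need to be stored as repository secrets.
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy-role
          aws-region: eu-central-1
      - run: aws sts get-caller-identity   # verify the assumed role
```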

&lt;p&gt;Taking over managed services is not just about taking over operations. It's about building trust, understanding details, and seamlessly integrating into our client's environment. Our onboarding process is a good example of this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-Month Shadowing Phase:&lt;/strong&gt; Learning and Collaborating&lt;br&gt;
Upon initiation, we embarked on a two-month shadowing phase, working closely with the existing operations team. This period was important, as it allowed us to get familiar with the client's environment, understand their workflows, and gain firsthand experience with the challenges they faced. Collaborating side by side with the previous operations team enabled us to build rapport, foster mutual respect, and lay a solid foundation for future collaboration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transition to Independence:&lt;/strong&gt; Taking the Lead&lt;br&gt;
After two months of intensive shadowing, we transitioned to a two-week period where the managed services team of kreuzwerker took the lead, with the previous operations team standing by to offer support if needed. This phase was a testament to our readiness and preparedness to assume full responsibility for the customer's workloads. With the support of our experienced team and the knowledge gained during the shadowing phase, we confidently navigated challenges, resolved issues, and ensured a smooth transition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operating Alone:&lt;/strong&gt; Confidence in Our Capabilities&lt;br&gt;
Since those two weeks in the lead, we have operated independently, seamlessly managing our customer's workloads with precision and expertise. The journey from shadowing to independence was not just about acquiring technical skills. It was about building relationships, gaining trust, and demonstrating our commitment to meeting customer expectations, ensuring continued success for our client.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Our managed services program has enabled us to successfully take over customer workloads for the sports and finance products of one of Germany's most visited websites. Through collaborative partnerships, comprehensive onboarding processes, and ongoing improvements, we have helped our client optimize their operations, enhance security, and achieve greater peace of mind. As we continue to evolve and adapt to the changing landscape of technology, we remain committed to delivering exceptional managed services that drive value and foster long-term success for our clients.&lt;/p&gt;

&lt;p&gt;If you're interested in kreuzwerker’s MSP offering, you can find more details under: &lt;a href="https://aws.kreuzwerker.de/en/managed-services" rel="noopener noreferrer"&gt;https://aws.kreuzwerker.de/en/managed-services&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>Prerequisites for building an efficient observability system</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Thu, 08 Feb 2024 09:48:57 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/prerequisites-for-building-an-efficient-observability-system-3m14</link>
      <guid>https://forem.com/pcg_public_cloud_group/prerequisites-for-building-an-efficient-observability-system-3m14</guid>
      <description>&lt;p&gt;&lt;strong&gt;What does a good observability system that serves your business needs cost?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Serhii Hainulin&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Software observability is crucial for ensuring optimal performance, reliability, and efficiency of your business's digital systems. In its broader sense, it encompasses all tools and activities related to controlling software system health and responding to failures and threats. As discussed in the post &lt;a href="https://kreuzwerker.de/en/post/software-observability-not-about-sunk-costs-about-lost-profit" rel="noopener noreferrer"&gt;Software observability: it's not about sunk costs, it's about lost profit&lt;/a&gt;, investing in a robust software observability system is not just a technical necessity, but also a strategic move that directly impacts your business's profitability. &lt;/p&gt;

&lt;p&gt;Building an effective and comprehensive observability system requires a well-chosen combination of numerous instruments that provide real-time insights into the user experience, health, behavior, and security of software applications. It also includes establishing routines that prepare and empower teams to act swiftly and precisely in case of an emergency. The entire process demands significant expertise and dedication, as well as recurring revisions and updates of every component. To better understand the required prerequisites and accompanying efforts, let’s start by separating them into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organizational&lt;/strong&gt;, which covers prerequisites concerning people and processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical&lt;/strong&gt;, which includes prerequisites related to tools and technology.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/6zGxfBguPx4I6FoNLIYzTl/beb634ea91447e15acd01f60424bca92/Prerequisites_for_building...__1_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/6zGxfBguPx4I6FoNLIYzTl/beb634ea91447e15acd01f60424bca92/Prerequisites_for_building...__1_.png" alt="Table Prerequisites for building"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s look closer into each component and their associated costs.&lt;/p&gt;

&lt;h2&gt;Organizational&lt;/h2&gt;

&lt;h3&gt;Cross-Functional Collaboration&lt;/h3&gt;

&lt;p&gt;It is not uncommon for a company to invest in a state-of-the-art observability solution and hire a top-notch DevOps team, and nevertheless fail to mitigate a failure of its IT system due to a lack of collaboration between development, operations, compliance, product, and other functions essential to observability. Ongoing coordination between the various stakeholders and delivery teams ensures that none of the essential requirements or threats is overlooked and that teams can respond to issues effectively. Creating such a cooperative environment takes time and resources spent on fostering collaboration, potentially requiring cultural changes within the organization.&lt;/p&gt;

&lt;h3&gt;Skilled Personnel&lt;/h3&gt;

&lt;p&gt;No matter how much data a company collects, it’s useless without skilled personnel who are capable of making sense of it and acting on the extracted insights. Organizations need skilled personnel who understand observability concepts, can analyze telemetry data, and effectively use monitoring tools. A major prerequisite for building this competency is investing in training programs, hiring and retaining skilled personnel responsible for implementing and maintaining the observability system, as well as potentially outsourcing expertise. &lt;/p&gt;

&lt;h3&gt;Documentation and Knowledge Sharing&lt;/h3&gt;

&lt;p&gt;The bigger a company grows, the more challenging it is to build centralized logging and enable distributed monitoring in a consistent and cooperative manner. Many companies struggle to break up knowledge silos. These arise naturally when teams lack proper guidelines and fail to share knowledge within and between teams. Clear documentation of observability practices is critical for maintaining a consistent and effective observability strategy. This, however, incurs time and effort spent on documentation and knowledge-sharing sessions, as well as potential costs associated with maintaining documentation platforms.&lt;/p&gt;

&lt;h3&gt;Routines for Incident Management and Prevention&lt;/h3&gt;

&lt;p&gt;While frequently omitted when discussing observability, the ability to remediate emerging issues quickly, with minimal damage and maximal learning, is an essential function that reflects the efficiency of the entire observability system. It includes a set of measures related to establishing routines and coordination, documenting and enforcing working procedures and runbooks, and so on, and it incurs costs related to preparation, training and coordination, implementation of correct authorization, and high transparency.&lt;/p&gt;

&lt;h3&gt;Security Measures&lt;/h3&gt;

&lt;p&gt;When talking about security measures in the context of observability, we mean proactive forensic activities that adhere to security best practices, facilitate early action on dangerous human errors and socially engineered threats, and guarantee an adequate reaction to security issues. These measures require associated investments and depend on permanent internal reassessment, the help of specialized consultants, and the training of everyone interacting with the software.&lt;/p&gt;

&lt;h2&gt;Technical&lt;/h2&gt;

&lt;h3&gt;Instrumentation&lt;/h3&gt;

&lt;p&gt;The software applications and infrastructure need to be equipped to emit relevant telemetry data. This includes metrics, logs, and traces that provide insights into the application's behavior and performance. The implementation cost involves the development effort to instrument the code, integration with existing systems, and potential impact on application performance.&lt;/p&gt;
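As a small illustration of what emitting telemetry can mean in practice, an application might produce structured metric records like the following (the field names and sink are illustrative, not a specific vendor format):

```python
import json
import time

def emit_metric(name, value, unit, sink):
    """Serialize one telemetry record and hand it to a sink
    (in production: stdout, an agent, or a collector endpoint)."""
    sink.append(json.dumps({
        "timestamp": time.time(),
        "metric": name,
        "value": value,
        "unit": unit,
    }))

records = []
emit_metric("checkout.latency", 123.4, "ms", records)
print(json.loads(records[0])["metric"])  # checkout.latency
```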

&lt;h3&gt;Centralized Logging and Storage&lt;/h3&gt;

&lt;p&gt;A centralized logging system is essential for collecting and storing logs from various components. Similarly, a scalable storage solution is needed for storing the vast amount of telemetry data generated. Associated costs include implementing and maintaining a robust logging infrastructure, storage costs, and potential expenses related to data retention policies.&lt;/p&gt;

&lt;h3&gt;Monitoring Tools and Platforms&lt;/h3&gt;

&lt;p&gt;The use of monitoring tools and platforms is crucial for visualizing and analyzing telemetry data. These tools should support real-time monitoring, anomaly detection, and alerting capabilities.&lt;/p&gt;

&lt;p&gt;Costwise, they depend on licensing fees, subscription costs, and potential costs associated with training teams on the selected monitoring tools and any third-party services used for observability.&lt;/p&gt;

&lt;h3&gt;Distributed Tracing&lt;/h3&gt;

&lt;p&gt;For microservice architectures, distributed tracing is essential for monitoring requests as they traverse the various services. This requires integrating with each service to capture trace information and logs. Implementing tracing involves development effort for the integration, a potential impact on system performance, and costs associated with specialized tracing tools.&lt;/p&gt;
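To make the integration point concrete, here is a minimal, framework-free sketch of trace-context propagation between two services (the header name and handler functions are illustrative; real systems would follow a standard such as W3C Trace Context or use OpenTelemetry):

```python
import uuid

# Each request carries a trace id; every service records its work
# against that id so the request can be followed end to end.
def handle_frontend(headers, log):
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    log.append((trace_id, "frontend", "received request"))
    # propagate the same trace id to the downstream service
    handle_backend({"x-trace-id": trace_id}, log)
    return trace_id

def handle_backend(headers, log):
    log.append((headers["x-trace-id"], "backend", "query executed"))

log = []
tid = handle_frontend({}, log)
# both recorded spans share one trace id
print(all(entry[0] == tid for entry in log))  # True
```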

&lt;h3&gt;Security Measures&lt;/h3&gt;

&lt;p&gt;Observability systems must adhere to security best practices to protect sensitive telemetry data. This involves implementing encryption, access controls, and secure communication channels.&lt;/p&gt;

&lt;p&gt;Costs related to implementing and maintaining security measures can be high. There is also a potential impact on system performance, and ongoing effort is required to stay compliant with security standards.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;While investing in an efficient observability system is crucial for business success, it's essential to consider the technical and organizational prerequisites, along with the associated costs of establishing such a system and running it. The system must not only be technically robust but also yield maximum return on investment, while aligning with business objectives and adhering to resource constraints. Putting all facets of an observability system together is a complex undertaking that demands in-depth domain knowledge and extensive experience in the field, and it’s not always possible or worthwhile to spend time and money building in-house expertise from scratch, especially if observability is not a feature that you are going to sell to your customers.&lt;br&gt;
In such cases, it is worth involving kreuzwerker as a consulting partner that will help you build a robust technical foundation and efficient business processes to guarantee that your core software systems never remain unattended. Our engineers and consultants have solid experience in creating, refactoring, and reviewing observability systems to ensure their efficiency, robustness, customer satisfaction, and adherence to state-of-the-art best practices.&lt;/p&gt;

&lt;h2&gt;Learn more about software observability:&lt;/h2&gt;

&lt;p&gt;Post 1: &lt;a href="https://dev.to/kreuzwerker-/software-observability-its-not-about-sunk-costs-its-about-lost-profit-3dk5"&gt;Software observability: it's not about sunk costs, it's about lost profit&lt;/a&gt;&lt;br&gt;
Post 3: Coming soon&lt;/p&gt;

</description>
      <category>devops</category>
      <category>microservices</category>
      <category>cloud</category>
      <category>observability</category>
    </item>
    <item>
      <title>How cloud native are you?</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Tue, 06 Feb 2024 14:02:18 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/how-cloud-native-are-you-4ce6</link>
      <guid>https://forem.com/pcg_public_cloud_group/how-cloud-native-are-you-4ce6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Cloud Native Maturity Index captures your score.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Jean-Pierre Damen&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kreuzwerker has developed a methodology to assess the cloud native maturity of businesses with substantial operations in the AWS Cloud. This methodology identifies gaps in cloud nativeness and determines if the right preconditions for full cloud native adoption are present. The result is a cloud native Score, subdivided into categories and measured against business values. This is combined with an extensive report that includes recommendations, prioritized actions, and KPIs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This methodology evaluates the level of cloud native maturity of existing cloud operations, including both cultural components and processes that are essential for becoming cloud native. The goal is to quantify and capture the current maturity through a cumulative score. This score serves as a baseline for future assessments to track progress or for comparison against other companies. More details on the scoring system will be provided later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Cloud Native?&lt;/strong&gt;&lt;br&gt;
The main idea behind being cloud native is to leverage the benefits of the cloud optimally, aligning them with business values and processes. This alignment translates into tangible solutions and improvements. Being cloud native enables a company to be more agile, faster, flexible, and innovative. It leads to shorter time-to-market, data-driven decision-making, and very lean, self-organizing processes.&lt;/p&gt;

&lt;p&gt;Our methodology therefore offers a comprehensive view of cloud native maturity, focusing not only on technology but also on people, culture, product management, teams, and processes. This includes aspects such as decision-making, experimentation, collaboration, education, and knowledge sharing. But before we proceed, let's first define what 'cloud native' actually means. While there are many definitions available, we have distilled it to the following core concept:&lt;/p&gt;

&lt;h1&gt;Focusing on &lt;strong&gt;self-enabling&lt;/strong&gt; teams through &lt;strong&gt;continuous automated&lt;/strong&gt; delivery and testing with &lt;strong&gt;reduced&lt;/strong&gt; operational overhead.&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;br&gt;
The above is a simple definition, yet it carries many implications. This seemingly straightforward statement can trigger a snowball effect, leading to numerous implicit consequences. The extent to which these consequences manifest within an organization defines its cloud native maturity. Below, I present a condensed overview of these consequences. But be warned – this explanation is not for the faint of heart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Snowball Effect&lt;/strong&gt;&lt;br&gt;
The brief statement “Focusing on &lt;strong&gt;self-enabling&lt;/strong&gt; teams through &lt;strong&gt;continuous automated&lt;/strong&gt; delivery and testing with &lt;strong&gt;reduced&lt;/strong&gt; operational overhead“ carries significant implicit implications. Firstly, continuous delivery by multiple teams suggests that codependent workloads can be developed and deployed independently. This implies that a fault-tolerant delivery for codependent workloads is ensured through continuous testing, enabling teams to fully own the process of building and running their workload in productive environments. In turn, this implies that operational overhead is maximally reduced through automated processes. Continuous delivery also indicates that changes are rolled out in small batches to streamline the automated delivery and testing process. This approach means that testing becomes foundational to development, as code cannot be accepted without it. The emphasis on small batches of changes suggests that larger changes should be de-risked by first prototyping them and by gathering real-world metrics before full implementation. The practice of prototyping and comparing different options and outcomes hints at a culture of innovation, which in turn implies support for this culture at the organizational level.&lt;/p&gt;

&lt;p&gt;Additionally, fully automated pipelines imply that in cloud-native development, infrastructure dependencies are covered or included in the development environment. This also means that the propagation to other environments, including production, takes these infrastructural dependencies into account. Ideally, infrastructural updates can be triggered from the application pipelines, necessitating the organization of infrastructure code into clear verticals to facilitate this process. For reduced overhead, services requiring minimal or no additional provisioning or configuration are preferred, hence managed services or serverless architecture should prevail. Additionally, this suggests that there are infrastructure pipelines in place to automate the rollouts of infrastructure components. This also implies that testing is in place to ensure these rollouts work as expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Relevance&lt;/strong&gt;&lt;br&gt;
We could certainly expand on this topic for quite some time, but this simply demonstrates that our streamlined definition implicitly encompasses the entire spectrum of cloud nativeness. It's important to note, however, that as with any all-encompassing definition, this represents an idealized, utopian situation. Real-world scenarios invariably involve trade-offs, dictated by practical limitations such as time, financial resources, and human capital.&lt;/p&gt;

&lt;p&gt;This is why the methodology not only identifies gaps, but also allows for a partial focus on the most business-relevant topics. By mapping the scores to business values and prioritizing these values, it becomes easier to target specific areas for attention. This approach means that certain topics can be parked or even redefined according to specific business needs. Take, for instance, the cloud-native paradigm of using microservices. Not every company or solution benefits from a microservice architecture, especially if the effort is too great for a feasible payoff. In such scenarios, it does not make sense to place too much emphasis on this particular architectural decision, as it does not align with the business value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Methodology&lt;/strong&gt;&lt;br&gt;
We have defined ten design principles that contribute equally to the cloud native maturity of an organization. As previously mentioned, our challenge was to encompass not only technology topics but also operations, processes, and less tangible aspects such as culture and collaboration. The latter are vital preconditions for establishing cloud native maturity. For an overview of the ten pillars, refer to figure 1.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/4773yUBKtr3dlmS9p09nNq/fe7d37d6b7a48a83a2f1f1093588d23d/Graphic_Blogpost__10_Cloud_native_maturity___4_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/4773yUBKtr3dlmS9p09nNq/fe7d37d6b7a48a83a2f1f1093588d23d/Graphic_Blogpost__10_Cloud_native_maturity___4_.png" alt="10 principles"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another challenge was to derive quantified results by generating an overall index score. A crucial aspect of this maturity assessment is enabling organizations to establish a baseline for comparison. This index can be used to take interval snapshots to track measurable progress.&lt;/p&gt;

&lt;p&gt;To achieve this, each design principle was divided into paradigms, and each paradigm further into a set of best practices. The assessment is conducted through a series of interviews with a selection of stakeholders from all organizational ranks, primarily targeting those in roles with active and substantial expert knowledge.&lt;/p&gt;

&lt;p&gt;The scoring is performed by presenting the best practices as propositions. Participants are then asked to rate, on a scale, the extent to which each proposition applies to their situation. Prior to the assessment, a careful selection of participants is advised, so that all topics are adequately covered despite differing levels of exposure; this avoids contaminating the score with uninformed responses.&lt;/p&gt;

&lt;p&gt;The scores of the best practices can then be aggregated into paradigm scores, the paradigm scores into pillar scores, and the pillar scores into the overall index score.&lt;/p&gt;

&lt;p&gt;Additionally, extra weight can be assigned to individual best practices and paradigms. This is achieved by applying a multiplier, which gives certain best-practice scores more weight than others within the same paradigm, or between paradigms within the same pillar. The process starts by defining business values that can be tagged to best practices and paradigms; the tagging system then automatically applies multipliers based on these prioritized business values. By emphasizing these business values, an alternative perspective can also be derived from the index, offering insights not just per individual pillar, but across multiple pillars.&lt;/p&gt;
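&lt;p&gt;As a rough illustration of this roll-up (the pillar and paradigm names, the 0-5 scale, and the weights below are hypothetical examples, not our actual model), the aggregation from weighted best-practice ratings to an overall index score could be sketched as:&lt;/p&gt;

```python
# Hypothetical sketch of the score roll-up: best practices to paradigms,
# paradigms to pillars, pillars to the overall index. Weights play the role
# of the multipliers described above; all names and numbers are illustrative.

def weighted_mean(scores_and_weights):
    """Average a list of (score, weight) pairs."""
    total = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total

# pillar name, then paradigm name, then (best-practice score 0-5, weight)
assessment = {
    "Automation": {
        "CI/CD": [(4, 1.0), (3, 2.0)],         # second practice is tagged with
        "Infrastructure as Code": [(5, 1.0)],  # a prioritized business value
    },
    "Architecture": {
        "Microservices": [(2, 1.0), (2, 1.0)],
    },
}

paradigm_scores = {
    pillar: {paradigm: weighted_mean(practices)
             for paradigm, practices in paradigms.items()}
    for pillar, paradigms in assessment.items()
}
pillar_scores = {
    pillar: weighted_mean([(s, 1.0) for s in scores.values()])
    for pillar, scores in paradigm_scores.items()
}
index_score = weighted_mean([(s, 1.0) for s in pillar_scores.values()])
```

&lt;p&gt;Business-value views fall out of the same data: instead of grouping by pillar, one can group the tagged best practices by business value and aggregate along that axis.&lt;/p&gt;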

&lt;p&gt;See an example in figure 2.&lt;br&gt;
&lt;a href="//images.ctfassets.net/hqu2g0tau160/1XV6ZrJJ7KGWfXSotnXVAG/5e170ba531d14a900400023e34b40691/Screenshot_2024-01-30_at_14.33.00__2_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/1XV6ZrJJ7KGWfXSotnXVAG/5e170ba531d14a900400023e34b40691/Screenshot_2024-01-30_at_14.33.00__2_.png" alt="Screenshot 2024-01-30 at 14.33.00 (2)"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Business values should be derived from the organization’s mission and can be tailored to specific needs. Example business values that were used in the past are: Innovation and Experiment, Speed To Market &lt;em&gt;(STM)&lt;/em&gt; , Collaboration and Productivity &lt;em&gt;(CAP)&lt;/em&gt;, Competitive Advantage, High Availability and Reliability, Global Reach, Security and Compliance, Reduced Time To Recovery, Reduced Operational Overhead, Resource Optimization, etcetera.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The kreuzwerker methodology for assessing cloud native maturity offers a holistic approach, encompassing technology, operations, culture, and more. With ten design pillars, detailed through paradigms and best practices, this method provides a structured, yet flexible framework for evaluation. The scoring system, adaptable with weighted factors based on business values, not only establishes a baseline, but also tracks progress over time. Acknowledging the aspirational nature of cloud nativeness, our methodology realistically addresses the trade-offs inherent in the cloud-native environment. It's a dynamic tool that guides organizations in understanding and enhancing their cloud capabilities, aligning with both current and future business and technological landscapes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For a full demonstration or additional information please contact us:&lt;/em&gt; &lt;a href="mailto:aws@kreuzwerker.de"&gt;aws@kreuzwerker.de&lt;/a&gt; &lt;/p&gt;




&lt;p&gt;Tags: AWS, Microservices, CloudNative, Maturity, Methodology, Automation, Infrastructure, BusinessValue, Evaluation&lt;/p&gt;

</description>
      <category>aws</category>
      <category>microservices</category>
      <category>cloudnative</category>
      <category>maturity</category>
    </item>
    <item>
      <title>Software observability: it's not about sunk costs, it's about lost profit</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Mon, 22 Jan 2024 10:04:32 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/software-observability-its-not-about-sunk-costs-its-about-lost-profit-3dk5</link>
      <guid>https://forem.com/pcg_public_cloud_group/software-observability-its-not-about-sunk-costs-its-about-lost-profit-3dk5</guid>
      <description>&lt;p&gt;&lt;strong&gt;How software observability contributes to the overall business performance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Serhii Hainulin&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Nowadays, it’s hard to imagine a business that is not dependent on IT. Be it a simple landing page or a complex ERP system, software has become a cornerstone of daily operations and very often the major business driver. It is not so rare, though, for managers to treat applications as a fixed asset: a one-time investment that works for years and requires only occasional maintenance and upgrades. While that may be true for some basic and static tools, such a simplistic understanding creates a lot of frustration and problems for business managers and software operators alike.&lt;/p&gt;

&lt;p&gt;There are several reasons for this. First, software is volatile by nature. It is made up of loosely coupled, highly specialized components, which normally run on hardware that is difficult to standardize. Even extremely good applications are prone to occasional random failures and performance degradation. Second, it’s rare nowadays to use just one application; more often we rely on a complex software system consisting of dozens of applications that are not always 100% interoperable. Third, there is always room for an overlooked human error or edge case. Fourth, in the infamous price-time-quality triangle, the latter is usually sacrificed. And add to this malicious actors, network disruptions, and all the rest of it.&lt;/p&gt;

&lt;p&gt;Over the years the industry has developed various approaches to mitigate these factors, but none is a “silver bullet” that can guarantee the software will work flawlessly 100% of the time. And every failure causes downtime, disrupted operations, “firefighting”, and eventually lost profit and reputation that can be far more expensive than the direct costs of fixing the problem.&lt;br&gt;
Although we cannot avoid software failures completely, there is a way to predict them and minimize the downtime and remediation costs. To do so we need full visibility into what is happening in our system, as well as the readiness to act swiftly and precisely in an emergency. All tools and activities related to controlling software system health, as well as responding to failures and threats, are combined under the common notion of “observability”.&lt;/p&gt;

&lt;p&gt;Software observability, from the business perspective, refers to the ability of an organization to gain insights into the performance, health, and behavior of its software systems and act on those insights in a timely and efficient manner. The collection, analysis, and visualization of data from the various software and hardware stack components ensures that the system operates smoothly and meets business objectives.&lt;br&gt;
Due to its paramount importance, observability involves a very diverse set of measures and activities that permeate and contribute to every aspect of business operations. When performed correctly and maintained on a systematic basis, it often requires substantial and regular investments of time and money; but these expenses are offset by a better user and employee experience, optimized operational costs, reduced emergency costs, less downtime and lost profit, and a stable reputation as a reliable partner. To illustrate this, let’s look at the contribution of several individual constituents of observability to overall business performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Ensure optimal performance of software applications.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Improved user experience, customer satisfaction, and efficient use of resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Tracking
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Identify and resolve software errors and issues.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Minimize downtime, reduce support costs, and maintain a positive brand image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Analyze logs for troubleshooting, security, and compliance.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Enhance security, meet regulatory requirements, and streamline issue resolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed Tracing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Trace transactions and requests across distributed systems.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Improved understanding of system interactions, faster problem resolution, and enhanced reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics and KPIs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Track key metrics and Key Performance Indicators (KPIs) for business goals.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Informed decision-making, resource optimization, and alignment with strategic objectives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Monitor software in real-time to detect and respond to issues promptly.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Minimize downtime, improve service availability, and maintain a competitive edge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability and Capacity Planning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Plan for future growth and ensure the scalability of software systems.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Cost-effective resource allocation, efficient scaling, and better capacity utilization.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Experience Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Monitor and optimize the end-user experience.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Increased user satisfaction, retention, and positive brand perception.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Optimize resource usage to control infrastructure costs.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Efficient use of resources, reduced operational expenses, and improved profitability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictive Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Anticipate and mitigate potential issues before they impact the business.&lt;br&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; Proactive problem resolution, improved reliability, and enhanced overall performance.&lt;/p&gt;

&lt;p&gt;Often overlooked, and carrying the reputation of a secondary cost center with fuzzy justification, observability actually has the potential to become a huge competitive advantage when done right. It directly impacts profitability at all levels of business operations and is a crucial aspect for modern businesses, enabling them to maintain reliable, high-performing software systems that align with strategic goals, enhance user satisfaction, and contribute to overall business success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more about software observability:
&lt;/h2&gt;

&lt;p&gt;Post 2: &lt;a href="https://dev.to/kreuzwerker-/prerequisites-for-building-an-efficient-observability-system-3m14"&gt;Prerequisites for building an efficient observability system&lt;/a&gt;&lt;br&gt;
Post 3: Coming soon&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>monitoring</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Vulnerability Scanning Solution</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Thu, 18 Jan 2024 13:22:55 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/vulnerability-scanning-solution-36n4</link>
      <guid>https://forem.com/pcg_public_cloud_group/vulnerability-scanning-solution-36n4</guid>
      <description>&lt;p&gt;&lt;strong&gt;What approach do we employ to detect and mitigate system and network security vulnerabilities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Egi Taga&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customers are affected by availability and security vulnerability issues regarding their applications and infrastructure. &lt;br&gt;
This blog describes how we at kreuzwerker combine scanning, monitoring and communication tools to identify and mitigate security findings in order to avoid exploits, compromised systems and unintended network exposure, which could potentially cause extensive financial and business damage to our customers.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;During a well-architected framework review (WAFR), findings of different severities were identified, prompting the customer to reach out to us to shape their security journey. Security is now more essential than ever: over the last few years, global attacks have increased exponentially, causing huge costs for the affected companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our approach
&lt;/h2&gt;

&lt;p&gt;Our first step was to combine Amazon Inspector and Security Hub to provide a stable and continuous vulnerability scanning solution. In the second stage, we designed the notification workflows to send warning and critical alerts to our Slack channel and to our monitoring platform, Datadog, respectively. This ensures centralized monitoring of all infrastructure components, efficient analysis, and quick reaction and mitigation.&lt;/p&gt;
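&lt;p&gt;As a rough sketch of the routing step in such a workflow (the severity thresholds, channel names, and handler shape are illustrative choices, not our production setup), a small function consuming Security Hub findings delivered via EventBridge might look like this:&lt;/p&gt;

```python
# Illustrative sketch of a notification handler, e.g. a Lambda behind an
# EventBridge rule on "Security Hub Findings - Imported" events. Findings
# follow the AWS Security Finding Format (ASFF); channel names and the
# severity threshold below are made up. Actual delivery would POST each
# message to a Slack webhook or forward it to Datadog.

SEVERITY_CHANNEL = {
    "CRITICAL": "#alerts-critical",
    "HIGH": "#alerts-critical",
    "MEDIUM": "#alerts-warning",
    "LOW": None,  # below the notification threshold
}

def route_findings(event):
    """Turn a Security Hub EventBridge event into Slack-style messages."""
    messages = []
    for finding in event.get("detail", {}).get("findings", []):
        label = finding.get("Severity", {}).get("Label", "INFORMATIONAL")
        channel = SEVERITY_CHANNEL.get(label)
        if channel is None:
            continue  # ignore low-severity and informational findings
        messages.append({
            "channel": channel,
            "text": f"[{label}] {finding.get('Title', 'Unknown finding')}",
        })
    return messages
```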

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/7L2WceWvBePhOmSAHPP4Fc/5fb74c3b06537504de09c52d81473396/combine_Amazon_Inspector_and_Security_Hub.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/7L2WceWvBePhOmSAHPP4Fc/5fb74c3b06537504de09c52d81473396/combine_Amazon_Inspector_and_Security_Hub.png" alt="Vulnerability Scanning Solution - combine Amazon Inspector and Security Hub"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon Inspector automatically discovers workloads such as Amazon EC2 instances, containers, and Lambda functions, and scans them for software vulnerabilities and unintended network exposure, for instance an HTTP service on TCP port 80 open to the world or reachable via an AWS Internet Gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/1AqvD9BAsKmhERreVqgx7Y/d56f68b6e0ce259d67e809dbacdc764f/HTTP_service_on_TCP_port_80_open_to_the_world_or_AWS_Internet_Gateway.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/1AqvD9BAsKmhERreVqgx7Y/d56f68b6e0ce259d67e809dbacdc764f/HTTP_service_on_TCP_port_80_open_to_the_world_or_AWS_Internet_Gateway.png" alt="Vulnerability Scanning Solution - HTTP service on TCP port 80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our daily approach to each finding starts with analyzing the infrastructure and application impact. Each vulnerability is already scored by the global authorities, but this doesn’t take into consideration the network and application restriction rules. Therefore, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluate the security and compliance of our customer's AWS infrastructure&lt;/li&gt;
&lt;li&gt;Help the customer to have a clear understanding of the impact the findings have in their environment&lt;/li&gt;
&lt;li&gt;Advise what actions to take&lt;/li&gt;
&lt;li&gt;Mitigate if a fix is not yet available&lt;/li&gt;
&lt;li&gt;Remediate vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the Inspector service is enabled, every newly created EC2 instance is discovered and scanned automatically. Network reachability scans, in turn, run automatically every 24 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example case 1
&lt;/h2&gt;

&lt;p&gt;For instance, &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2023-4863" rel="noopener noreferrer"&gt;CVE-2023-4863&lt;/a&gt; has recently been identified on Windows instances. More precisely, a heap buffer overflow in libwebp in Google Chrome prior to 116.0.5845.187, and in libwebp prior to 1.3.2, allows a remote attacker to perform an out-of-bounds memory write via a crafted HTML page (Chromium security severity: Critical).&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/7jhgHde4N7QKujQJQUSj5U/895a7c7b47bcc306ca6bfaa7ce32f79a/CVE-2023-4863.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/7jhgHde4N7QKujQJQUSj5U/895a7c7b47bcc306ca6bfaa7ce32f79a/CVE-2023-4863.png" alt="Vulnerability Scanning Solution - CVE-2023-4863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;Since the implementation phase, we have employed SSM to define patch baselines, apply the latest patches automatically, and schedule maintenance windows based on the customer's requirements. This ensures automated OS updates and remediation.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/4ZXqIoxQqrhtDJoRl8wuJv/328e06a7f2593652a2ec76fc9ef76036/SSM_patch_baselines.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/4ZXqIoxQqrhtDJoRl8wuJv/328e06a7f2593652a2ec76fc9ef76036/SSM_patch_baselines.png" alt="Vulnerability Scanning Solution- SSM patch baselines"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example case 2
&lt;/h2&gt;

&lt;p&gt;Apart from CVE findings, network scans are triggered regularly to identify overly permissive rules. For instance, we have been warned in Slack about unrestricted access to high-risk ports. This control checks whether security groups allow unrestricted incoming traffic to the ports that carry the highest risk; it fails if any rule in a security group allows ingress traffic from 0.0.0.0/0 or ::/0 to those ports. To optimally protect the infrastructure from network exploits, we advised the customer to remove such permissive rules from the security group.&lt;/p&gt;
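&lt;p&gt;The logic behind such a control can be sketched as a small function over security group ingress rules. The high-risk port list and the rule shape, modelled loosely on EC2 DescribeSecurityGroups output, are illustrative only:&lt;/p&gt;

```python
# Minimal sketch of the check described above: flag security group ingress
# rules that allow 0.0.0.0/0 or ::/0 on high-risk ports. Port list and
# input shape are illustrative, not the exact Security Hub control.

HIGH_RISK_PORTS = {22, 3389, 3306, 5432}  # e.g. SSH, RDP, MySQL, PostgreSQL

def open_high_risk_rules(security_group):
    """Return (group id, offending ports) for world-open high-risk rules."""
    findings = []
    for rule in security_group.get("IpPermissions", []):
        from_port = rule.get("FromPort", 0)
        to_port = rule.get("ToPort", 65535)
        world_open = (
            any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
            or any(r.get("CidrIpv6") == "::/0" for r in rule.get("Ipv6Ranges", []))
        )
        if not world_open:
            continue
        hit = sorted(p for p in HIGH_RISK_PORTS
                     if p in range(from_port, to_port + 1))
        if hit:
            findings.append((security_group.get("GroupId"), hit))
    return findings
```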

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/5PsNk4wVRirbI5aN0qwKNS/0075d255ae51a5969b64e19cee4faaab/Security_hub_findings.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/5PsNk4wVRirbI5aN0qwKNS/0075d255ae51a5969b64e19cee4faaab/Security_hub_findings.png" alt="Vulnerability Scanning Solution - Security hub findings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/1oFAFfH9jOLLjeufqMh8YL/8624295a6904e2f96f43be0992bd4b36/Slack_controll_message.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/1oFAFfH9jOLLjeufqMh8YL/8624295a6904e2f96f43be0992bd4b36/Slack_controll_message.png" alt="Vulnerability Scanning Solution - Slack control message"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;kreuzwerker leverages the benefits of Amazon Inspector, Systems Manager, Datadog and Slack to detect, notify and remediate security vulnerabilities.&lt;/p&gt;

&lt;p&gt;We combine application security services with Terraform to provide versioning history and quick rollout and rollback. Each instance is automatically detected and scanned, and Inspector triggers rescans immediately after new packages are installed or new CVEs are published. Last but not least, this solution scans network reachability to identify any insecure port opened to the world or, for example, reachable from the Internet Gateway:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzwwraiq43h2m3718bq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzwwraiq43h2m3718bq0.png" alt="Image description" width="747" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>monitoring</category>
      <category>container</category>
      <category>confluence</category>
    </item>
    <item>
      <title>Datadog Resource Inventory</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Tue, 16 Jan 2024 14:19:40 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/datadog-resource-inventory-48o6</link>
      <guid>https://forem.com/pcg_public_cloud_group/datadog-resource-inventory-48o6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Security best practices should apply to all resources deployed in an AWS account.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Saleh Parsa&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting a complete overview of all existing and newly created resources, analyzing their security posture in real time and acting timely or even automated is vastly simplified using the Datadog resource inventory feature.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving landscape of cloud computing, organizations face several common challenges in managing their AWS (Amazon Web Services) resource inventory. Many grapple with limited visibility into their AWS resources due to the fragmented nature of their infrastructure, and the absence of a consolidated user interface (UI) for the resource inventory hampers efficient access to vital information. Resource tracking, which involves categorizing resources by type, region, and account, becomes cumbersome and contributes to inefficiencies in resource management and oversight. Organizations also often lack an accessible way to view key configuration attributes for each resource, impeding effective monitoring and management. Initial resource discovery within the AWS environment poses its own challenge, requiring mechanisms such as the AWS describe APIs to build an accurate inventory. Finally, keeping the inventory up to date is a continual task: new, updated, or terminated resources must be identified promptly for real-time accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;To address the common challenges faced by organizations, Datadog's Cloud Security Management provides a comprehensive solution with the following capabilities. First and foremost, it offers a consolidated user interface (UI) that efficiently displays AWS resource inventory information. This UI allows users to access and organize data by resource type, region, and account, providing a centralized platform to view critical configuration attributes. Additionally, the solution tackles the challenge of resource discovery by leveraging AWS describe APIs, facilitating an initial exploration of resources within the customer's AWS environment and populating the resource inventory. To maintain an up-to-date inventory, Datadog's solution utilizes an event-driven framework. This framework continuously monitors and identifies new, updated, or terminated resources, ensuring prompt updates to the resource inventory for effective and real-time management.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;By implementing integration between Datadog and AWS and enabling Datadog Cloud Security Management, organizations can achieve the following outcomes:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Efficient Resource Management
&lt;/h3&gt;

&lt;p&gt;The consolidated UI allows organizations to access and manage AWS resources more efficiently. They can view inventory information based on resource type, region, and account in one place.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/4aIOjRyYBJYiV0TVbdK9og/b8a2fa06c7861fc5287514aaee011d7f/Datadog_Resource_Inventory_fig_1.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/4aIOjRyYBJYiV0TVbdK9og/b8a2fa06c7861fc5287514aaee011d7f/Datadog_Resource_Inventory_fig_1.png" alt="Datadog Resource Inventory fig 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/6rczGiMopbqjykI7rWzx0T/f231eb9d3e9d8d12952b011b33b06f0f/Datadog_Resource_Inventory_fig_2.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/6rczGiMopbqjykI7rWzx0T/f231eb9d3e9d8d12952b011b33b06f0f/Datadog_Resource_Inventory_fig_2.png" alt="Datadog Resource Inventory fig 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Resource Visibility
&lt;/h3&gt;

&lt;p&gt;Users can readily access key configuration attributes and tags for each resource, which enhances resource visibility and aids in effective monitoring and management.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/39X42njMVke6lnx2YSbxzd/3d1f25e8dbf8d8b192cd21929d6d7f00/Datadog_Resource_Inventory_fig_3.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/39X42njMVke6lnx2YSbxzd/3d1f25e8dbf8d8b192cd21929d6d7f00/Datadog_Resource_Inventory_fig_3.png" alt="Datadog Resource Inventory fig 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/5VTrhCYXb9SyMHz3u4CKDo/155bedf6b665c2a32a3047f2d7152174/Datadog_Resource_Inventory_fig_4.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/5VTrhCYXb9SyMHz3u4CKDo/155bedf6b665c2a32a3047f2d7152174/Datadog_Resource_Inventory_fig_4.png" alt="Datadog Resource Inventory fig 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Resource Discovery
&lt;/h3&gt;

&lt;p&gt;The solution's initial resource discovery capability ensures that the inventory begins with an accurate representation of AWS resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/6jQ15hE6oqDFxpkC5cvIh8/f64d224e85f4f5061066eb5ee7559ffb/Datadog_Resource_Inventory_fig_5.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/6jQ15hE6oqDFxpkC5cvIh8/f64d224e85f4f5061066eb5ee7559ffb/Datadog_Resource_Inventory_fig_5.png" alt="Datadog Resource Inventory fig 5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Real-time Updates
&lt;/h3&gt;

&lt;p&gt;The event-driven framework ensures that the inventory remains up to date by identifying and reflecting changes to resources in a timely manner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, this case study highlights the value of using Datadog Cloud Security Management that addresses common challenges in AWS resource inventory management. By providing a consolidated UI, resource discovery through AWS APIs, and real-time updates, organizations can efficiently manage their AWS resources and enhance visibility, ultimately leading to better resource governance and control.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>cloud</category>
      <category>datadog</category>
    </item>
    <item>
      <title>How-To: Use IAM credentials as verifiable tokens</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Thu, 11 Jan 2024 16:52:27 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/how-to-use-iam-credentials-as-verifiable-tokens-ic7</link>
      <guid>https://forem.com/pcg_public_cloud_group/how-to-use-iam-credentials-as-verifiable-tokens-ic7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Utilizing Pre-Signed Calls to GetCallerIdentity as Authentication Tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Amanuel Mekonnen&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you're seeking a method to authenticate or authorize requests to your software system from clients with existing AWS credentials, you can employ pre-signed URLs as secure tokens for the &lt;strong&gt;GetCallerIdentity&lt;/strong&gt;  operation. Then you can query AWS using the pre-signed URL to retrieve details about the requester's identity on behalf of the signatory. The pre-signed URL is temporary and is only valid for the specific request, ensuring the safety of the client's credentials while securely sharing their IAM identity. AWS is responsible for validating the credentials, a process we trust them to handle accurately. If you didn’t get the TL;DR, please don’t be discouraged. In this blog, I would like to show you how to build your authentication mechanism that reliably verifies IAM identities. I will also get into the details of how it works, including what HMAC authentication is and how AWS SigV4 utilizes it. Last but not least, I will discuss how we can leverage pre-signed URLs as a proxy for an identity token. Let’s get to it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: We are largely talking about stateless requests. Traditional applications will use session-ids to carry authentication over across multiple requests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;It is a bright spring day, and you are taking a walk in a park. You come up to a beautiful pond; the water is clear, and you can see the fish swimming, their figures hovering over the rocky bottom. The rays of light marbling the shiny stones through the waves pull your thoughts to your most ambitious software yet. It has been months in the making, and everything looks perfect. You've created a shiny new TCP protocol in a shiny new data center, like the stones in the pond below. All of a sudden, it occurs to you that all your clients are running on AWS. You wave the thought away: "I don't need to build a lot of tooling for security; AWS will take care of that for me." You know deep down that you don't want your clients to go through the hassle of managing credentials; they should somehow be able to use their IAM credentials to identify themselves. But how?&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/5tvLXlpeYGixkvhfezdqH0/70f893c96d1f126a955e14455ac1e46c/login_with_iam.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/5tvLXlpeYGixkvhfezdqH0/70f893c96d1f126a955e14455ac1e46c/login_with_iam.png" alt="Diagram 1"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram 1: Problem statement, how to convert IAM credential to a token&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;AWS has become ubiquitous these days. Even when applications do not run entirely in AWS, some workloads depend on it, whether the application is using S3 as storage or utilizing one of the many machine learning offerings from AWS. And because one needs credentials to access and manage AWS resources, IAM credentials are as ubiquitous as AWS itself. It makes sense to support login with IAM. If your software is running on AWS, say behind an API gateway, AWS might be able to help. If you are using a REST API on API Gateway in particular, you are taken care of. However, if your application is running on its HTTP counterpart, there are some gaps to fill. More specifically, cross-account IAM authentication on HTTP APIs is only supported through assuming roles.&lt;/p&gt;

&lt;p&gt;What I would like to help you solve today is generic. Simply put, I want you to be able to authenticate clients using their IAM credentials. What you need is to somehow convert their IAM credentials into a token, something that resembles a JWT. Clients would need to share their IAM identities without exposing their actual credentials. For obvious reasons, the client sharing their actual IAM credentials is out of the question. That would be plain dumb. The tokens should safely identify the client without exposing their secure information.&lt;/p&gt;

&lt;p&gt;On the other side, your software will need to validate these credentials in a trustworthy manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/1ljvwZxd432croaOCeuVt0/4aad16edfe9f77d132c9385430c82d1e/identity.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/1ljvwZxd432croaOCeuVt0/4aad16edfe9f77d132c9385430c82d1e/identity.png" alt="Diagram 2"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram 2: Identity federation flow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you look closely and add some imagination, the first diagram (Diagram 1) looks like identity federation. The diagram above (Diagram 2) shows how a typical flow works in federated identity provisioning. Once a user authenticates themselves to an identity provider, they get a token in exchange. This token can take different forms. One popular form is the aforementioned JWT. This token is able to identify a user. The service (your shiny new software) will then use mechanisms provided by the identity provider to validate the token. AWS Cognito or AWS Identity Center support utilizing these means to provision users with IAM identities. You can use your Active Directory, Google Account, or any other compatible identity provider to acquire IAM credentials. What you need now is the opposite use case. You would like to exchange IAM credentials for a verifiable token. This will effectively turn IAM into an identity provider.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;While there is plenty in the way of converting various identities (the likes of Google, Facebook, and Active Directory) into an IAM credential, AWS does not (yet) provide any official means to convert IAM credentials into a verifiable token. At least I couldn't find it. So, I went looking and found something that works very well. In fact, that is how MongoDB Atlas and HashiCorp Vault support authentication with AWS IAM. It boils down to pre-signed URLs and the &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_GetCallerIdentity.html" rel="noopener noreferrer"&gt;GetCallerIdentity&lt;/a&gt; API operation.&lt;/p&gt;

&lt;p&gt;I am sure everyone who has worked with the AWS CLI has used&lt;br&gt;
&lt;br&gt;
 &lt;code&gt;aws sts get-caller-identity&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 to either make sure the right credentials are set, or to check whether any credentials are available at all. The GetCallerIdentity endpoint returns the principal ARN (the assumed-role ARN or user ARN), the account, and the user ID of the caller. Here is an example response in JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "GetCallerIdentityResult": {
    "Arn": "arn:aws:sts::123456789:assumed-role/TestRole/amanuel.mekonnen@kreuzwerker.de",
    "UserId": "IDONTKNOWIFISHOULDSHARE:amanuel.mekonnen@kreuzwerker.de",
    "Account": "123456789"
  },
  "ResponseMetadata": {
    "RequestId": "&amp;lt;some-uuid&amp;gt;"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
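&lt;p&gt;The &lt;code&gt;Arn&lt;/code&gt; field in this response is what we will later base authorization decisions on. As a side note, here is a minimal sketch of pulling the account, role, and session name out of an assumed-role ARN; the function name and return shape are my own, not part of any AWS SDK.&lt;/p&gt;

```javascript
// Sketch: extracting the pieces of an assumed-role ARN as returned by
// GetCallerIdentity (field layout matches the JSON above).
function parseAssumedRoleArn(arn) {
  // arn:aws:sts::<account>:assumed-role/<RoleName>/<SessionName>
  const parts = arn.split(':assumed-role/');
  if (parts[1] === undefined) return null; // not an assumed-role ARN
  const resource = parts[1].split('/');
  return {
    account: parts[0].split(':')[4],
    roleName: resource[0],
    sessionName: resource[1],
  };
}

const parsed = parseAssumedRoleArn(
  'arn:aws:sts::123456789:assumed-role/TestRole/amanuel.mekonnen@kreuzwerker.de'
);
// parsed.account is '123456789' and parsed.roleName is 'TestRole'
```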



&lt;p&gt;When calling AWS's APIs, you must authenticate by signing the request using your IAM credentials. This is what AWS calls &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_aws-signing.html" rel="noopener noreferrer"&gt;AWS SigV4&lt;/a&gt;. The AWS CLI and AWS SDKs do this for you behind the scenes. AWS SigV4 is built on HMAC, Hash-based Message Authentication Codes. HMAC is an implementation of MAC with additional security measures.&lt;/p&gt;

&lt;p&gt;HMAC is based on creating a signature that achieves two goals: it acts as an integrity check as well as authentication. In the next section, I will give you a summary of AWS authentication and pre-signed URLs, along with some background on HMAC. If you already know about HMAC and pre-signed URLs, please skip right ahead to the Implementation section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  HMAC
&lt;/h3&gt;

&lt;p&gt;To make a point for HMAC, we first need to discuss some other methods of authentication. A very basic means to authenticate yourself when making a request is to use the appropriately named &lt;a href="https://datatracker.ietf.org/doc/html/rfc7617" rel="noopener noreferrer"&gt;Basic authentication scheme (RFC 7617)&lt;/a&gt;. You have a set of credentials, a user identifier (username, email, etc), and a secret shared key (password, API key, etc). In the basic authentication scheme, you simply share your credentials along with your request. This, however, is very insecure as the user identifier and secret are passed over the network in an unencrypted form.&lt;/p&gt;

&lt;p&gt;Assume the following:&lt;br&gt;
&lt;br&gt;
 &lt;code&gt;Client → “Hello”[Base64(username:password)] → Server.&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 In this interaction, if an attacker is able to get the request (by eavesdropping, DNS spoofing, or maybe from exposed logs on the server side), the attacker gains access to the plaintext password. This is clearly an issue, and it does not just apply to Basic authentication. API token authentication is also a variant of Basic authentication. Bearer tokens, while an important distinction, are vulnerable to the same weakness: secret information is shared in unencrypted form.&lt;/p&gt;
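&lt;p&gt;To see how little protection Basic authentication offers, here is a small sketch (the credentials are made up): the header is only Base64-encoded, so anyone who captures it can reverse it.&lt;/p&gt;

```javascript
// Sketch: why Basic authentication exposes the secret. The header is just
// Base64 encoding, not encryption.
const header = 'Basic ' + Buffer.from('alice:s3cret').toString('base64');

// An eavesdropper recovers the plaintext credentials trivially:
const decoded = Buffer.from(header.slice('Basic '.length), 'base64').toString('utf8');
// decoded is 'alice:s3cret' again
```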

&lt;p&gt;To avoid this most serious flaw of Basic authentication, the internet community defined the &lt;a href="https://datatracker.ietf.org/doc/html/rfc2617" rel="noopener noreferrer"&gt;Digest Access Authentication scheme (RFC 2617)&lt;/a&gt;. Digest authentication avoids passing the password in clear text. Instead, the client sends a checksum or digest (by default MD5) of the username, the password, a server-defined nonce value, the HTTP method, and the requested URI. The nonce is uniquely generated each time a 401 response is returned. While this is better than Basic, it is not entirely secure. The hash function generates a different opaque value for different nonces, request paths, and HTTP methods, which helps mitigate some playback attacks.&lt;br&gt;
&lt;a href="//images.ctfassets.net/hqu2g0tau160/34vzrPq05041QGqwuPXxxx/31aded2bb477bbf1c3e71450efd38e34/image-2023-11-9_13-28-41.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/34vzrPq05041QGqwuPXxxx/31aded2bb477bbf1c3e71450efd38e34/image-2023-11-9_13-28-41.png" alt="Diagram 3"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram 3: Digest Authentication Scheme exchanges&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Side note: A &lt;a href="https://en.wikipedia.org/wiki/Replay_attack" rel="noopener noreferrer"&gt;playback attack (replay attack)&lt;/a&gt; is an attack in which a malicious entity eavesdrops on the communication and sends the recorded authentication header as-is without the need to understand or modify the content.&lt;br&gt;
&lt;a href="//images.ctfassets.net/hqu2g0tau160/6vWQReHSWLJAiypPVEJ1Nz/c9bb770e9c5236123b9854dba596d9d2/Replay_attack_on_hash.svg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/6vWQReHSWLJAiypPVEJ1Nz/c9bb770e9c5236123b9854dba596d9d2/Replay_attack_on_hash.svg" alt="Diagram 4"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram 4: Playback attack Source: &lt;a href="https://en.wikipedia.org/wiki/Replay_attack" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While it is better than Basic, Digest authentication comes with a couple of issues. It is not stateless, as it requires the nonce to be server-generated. This is especially relevant for stateless HTTP APIs (commonly referred to as REST even when they are not RESTful), which would become much slower with the additional roundtrip. The other problem is that the digest remains the same across requests to the same HTTP method and request URI as long as the nonce from the server has not changed. Hence, there is a window of time in which a malicious attacker can use a recorded digest to send malicious messages. In other words, the digest cannot be used as an integrity check.&lt;/p&gt;

&lt;p&gt;The internet never sleeps, and thus we (as if I helped) came up with another way to make it more secure. A &lt;a href="https://www.ietf.org/rfc/rfc2104.txt" rel="noopener noreferrer"&gt;Message Authentication Code&lt;/a&gt; is a way to verify the integrity of messages sent over the wire. I will explain the more specific type of MAC, &lt;a href="https://en.wikipedia.org/wiki/HMAC" rel="noopener noreferrer"&gt;HMAC&lt;/a&gt;. HMAC adds to and re-organizes digest authentication in important ways.&lt;/p&gt;

&lt;p&gt;HMAC still requires pre-shared keys, as with the other mechanisms. It uses a similar technique to a digest to hide the password, in that the secret key (password) is only used in deriving the authentication code. However, instead of a server-generated nonce, the code is derived from the contents of the request itself. A typical HMAC function will use a hash function (SHA-256, for example) to generate a checksum/signature that covers the important parts of the request. It can look something like this:&lt;br&gt;
&lt;br&gt;
 &lt;code&gt;Hash(secret + Hash(secret + request_uri + post_data + http_method + timestamp))&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 The data that is part of the checksum needs to be pre-communicated. By including values that are unique and expire quickly (here: the timestamp), the checksum becomes a reliable integrity check.&lt;/p&gt;
&lt;h3&gt;
  
  
  How Pre-Signed Requests Work
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_aws-signing.html" rel="noopener noreferrer"&gt;AWS SigV4&lt;/a&gt; is an implementation of the HMAC authentication. As you probably already know you get AWS programmatic credentials in the form of an Access Key, Secret Access Key, and optionally for temporary credentials a Security Token. When sending a request to AWS’s APIs, you create a signature derived from the Endpoint specification, Action, Action parameters, Date, and Authentication information. If you are using the AWS SDK or the CLI, you will not need to do this yourself. This is happening behind the scenes. You can refer to the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/create-signed-request.html" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt; for more information on how the signature is calculated. &lt;/p&gt;

&lt;p&gt;The signature is made unique by a combination of parameters such as the HTTP verb, the query string, and the headers, along with the timestamp at which the signature is generated. This is HMAC at work. On the receiving end, AWS fetches the secret key using the access key provided in the request, runs the same algorithm on the request, and calculates the signature. A matching signature tells AWS that the request has not been tampered with, that it is not a signature copied from an older request, and that the requester has valid credentials.&lt;/p&gt;

&lt;p&gt;Every request sent to an AWS endpoint requires a signature calculated as above. It is usually sent in the headers section. However, it can also be sent as part of the query string. Since the signature is irreversible and does not expose the secret key, it is safe to share. If you would like someone to perform a specific operation on your behalf, you can share a pre-signed request.&lt;/p&gt;

&lt;p&gt;A typical usage of this is uploading and downloading from and to S3. You generate a pre-signed S3 upload or download URL. This URL will only work to perform the given operation on your behalf without making the bucket publicly accessible. It boils down to calculating the signature using your credentials and sharing the signature. Whoever sends the request using that signature is calling AWS as if it were you.&lt;/p&gt;
&lt;h3&gt;
  
  
  Basic Implementation
&lt;/h3&gt;

&lt;p&gt;If someone shares a pre-signed URL for &lt;code&gt;sts:GetCallerIdentity&lt;/code&gt;, then you can execute the API call to get their identity. In effect, this becomes your verifiable token.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Client
&lt;/h3&gt;

&lt;p&gt;In the client, create a method to get the authorization token (the pre-signed URL for us) as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public static String getAuthorizationHeader() {
  DefaultCredentialsProvider provider = DefaultCredentialsProvider.create();
  final Aws4Signer aws4Signer = Aws4Signer.create();

  final SdkHttpFullRequest httpFullRequest = SdkHttpFullRequest.builder()
          .host("sts.amazonaws.com").port(443)
          .protocol("https").method(SdkHttpMethod.POST)
          .putRawQueryParameter("Version", "2011-06-15")
          .putRawQueryParameter("Action", "GetCallerIdentity")
          .build();

  final SdkHttpFullRequest presign = aws4Signer.presign(httpFullRequest,
          Aws4PresignerParams.builder()
                  .awsCredentials(provider.resolveCredentials())
                  .signingName("sts")
                  .signingRegion(Region.US_EAST_1)
                  .build());

  return presign.encodedQueryParameters().orElseThrow();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this method, I am creating a call to the global STS endpoint. In the &lt;code&gt;httpFullRequest&lt;/code&gt; object, we specify what the request looks like and what parameters it will contain. Then we pass it to the &lt;code&gt;Aws4Signer&lt;/code&gt; to be signed. I am using the AWS SDK; however, if you would like to avoid that, you can do the same with a custom implementation. It only requires some lengthy code.&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: If you are worried about performance (as a good engineer is), rest assured that signing is done completely offline; calculating the signature does not involve a call to AWS.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then, when calling your endpoint, add the pre-signed query string as a header on the HTTP request. With Spring's RestTemplate, you can do it as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new RestTemplate()
        .exchange(
            RequestEntity.get(URI.create("https://example.com/api/echo"))
                .header("Authorization", getAuthorizationHeader())
                .build(), 
            String.class);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  On Your Server
&lt;/h3&gt;

&lt;p&gt;On your server, you need to get the authorization header and call the endpoint. This code is typically part of your security filter. I'm using JavaScript here; I originally wrote this solution to run in an AWS Lambda function as a Lambda authorizer. You can do something similar with Spring Security or any other framework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const query = new URLSearchParams(authHeader);
if (query.get('Action') !== 'GetCallerIdentity') {
    throw "I know what you may be trying to do, we are not going there"
}

const options = { hostname: 'sts.amazonaws.com', path: '/?' + query, method: 'POST', headers: {'Accept': "application/json"} };

return req(options) //req is a function I copied from here &amp;gt; https://gist.github.com/ktheory/df3440b01d4b9d3197180d5254d7fb65
    .then(({body}) =&amp;gt; {
        const principal = body.GetCallerIdentityResponse.GetCallerIdentityResult.Arn;
        const account = body.GetCallerIdentityResponse.GetCallerIdentityResult.Account;

        return isAllowed(account, principal, requestedPath)
    })
    .catch(response =&amp;gt; {
        console.log("Failed", response)
        return false
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: I am defining the host as &lt;a href="https://aws.amazon.com/iam/" rel="noopener noreferrer"&gt;sts.amazonaws.com&lt;/a&gt; and not accepting a full URL from the client. We wouldn’t want to call an attacker-controlled host. Hence, it is important to only accept the query string from the client. If the client sends a rogue string, we trust AWS to handle it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The auth header contains the query string part of the request from our client. We will now execute the HTTP call to AWS. If we get a valid response (like the JSON shown above), we now know that AWS has verified that the pre-signed request is valid and that it was signed by &lt;strong&gt;amanuel.mekonnen&lt;/strong&gt; while assuming a role called &lt;strong&gt;TestRole&lt;/strong&gt;. We also know the AWS account &lt;strong&gt;123456789&lt;/strong&gt;. The logic to allow or deny the request can be evaluated based on these details.&lt;/p&gt;
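&lt;p&gt;The &lt;code&gt;isAllowed&lt;/code&gt; function in the snippet above is where your own authorization logic lives. A minimal sketch (the account allow-list and path rule are hypothetical; substitute your own policy):&lt;/p&gt;

```javascript
// Sketch of the isAllowed(...) decision used in the security filter above.
// The allow-list and the path rule are made up for illustration.
const ALLOWED_ACCOUNTS = new Set(['123456789']);

function isAllowed(account, principalArn, requestedPath) {
  if (!ALLOWED_ACCOUNTS.has(account)) return false; // unknown AWS account
  // e.g. only principals that assumed TestRole may call /api/* paths
  const isTestRole = principalArn.includes(':assumed-role/TestRole/');
  return isTestRole && requestedPath.startsWith('/api/');
}
```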

&lt;h3&gt;
  
  
  What is missing?
&lt;/h3&gt;

&lt;p&gt;The above implementation has some gaps. For some low-stakes situations, it is sufficient, as long as you are on a secured connection (HTTPS, for example) and you do not log the authorization headers.&lt;/p&gt;

&lt;p&gt;When the stakes are high, you should know that the basic approach is susceptible to the infamous playback attack. The pre-signed request is valid for a long period of time (as long as the assumed role is valid). That means, once an attacker gains access to this header, multiple calls can be made to our endpoint impersonating poor &lt;code&gt;amanuel.mekonnen&lt;/code&gt; (shoot him a message on LinkedIn BTW).&lt;/p&gt;

&lt;h3&gt;
  
  
  How do we improve this?
&lt;/h3&gt;

&lt;p&gt;We need to make sure the signature expires quickly and that it is unique per request. For that, we can leverage the query string to be more specific. AWS can be instructed to take more parameters into consideration when validating the signature. Taking advantage of that, here is a modified version of my code that adds an integrity check.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public static String getAuthorizationHeader(String requestedPath) {
  DefaultCredentialsProvider provider = DefaultCredentialsProvider.create();

  final Aws4Signer aws4Signer = Aws4Signer.create();

  final SdkHttpFullRequest httpFullRequest = SdkHttpFullRequest.builder()
          .host("sts.amazonaws.com")
          .port(443)
          .protocol("https")
          .putRawQueryParameter("Version", "2011-06-15")
          .putRawQueryParameter("Action", "GetCallerIdentity")
          .putRawQueryParameter("X-Auth-Timestamp", String.valueOf(Instant.now(Clock.systemUTC()).getEpochSecond()))
          .putRawQueryParameter("X-Auth-Path", requestedPath)
          .method(SdkHttpMethod.POST)
          .build();

  final SdkHttpFullRequest presign = aws4Signer.presign(httpFullRequest,
          Aws4PresignerParams.builder()
                  .awsCredentials(provider.resolveCredentials())
                  .signingName("sts")
                  .signingRegion(Region.US_EAST_1)
                  .build());

  log.info("Pre-signed {}", presign.getUri());

  final byte[] base64 = Base64.getEncoder()
          .encode(presign.encodedQueryParameters().orElseThrow().getBytes(StandardCharsets.UTF_8));

  return new String(base64, StandardCharsets.UTF_8);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: I have also encoded the result into Base64 to make it look more token-like.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now with that modification in place in the client, we can validate a couple of more items in our security filter. We will verify that the requested path matches as well as that the signature was not made too long ago. More variables can be added to your liking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const query = new URLSearchParams(Buffer.from(authHeader, 'base64').toString('utf8'));
if (query.get('Action') !== 'GetCallerIdentity') {
    throw "I know what you may be trying to do, we are not going there"
}

if (query.get('X-Auth-Path') !== requestedPath) {
    throw "Tampering detected, I got you"
}

const currentEpochSeconds = Math.floor((new Date()).getTime() / 1000);
if ((currentEpochSeconds - query.get('X-Auth-Timestamp')) &amp;gt; 10) {
    throw "It has been a while since this request was signed."
}

const options = { hostname: 'sts.amazonaws.com', path: '/?' + query, method: 'POST', headers: {'Accept': "application/json"} };

return req(options) //req is a function I copied from here &amp;gt; https://gist.github.com/ktheory/df3440b01d4b9d3197180d5254d7fb65
    .then(({body}) =&amp;gt; {
        const principal = body.GetCallerIdentityResponse.GetCallerIdentityResult.Arn;
        const account = body.GetCallerIdentityResponse.GetCallerIdentityResult.Account;

        return isAllowed(account, principal, requestedPath)
    })
    .catch(response =&amp;gt; {
        console.log("Failed", response)
        return false
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: The more specific the parameters are, the smaller the surface for a playback attack. You can go further and make each request unique by generating a random value that is shared through the headers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog, I've discussed some key concepts such as the Basic authentication scheme, the Digest authentication scheme, and HMAC. Building on that, we discussed how AWS uses HMAC and how we can utilize a pre-signed URI as a proxy for identifying a client's IAM identity. We also touched on how to harden the approach against man-in-the-middle attacks, more specifically playback attacks.&lt;/p&gt;

&lt;p&gt;The basic solution is suitable for most scenarios running in a secure context such as HTTPS. If authorization headers leak into your logs, however, it is vulnerable to exploitation by whoever can read those logs. You can mitigate this by adding the improvements suggested above to support integrity checks, at virtually no additional cost.&lt;/p&gt;

&lt;p&gt;I will follow up with a smaller blog to show a more specific example including code on how to implement it on your API Gateway.  In the meantime, If you have suggestions on how it can be improved or even how it may be done better, please reach out.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AWS CloudDay: Switzerland isn't in the clouds yet</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Tue, 17 Oct 2023 07:51:06 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/aws-cloudday-switzerland-isnt-in-the-clouds-yet-1306</link>
      <guid>https://forem.com/pcg_public_cloud_group/aws-cloudday-switzerland-isnt-in-the-clouds-yet-1306</guid>
      <description>&lt;p&gt;&lt;em&gt;Or how the Cloud Day in Zürich highlighted the Swiss companies' cautious approach to the cloud&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by &lt;a class="mentioned-user" href="https://dev.to/samudurand"&gt;@samudurand&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Recently, on the 26th of September, I attended the &lt;a href="https://aws.amazon.com/events/cloud-day-zurich/" rel="noopener noreferrer"&gt;AWS CloudDay&lt;/a&gt; in Zürich as part of the &lt;a href="https://kreuzwerker.de/" rel="noopener noreferrer"&gt;kreuzwerker&lt;/a&gt; booth crew. The event was packed with interesting talks and fun activities, including keynotes, a football kicking challenge, and various other partner-sponsored events. As I expected, the whole day was pretty busy due to the many visitors coming to our stand, asking about our services and how we could help them solve their current challenges.&lt;/p&gt;

&lt;p&gt;However, unlike many other conferences I took part in previously, most visitors were either students, engineers, or independent consultants. Very few decision makers and stakeholders, such as CEOs, CTOs, startup owners or even simple tech and team leads, seemed to be present. Talking with our visitors quickly revealed that most were in one of two situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;they knew about the cloud, but weren't using it in their daily jobs&lt;/li&gt;
&lt;li&gt;they had heard about the cloud, but didn't know much about it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most wished they worked with cloud technologies, at least for part of their workload, but almost none currently did. Many were really suffering from the lack of flexibility (in terms of resources, scalability and technology) and from the other typical restrictions inherent to an on-premise installation. When I suggested that they might be able to influence their higher-ups to give the cloud a try, I often received a "no way that's happening anytime soon" response, which was a much more intense and definitive reaction than I usually get in other circumstances.&lt;/p&gt;

&lt;p&gt;When I asked why this was so unlikely to happen, I got similar answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there were privacy and security concerns since they worked in domains with sensitive data, such as financing, banking or health care&lt;/li&gt;
&lt;li&gt;the data location had to be known and the access very strict&lt;/li&gt;
&lt;li&gt;the on-premise infrastructure had existed for years and was really reliable&lt;/li&gt;
&lt;li&gt;the simple idea of the cloud was enough to make their superiors very uncomfortable, just because it was "the cloud, this mysterious and unknown thing"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those answers prompted me to look up some statistics about the current cloud usage by Swiss companies, and indeed there seems to be a strong "fear of the cloud" feeling still present today in many Swiss-based businesses. The figure below, which uses data from 2021, illustrates this situation. This fits with the picture I could draw from the conversations I had that day.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/4mDhCAx0u3557H4WeRNUPK/9cad7445eaba67133785f273698219f2/CloudDayZurich-Graphic__1_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/4mDhCAx0u3557H4WeRNUPK/9cad7445eaba67133785f273698219f2/CloudDayZurich-Graphic__1_.png" alt="CloudDayZurich-Graphic (1)"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: IDC Whitepaper, July 2021, IDC European Infrastructure and Multicloud Survey&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's crucial to understand that the cloud's benefits often outweigh the perceived risks. For starters, cloud providers such as AWS invest significantly in the latest security and compliance requirements, ensuring that data remains safe and adherent to strict industry standards. This can make it even more secure than many on-premise solutions, where it is difficult to keep up with the ever-changing technological and security landscape. Moreover, the cloud offers geographic flexibility by allowing data to be stored in specific regions, catering to data residency requirements, especially now that &lt;a href="https://aws.amazon.com/local/switzerland/" rel="noopener noreferrer"&gt;AWS has deployed a region in Zürich&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, while on-premise infrastructures may have stood the test of time, the cloud introduces cost-efficiency, scalability, and flexibility, which can be invaluable for businesses aiming for growth and adapting to change. One can start small and scale resources up or down based on the needs, without incurring unnecessary expenses on unused capacities. Lastly, the "mysterious" nature of the cloud can be demystified through education and training. As more individuals and decision-makers become familiar with its potential, the cloud's advantages become evident, rendering it not so hard to grasp after all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AWS CloudDay in Zurich is always a great event to attend. It provides opportunities for new leads, but also allows us to get a feel for the current state of the Swiss cloud landscape, which is currently relatively small (compared with other regions of the globe). It will take some work to make that mentality change, but that's our job at kreuzwerker. We support a wide range of customers, from startups to SMEs, in all phases of their cloud transformation journey: designing, building, migrating, modernizing, and optimizing their cloud landscape.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Practicing the Principle of Least Privilege</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Tue, 17 Oct 2023 07:45:20 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/practicing-the-principle-of-least-privilege-1h04</link>
      <guid>https://forem.com/pcg_public_cloud_group/practicing-the-principle-of-least-privilege-1h04</guid>
      <description>&lt;p&gt;&lt;strong&gt;Access control in multi-tenant applications using tagged sessions and IAM permission policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Sharib Jafari&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle of least privilege
&lt;/h2&gt;

&lt;p&gt;If you've been working with security or cloud or security in the cloud, you probably already know about &lt;a href="https://en.wikipedia.org/wiki/Principle_of_least_privilege" rel="noopener noreferrer"&gt;the principle of least privilege&lt;/a&gt;. Simply put, it states that each entity must be able to access only the information and resources necessary for its legitimate purpose.&lt;/p&gt;

&lt;p&gt;As simple as it sounds, adhering to this principle can be challenging. There are multiple ways and levels in which you can apply this principle. &lt;a href="https://kreuzwerker.de/en/post/isolating-aws-resources-for-a-secure-multi-tenant-saas" rel="noopener noreferrer"&gt;My last article&lt;/a&gt; was about strategies to isolate the compute and data resources to achieve access restriction and tenant isolation, but isolating shared resources can get tricky.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access control for shared resources
&lt;/h2&gt;

&lt;p&gt;There are many reasons for sharing AWS resources. For example, you might want to use a single S3 bucket for storing objects for all users instead of creating a dedicated bucket for each user and risk running out of the quota (currently 100 buckets per account). Similarly, you might want to access items from a single DynamoDB table instead of provisioning a new table for each user.&lt;/p&gt;

&lt;p&gt;In such a case, it's normal for developers to grant access to the entire bucket or table. As a result, isolation is hard-coded in the program that accesses these resources to ensure it only requests the objects and items corresponding to the current user. While this solution may work, relying on developers to access only the relevant items in good faith is not a robust way to uphold the principle of least privilege. This is especially true if you are in a regulated industry with strict security and compliance requirements. There are much better ways to embed such fine-grained access control. Let's discuss one such approach.&lt;/p&gt;
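&lt;p&gt;The "isolation in code" approach can be sketched as follows. This is an illustration in Python, not code from any real application; the helper names, bucket, and usernames are hypothetical.&lt;/p&gt;

```python
# Application-level isolation: nothing but this code path keeps users inside
# their own data, because the credentials involved can read any key in the
# bucket. All names here are hypothetical.

def object_key_for(username: str, object_name: str) -> str:
    """Build the S3 key the application intends to access."""
    return f"{username}/{object_name}"

def fetch_user_object(s3_client, bucket: str, username: str, object_name: str):
    # A bug here, or any other code path, could just as easily request
    # another user's key; IAM would not object.
    return s3_client.get_object(
        Bucket=bucket,
        Key=object_key_for(username, object_name),
    )
```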

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you're already familiar with the AWS services discussed later in the article, here's the approach in a nutshell. Continue reading if the following steps aren't clear yet.&lt;br&gt;
To provide fine-grained access control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;configure a &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-identity.html" rel="noopener noreferrer"&gt;Cognito Identity Pool&lt;/a&gt; to allocate an IAM role to authenticated users.&lt;/li&gt;
&lt;li&gt;configure Identity Pool to tag user sessions with OpenID claims using the &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/attributes-for-access-control.html" rel="noopener noreferrer"&gt;"Attributes for access control"&lt;/a&gt; feature.&lt;/li&gt;
&lt;li&gt;use &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_tags.html" rel="noopener noreferrer"&gt;Principal Tags&lt;/a&gt; inside the permission policies of the IAM role to restrict access.&lt;/li&gt;
&lt;li&gt;to access resources on behalf of a user, &lt;a href="https://docs.aws.amazon.com/cognitoidentity/latest/APIReference/API_GetCredentialsForIdentity.html" rel="noopener noreferrer"&gt;exchange the identity for credentials using Identity Pool&lt;/a&gt; and use these credentials to access resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;Let's consider a scenario to help us better understand the approach we will shortly discuss.&lt;/p&gt;

&lt;p&gt;Imagine you are building a multi-tenant application that allows its users to store and retrieve objects to and from an S3 bucket. Your application also allows the users to store some configurations as key/value pairs to a specific DynamoDB table.&lt;/p&gt;

&lt;p&gt;You are using the &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-user-identity-pools.html" rel="noopener noreferrer"&gt;Amazon Cognito User Pool&lt;/a&gt; for authentication and &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-integrate-with-cognito.html" rel="noopener noreferrer"&gt;authorization to the endpoints of your API Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/4pGcsYzPlUqF329L4f63dZ/f1c60f08a306d9821094805dfd2ce9cf/1share-s3-dynamo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/4pGcsYzPlUqF329L4f63dZ/f1c60f08a306d9821094805dfd2ce9cf/1share-s3-dynamo.jpeg" alt="1share-s3-dynamo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, when a user calls your API, they provide an authorization header that the Cognito User Pool uses to authenticate them. After authentication, the API invokes a Lambda function that, based on the request payload, allows the user to perform certain operations on an S3 bucket and DynamoDB table. All users of the application share the same bucket and table.&lt;/p&gt;

&lt;p&gt;The objective is to ensure that, instead of having access to the entire bucket or table, each user only has access to specific objects and items created by/for them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Amazon Cognito Identity Pools
&lt;/h2&gt;

&lt;p&gt;One important service that can help us achieve this fine-grained access control is &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/identity-pools.html" rel="noopener noreferrer"&gt;Amazon Cognito Identity Pools&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An identity pool can be linked to your identity provider (Amazon Cognito User Pool in our case) and then be used to allocate an IAM role to an authenticated user. You configure an IAM role as the "Authenticated Role" that authenticated users can assume.&lt;/p&gt;

&lt;p&gt;For our scenario, this role will have permission policies to grant access to the S3 bucket and the DynamoDB table. Once the user authenticates to your service via the Cognito User Pool, you can use the identity token provided by the user pool and exchange it for temporary AWS credentials. With these credentials, you can have your service assume the Authenticated Role and access the protected resources.&lt;/p&gt;
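&lt;p&gt;As a rough sketch of that exchange in Python (boto3 assumed; the pool IDs, region, and helper names are placeholders, and error handling is omitted):&lt;/p&gt;

```python
# Exchange a Cognito User Pool identity token for temporary AWS credentials
# via an Identity Pool. Values like the pool IDs are placeholders.

def cognito_logins_map(region: str, user_pool_id: str, id_token: str) -> dict:
    """Build the Logins map Cognito Identity expects for a User Pool token."""
    return {f"cognito-idp.{region}.amazonaws.com/{user_pool_id}": id_token}

def credentials_for_user(region: str, identity_pool_id: str,
                         user_pool_id: str, id_token: str):
    import boto3  # imported here so the sketch stays self-contained
    client = boto3.client("cognito-identity", region_name=region)
    logins = cognito_logins_map(region, user_pool_id, id_token)
    identity = client.get_id(IdentityPoolId=identity_pool_id, Logins=logins)
    resp = client.get_credentials_for_identity(
        IdentityId=identity["IdentityId"], Logins=logins
    )
    # Contains AccessKeyId, SecretKey, SessionToken, Expiration
    return resp["Credentials"]
```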
&lt;h3&gt;
  
  
  Permission Policy for S3
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Permission Policy for DynamoDB
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DeleteItem",
                "dynamodb:GetItem",
                "dynamodb:UpdateItem"
            ],
            "Resource": "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/TABLE_NAME"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you are wondering how exactly it provides fine-grained access, that's because it doesn't...yet. With these policies in place, an authenticated user will still have access to all the objects inside the S3 bucket and all the items inside the DynamoDB table, via our application. At this point, one might stop relying on policies to restrict access and instead rely on the code to access the correct object or item. Let's take this solution further to restrict object access at the policy level.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tag user session with claims
&lt;/h2&gt;

&lt;p&gt;Amazon Cognito Identity Pool allows you to map claims from your identity provider to key-value attributes and tag the current user session with them.&lt;/p&gt;

&lt;p&gt;In the example below, you can see the default claim mapping that tags each user session with "username:sub", where "sub" is the OpenID Connect Subject claim and is usually a unique key that you can use to identify individual users.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/VIXueGplYip8XOt9iscQ5/f336cde856596885290926474dbcf1f7/2image-2023-10-9_17-7-40.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/VIXueGplYip8XOt9iscQ5/f336cde856596885290926474dbcf1f7/2image-2023-10-9_17-7-40.png" alt="2image-2023-10-9 17-7-40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make sure to update the trust policy of the authenticated IAM role as shown below for this mapping to work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "cognito-identity.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRoleWithWebIdentity",
                "sts:TagSession"
            ],
            "Condition": {
                "StringEquals": {
                    "cognito-identity.amazonaws.com:aud": "IDENTITY_POOL_ID"
                },
                "ForAnyValue:StringLike": {
                    "cognito-identity.amazonaws.com:amr": "authenticated"
                }
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this configuration, you can distinguish between user sessions established with the temporary credentials issued by the Identity Pool. You can map each session to an individual user even though they all assume the same authenticated role. This is crucial for the next part, where we'll update the previously listed permission policies slightly to restrict access to only the objects and items that belong or correspond to the current user.&lt;/p&gt;
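&lt;p&gt;To make the mechanics concrete, here is a small Python illustration of the substitution IAM performs when a permission policy references a principal tag such as ${aws:PrincipalTag/username}. This only mimics the evaluation logic for demonstration; the real substitution happens inside IAM at request time.&lt;/p&gt;

```python
# Illustration only: resolve IAM policy variables of the form
# ${aws:PrincipalTag/KEY} against a session's principal tags, the way
# IAM does when evaluating a Resource ARN.

def resolve_policy_variables(resource_arn: str, principal_tags: dict) -> str:
    out = resource_arn
    for key, value in principal_tags.items():
        out = out.replace("${aws:PrincipalTag/" + key + "}", value)
    return out
```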

&lt;h2&gt;
  
  
  Updated Policies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Updated Permission Policy for S3
&lt;/h3&gt;

&lt;p&gt;Amazon S3 has a flat structure instead of a hierarchy. However, it can mimic the concept of a folder using a shared name prefix. For example, you can group two separate objects by simply naming them "GROUP_1/OBJ_1" and "GROUP_1/OBJ_2" where "GROUP_1" is the prefix and "OBJ_X" is the name of the individual object.&lt;br&gt;
You can leverage this to provide fine-grained access control by grouping all the objects belonging to a user under the same username prefix and modifying the "Resource" attribute of your S3 permission policy as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/${aws:PrincipalTag/username}/*"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above policy ensures that users can access only the object group with the same prefix as the username used to tag the current session. It makes it impossible for any service that has assumed the authenticated role to access S3 objects with a different username prefix.&lt;/p&gt;
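&lt;p&gt;A minimal Python sketch of what this means for a service using the tagged session's credentials (boto3 assumed; bucket name and usernames are placeholders). The allowed_by_policy helper only mimics the Resource pattern for illustration; the actual enforcement happens inside IAM.&lt;/p&gt;

```python
# With the tagged session's credentials, the prefix restriction is enforced
# by IAM, not by application code. Names below are hypothetical.

def allowed_by_policy(key: str, session_username: str) -> bool:
    """Mimic the Resource pattern BUCKET_NAME/${aws:PrincipalTag/username}/*."""
    return key.startswith(session_username + "/")

def user_scoped_s3_client(credentials, region: str):
    import boto3
    return boto3.client(
        "s3",
        region_name=region,
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretKey"],
        aws_session_token=credentials["SessionToken"],
    )

# s3 = user_scoped_s3_client(creds, "eu-central-1")
# s3.get_object(Bucket="BUCKET_NAME", Key="alice/report.pdf")  # allowed if tag username=alice
# s3.get_object(Bucket="BUCKET_NAME", Key="bob/report.pdf")    # AccessDenied by policy
```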

&lt;h3&gt;
  
  
  Updated Permission Policy for DynamoDB
&lt;/h3&gt;

&lt;p&gt;DynamoDB has the concept of a partition key that can identify the items inside a table. You can also combine a partition key with a sort key to create a composite primary key. For example, if you are storing the app-specific configurations for each user inside a DynamoDB table, you can have a composite key made up of the username as the partition key and the app name as the sort key. This way, the table will look something like the one below.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/4K2nUKPALhCt8VEiYoH4vz/65bc597f124c198609433229c7e8a011/3image-2023-10-9_19-29-49.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/4K2nUKPALhCt8VEiYoH4vz/65bc597f124c198609433229c7e8a011/3image-2023-10-9_19-29-49.png" alt="3image-2023-10-9 19-29-49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you introduce a username partition key, you can modify the DynamoDB policy of the authenticated IAM role to include a condition. This ensures that the services that assume this role have access to only the items whose partition key matches the username tag of the current session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DeleteItem",
                "dynamodb:GetItem",
                "dynamodb:UpdateItem"
            ],
            "Resource": "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/TABLE_NAME",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": "${aws:PrincipalTag/username}"
                }
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
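&lt;p&gt;For illustration, this is how a service might read a user's configuration under that policy (Python with boto3 assumed; the table and attribute names are hypothetical). The LeadingKeys condition means these calls only succeed when the partition key equals the session's username principal tag.&lt;/p&gt;

```python
# Access a composite-key item; the IAM condition above, not this code,
# is what blocks requests for other users' partition keys.
# Table and attribute names are hypothetical.

def config_item_key(username: str, app_name: str) -> dict:
    """Composite primary key: username (partition) + app (sort)."""
    return {"username": {"S": username}, "app": {"S": app_name}}

def get_user_config(dynamodb_client, table: str, username: str, app_name: str):
    return dynamodb_client.get_item(
        TableName=table,
        Key=config_item_key(username, app_name),
    )
```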



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Applying the principle of least privilege requires extra care when sharing resources among multiple tenants or users. Allocating IAM roles to authenticated users and tagging the user session with claims gives us fine-grained control over resource access. In this article, we explored how to use this method to share the same S3 bucket and DynamoDB table. The same technique can be applied to control access to several other AWS services.&lt;/p&gt;

&lt;p&gt;If you found this post insightful and want to stay updated with more tips and experiences, I invite you to follow me on Medium and LinkedIn for future content. Let’s continue this journey of growth and success together!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
      <category>saas</category>
    </item>
    <item>
      <title>Achieving successful SaaS transformations</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Mon, 25 Sep 2023 14:28:14 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/achieving-successful-saas-transformations-2d4p</link>
      <guid>https://forem.com/pcg_public_cloud_group/achieving-successful-saas-transformations-2d4p</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key Success Factors to keep in mind when developing your SaaS product with AWS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Gopal Tayal&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Have you ever wondered why you no longer install software on your computer like you used to in the earlier part of this century? The answer: SaaS. Software as a Service (SaaS) is a modern way of obtaining and using software. It's like renting software online instead of buying physical copies. SaaS has gained popularity since the 2010s because it simplifies IT management and reduces costs compared to traditional software, as it is usually available to customers on a subscription or pay-per-use basis. But, since users can easily switch to other options, it's essential to develop your SaaS offering swiftly and to ensure a good user experience. In this blog, we'll give you some tips on how to successfully transform your SaaS application to gain a competitive edge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ensure frictionless onboarding
&lt;/h3&gt;

&lt;p&gt;Put yourself in your customer's shoes and imagine purchasing a new SaaS product and wanting to use it right away. However, if it takes hours or even days to get started, it can be frustrating. Users prefer a seamless transition from being a new customer to actively using the system without feeling the onboarding process. By designing an architecture that ensures a smooth and repeatable onboarding process, SaaS companies can efficiently introduce new users to their system.&lt;/p&gt;

&lt;p&gt;In the AWS environment, &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/what-is-amazon-cognito.html#feature-overview" rel="noopener noreferrer"&gt;Cognito&lt;/a&gt;, with its user pools and identity pools, can assist in onboarding new users and granting access to specific application features based on their subscription tiers while IaC with &lt;a href="https://aws.amazon.com/de/cdk/" rel="noopener noreferrer"&gt;AWS CDK&lt;/a&gt; can be used for automatic resource provisioning for tenants. This not only supports growth, but also accelerates the time it takes for customers to see value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maximize observability
&lt;/h3&gt;

&lt;p&gt;A single outage or application health issue can affect all tenants, potentially causing a service disruption for all customers. SaaS organizations need to prioritize creating an operational experience that allows operations teams to efficiently handle the dynamic workloads of a SaaS environment. While maintaining an overall view of system health is important, it's equally crucial for the operations team to monitor the health of individual tenants and their respective tiers.&lt;/p&gt;

&lt;p&gt;To achieve this, logs and metrics should include tenant context details, such as tenant identifiers and tiers, to ensure effective monitoring and troubleshooting. The first step is instrumenting the application so that metrics are collected and sent; &lt;a href="https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html" rel="noopener noreferrer"&gt;AWS X-Ray&lt;/a&gt; is a cloud-native option for this.&lt;/p&gt;

&lt;p&gt;Besides proactively resolving faults, observability in a SaaS environment offers several other advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Responsive Feature Development:&lt;/strong&gt; Observability enables rapid responses to market and user needs by leveraging metrics and tenant-centric views. This includes tracking user interactions like page load times, clicks, and REST calls. &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html" rel="noopener noreferrer"&gt;CloudWatch metrics&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html" rel="noopener noreferrer"&gt;dashboards&lt;/a&gt; can be easily configured to provide these insights. &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/monitoring-quicksight.html" rel="noopener noreferrer"&gt;QuickSight&lt;/a&gt; is also a valuable tool for data analysis on the collected metrics, helping organizations enhance reliability, scalability, cost-effectiveness, and agility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced User Experience:&lt;/strong&gt; Monitoring resource consumption (such as storage and compute) allows for proactive resource provisioning via &lt;a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html" rel="noopener noreferrer"&gt;auto scaling groups&lt;/a&gt; to prevent issues (&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/saas-lens/noisy-neighbor.html" rel="noopener noreferrer"&gt;Noisy Neighbours&lt;/a&gt;) and ensure a smooth user experience. These metrics can also be incorporated as a part of &lt;a href="https://aws.amazon.com/de/blogs/architecture/throttling-a-tiered-multi-tenant-rest-api-at-scale-using-api-gateway-part-1/" rel="noopener noreferrer"&gt;scaling policies&lt;/a&gt; to help manage resource consumption efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Provide accurate bills
&lt;/h3&gt;

&lt;p&gt;Metering and billing rely on per-tenant usage data, which can be efficiently tracked using &lt;a href="https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html" rel="noopener noreferrer"&gt;tagging&lt;/a&gt; practices with the &lt;a href="https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html" rel="noopener noreferrer"&gt;Cost and Usage Report&lt;/a&gt;. Serverless architectures with &lt;a href="https://aws.amazon.com/de/lambda/" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; ensure costs are directly correlated with tenant consumption. &lt;a href="https://aws.amazon.com/de/fargate/" rel="noopener noreferrer"&gt;AWS Fargate&lt;/a&gt; (with EKS or ECS) provides a container-based option for this serverless approach, ensuring you bill only for the compute resources actually consumed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limit customisations
&lt;/h3&gt;

&lt;p&gt;In the world of SaaS, customers often seek specific customizations. However, it's important not to get carried away and transform your application into a unique tool for each customer as soon as the money starts flowing in. To maintain control and best practices, it's vital to understand your application's capabilities and use feature flags for new customizations.&lt;/p&gt;

&lt;p&gt;Feature flags, a common tool among developers, allow multiple paths of execution within a shared code base. Each flag enables or disables different capabilities at runtime and can be linked to tenant configuration options. The configurations are evaluated at runtime and determine the features enabled for each tenant, with &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html" rel="noopener noreferrer"&gt;Parameter Store&lt;/a&gt; serving as a useful storage solution for such configurations.&lt;/p&gt;
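&lt;p&gt;As a small illustration of runtime flag evaluation in Python (the parameter naming scheme and JSON layout are assumptions for this example, not a Parameter Store convention):&lt;/p&gt;

```python
# Evaluate a tenant's feature flags, stored as a JSON document such as
# {"dark_mode": true, "beta_api": false}. In practice the value could come
# from Parameter Store, e.g. a parameter named like /tenants/TENANT_ID/features
# fetched with boto3's ssm.get_parameter (naming scheme is an assumption).
import json

def features_enabled(parameter_value: str) -> set:
    """Parse a tenant's feature-flag document into the set of enabled flags."""
    flags = json.loads(parameter_value)
    return {name for name, enabled in flags.items() if enabled}

def is_enabled(feature: str, parameter_value: str) -> bool:
    return feature in features_enabled(parameter_value)
```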

&lt;p&gt;The key takeaway here is that even if a single tenant needs a unique feature, that feature should be introduced as a customization to the core platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test, test and test!
&lt;/h3&gt;

&lt;p&gt;Efficient tests are vital for meeting consumer standards in your SaaS product. In a SaaS environment, consider the following tests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Tenant Impact Tests:&lt;/strong&gt; Simulate scenarios where some tenants put an excessive load on your system to assess its resilience and how it impacts the performance for the other tenants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant Consumption Tests:&lt;/strong&gt; Develop various load profiles (e.g., steady, intermittent, and random) to analyze resource usage. This helps to identify discrepancies between actual and expected consumption, which can also help fine-tune resource provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant Workflow Tests:&lt;/strong&gt; Focus on key workflows within your solution and test them with multiple tenants using the same workflow at the same time to spot bottlenecks or resource allocation issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant Isolation Testing:&lt;/strong&gt; Continuously validate the policies and mechanisms that secure each tenant's data and infrastructure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Integrate these tests into your automatic deployment pipelines using &lt;a href="https://aws.amazon.com/de/codebuild/" rel="noopener noreferrer"&gt;AWS CodeBuild&lt;/a&gt; and &lt;a href="https://aws.amazon.com/de/codepipeline/" rel="noopener noreferrer"&gt;AWS CodePipeline&lt;/a&gt; for streamlined quality assurance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post, we have shared the knowledge gained from successful SaaS transformations on AWS to assist you in doing the same. By embracing these practices, SaaS teams can gain a competitive advantage, and at kreuzwerker we can support you in achieving this goal with the help of AWS.&lt;/p&gt;

&lt;p&gt;If you found this article useful or if you have more tips and tricks, please let us know; we'd be happy to hear from you. If you want to hear more from us about SaaS, check out some of our other blogs on the topic.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>agile</category>
      <category>engineering</category>
      <category>saas</category>
    </item>
    <item>
      <title>Isolating AWS Resources for a Secure Multi-Tenant SaaS</title>
      <dc:creator>Public Cloud Group</dc:creator>
      <pubDate>Mon, 25 Sep 2023 14:14:31 +0000</pubDate>
      <link>https://forem.com/pcg_public_cloud_group/isolating-aws-resources-for-a-secure-multi-tenant-saas-173i</link>
      <guid>https://forem.com/pcg_public_cloud_group/isolating-aws-resources-for-a-secure-multi-tenant-saas-173i</guid>
      <description>&lt;p&gt;&lt;strong&gt;The need for tenant isolation and various strategies for isolating AWS resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by &lt;a class="mentioned-user" href="https://dev.to/sharibj"&gt;@sharibj&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the dynamic world of Software as a Service (SaaS), the concept of multi-tenancy has become a game-changer. It allows SaaS providers to efficiently serve multiple customers or tenants using a single application, effectively streamlining operations and reducing costs by pooling resources. However, beneath the surface of this efficiency lies a critical imperative: the need to implement robust isolation strategies. These strategies are vital to ensure the security, compliance, and strategic growth of your SaaS application.&lt;/p&gt;

&lt;p&gt;In this blog we'll explore why isolating your tenants is essential and delve into different isolation strategies, focusing on isolating Amazon Web Services (AWS) resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why should you isolate your tenants?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario in which tenants within your SaaS application inadvertently access each other's data or applications. This can lead to a cascade of unauthorized breaches, compromising the confidentiality and integrity of their sensitive information. Proper tenant isolation acts as a fortress against such cross-tenant access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance
&lt;/h3&gt;

&lt;p&gt;For many SaaS providers, achieving compliance with industry standards and regulations is non-negotiable. Stringent requirements such as GDPR or HIPAA necessitate rigorous data handling practices. Isolating tenants simplifies this task by providing clear boundaries. It ensures that each tenant's data is treated according to the stipulated standards, simplifying compliance implementation and demonstration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy
&lt;/h3&gt;

&lt;p&gt;In the competitive SaaS landscape, strategic thinking can make all the difference. Isolating tenants opens up the possibility of implementing a tiered service model or tenant specific customizations. This means you can offer different levels of service to tenants based on their unique requirements or subscription plans, providing a distinct competitive edge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opportunity
&lt;/h3&gt;

&lt;p&gt;Strong tenant isolation isn't merely a defensive measure; it's also a strategy to create new business opportunities. By offering robust isolation, you can attract high-profile clients with stringent security requirements. This opens doors to increased revenue and substantial business growth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the different isolation strategies for AWS resources?
&lt;/h2&gt;

&lt;p&gt;AWS provides a robust ecosystem for SaaS providers to implement tenant isolation effectively. Let's explore some key isolation strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure Isolation
&lt;/h3&gt;

&lt;p&gt;Isolating based on infrastructure helps separate tenants at the foundational level. There are several ways to achieve this in AWS.&lt;/p&gt;

&lt;h4&gt;
  
  
  Account Isolation
&lt;/h4&gt;

&lt;p&gt;Account Isolation stands out as one of the most straightforward strategies. It revolves around achieving comprehensive isolation by duplicating the entire resource stack for each tenant within an individual AWS account. Creating a dedicated AWS account for each tenant guarantees both resource and permission segregation. Additionally, it streamlines the process of tenant-level billing, providing clear financial separation. &lt;/p&gt;

&lt;p&gt;However, it's essential to consider scalability and cost. As the number of tenants grows, this approach may become less practical. While account creation itself is free, duplicating the entire stack for each tenant can incur significant costs. Furthermore, automating the account creation process during tenant onboarding can prove to be a somewhat tedious task.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/1EUIOpQGn3HNtzHbXsH6qs/3f3b60dd7e08b94d71d79d88e2b10afb/1_-_account_isolation.jpg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/1EUIOpQGn3HNtzHbXsH6qs/3f3b60dd7e08b94d71d79d88e2b10afb/1_-_account_isolation.jpg" alt="Diagram 1 Secure Multi-Tenant SaaS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Network Isolation
&lt;/h3&gt;

&lt;p&gt;Network Isolation is a strategy that capitalizes on key networking components within Amazon Web Services (AWS), such as Virtual Private Clouds (VPCs) and Subnets, to partition and segregate tenants effectively. As your tenant base grows, you can easily scale by adding more VPCs and Subnets to accommodate new tenants. Additionally, automating the process of provisioning and managing tenant-specific network resources is notably more straightforward compared to automating the provisioning of an entire account.&lt;/p&gt;

&lt;p&gt;However, it's important to note that Network Isolation also introduces certain challenges, particularly in terms of accounting and billing. With multiple tenants sharing AWS resources, tracking and attributing costs to each tenant can become complex. This is where the need for precise accounting and resource attribution arises. To address these challenges effectively, consider implementing a robust tagging strategy. Tags are labels or metadata that you can associate with AWS resources. By applying tags to resources associated with each tenant, you gain granular control over billing and can generate detailed billing insights for each tenant.&lt;/p&gt;

&lt;p&gt;Let's take a look at two different ways to achieve Network Isolation.&lt;/p&gt;

&lt;h4&gt;
  
  
  VPC Partitioning
&lt;/h4&gt;

&lt;p&gt;This involves creating a separate Virtual Private Cloud (VPC) for each tenant, effectively isolating their network environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/7dGxSFDOwSHMyvPcxJLnKw/e32f5b06d046f112b761f7cf5a265186/2_-_vpc_isolation.jpg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/7dGxSFDOwSHMyvPcxJLnKw/e32f5b06d046f112b761f7cf5a265186/2_-_vpc_isolation.jpg" alt="Diagram 2 Secure Multi-Tenant SaaS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Subnet Partitioning
&lt;/h4&gt;

&lt;p&gt;Subnet partitioning entails using subnets within a VPC to create network boundaries around tenant resources, providing fine-grained control over network traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/5CplMcfOznpVfKHh8cIEsy/c75b046088245ac9044dd5c43016fbaa/3_-_subnet_isolation.jpg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/5CplMcfOznpVfKHh8cIEsy/c75b046088245ac9044dd5c43016fbaa/3_-_subnet_isolation.jpg" alt="Diagram 3 Secure Multi-Tenant SaaS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Serverless Isolation
&lt;/h3&gt;

&lt;p&gt;Serverless computing (encompassing services such as AWS Lambda, AWS Step Functions, and more) offers a practical solution that eliminates the complexity of tenant-specific provisioning altogether. These services are easier to provision and manage, making them a simpler and often more cost-effective option compared to provisioning dedicated and often under-utilized compute resources. In essence, serverless computing liberates you from the intricacies of isolating compute resources. Instead, you can focus on implementing robust data isolation methods to attain comprehensive tenant separation.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/60iSZwxqv9BLFQ5yNc5mgE/c69b84b5909c84e7876ee23c12f06c34/4_-_lambda_isolation.jpg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/60iSZwxqv9BLFQ5yNc5mgE/c69b84b5909c84e7876ee23c12f06c34/4_-_lambda_isolation.jpg" alt="Diagram 4 Secure Multi-Tenant SaaS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy Isolation
&lt;/h3&gt;

&lt;p&gt;In a pooled (all shared resources) or bridge (some shared resources) model of multi-tenancy, it is essential to make sure each tenant can only access the resources they are allowed to access. Policy isolation revolves around controlling access and permissions to AWS resources. This requires a robust tenant identity system so you can identify the tenant inside the AWS ecosystem and enforce the correct permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/hqu2g0tau160/2vZiquGCmePcyhe96tTcVd/47ecf18f009fff54bf5a960e840e7602/5_-_policy_isolation.jpg" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/hqu2g0tau160/2vZiquGCmePcyhe96tTcVd/47ecf18f009fff54bf5a960e840e7602/5_-_policy_isolation.jpg" alt="Diagram 5 Secure Multi-Tenant SaaS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is done, for instance, by exchanging the web identity for a tenant identity token. This token can then be used to recognize a tenant and allow them to assume a specific IAM (Identity and Access Management) role. These IAM roles serve as the cornerstone for provisioning permissions tailored to each tenant's specific role within the system. For instance, you can create distinct roles such as "Tenant Admin" and "Tenant Viewer," each with its defined scope of access.&lt;/p&gt;

&lt;p&gt;Policy isolation empowers SaaS providers with the precision needed to navigate the intricacies of shared resources, ensuring that each tenant accesses only what they are entitled to. It's a fundamental component in the arsenal of strategies for achieving comprehensive tenant isolation and maintaining the integrity and security of your SaaS application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Isolating AWS resources for your multi-tenant SaaS application is not merely a best practice; it is an imperative step to ensure the security, compliance, and scalability of your SaaS offering. By understanding and implementing the diverse isolation strategies AWS offers, you can confidently steer your SaaS offering toward success in a competitive landscape.&lt;/p&gt;

&lt;p&gt;AWS provides a wealth of tools and strategies to achieve robust tenant isolation. It's not just about having options; it's about harnessing the full potential of AWS to tailor your isolation approach to your unique requirements. With AWS, you have the flexibility to mix and match different isolation strategies based on your business needs and tenant personas.&lt;/p&gt;

&lt;p&gt;As you navigate the dynamic SaaS landscape, remember that tenant isolation is not just a security measure; it's a strategic advantage. It opens doors to new business opportunities, empowers compliance, and positions your SaaS application as a leader in the industry.&lt;/p&gt;

&lt;p&gt;So, embrace the power of AWS, explore its isolation strategies, and pave the way for your multi-tenant SaaS success. The journey may be complex, but with the right strategies and a commitment to excellence, your SaaS offering can thrive in the competitive world of multi-tenancy.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>server</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
