<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Swapnil Pawar</title>
    <description>The latest articles on Forem by Swapnil Pawar (@spawar1991).</description>
    <link>https://forem.com/spawar1991</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F615788%2F1873d044-cfc6-4037-907d-c23245e98b59.png</url>
      <title>Forem: Swapnil Pawar</title>
      <link>https://forem.com/spawar1991</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/spawar1991"/>
    <language>en</language>
    <item>
      <title>Manage Private EC2 Instances Without Internet Access Using AWS Systems Manager</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Thu, 27 Jan 2022 17:52:12 +0000</pubDate>
      <link>https://forem.com/spawar1991/manage-private-ec2-instances-without-internet-access-using-aws-systems-manager-20bi</link>
      <guid>https://forem.com/spawar1991/manage-private-ec2-instances-without-internet-access-using-aws-systems-manager-20bi</guid>
      <description>&lt;h4&gt;
  
  
  Did you know that you can manage your private EC2 instances using AWS Systems Manager?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say you have multiple EC2 instances deployed in a private subnet of your custom VPC. Due to your security posture requirements, you can't manage the instances directly: you can't SSH into EC2 instances that only have private IP addresses from the subnet's address space. &lt;/p&gt;

&lt;p&gt;Even with the AWS SSM Instance profile role configured and attached to the EC2 instance, you can’t directly manage the fleet of your private EC2 instances.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;How do you solve this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Virtual Private Cloud endpoint&lt;br&gt;
An interface VPC endpoint (interface endpoint) allows you to connect to services powered by AWS PrivateLink, a technology that allows you to privately access Amazon Elastic Compute Cloud (Amazon EC2) and Systems Manager APIs by using private IP addresses. AWS PrivateLink restricts all network traffic between your managed instances, Systems Manager, and Amazon EC2 to the Amazon network. This means that your managed instances don't require access to the Internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating VPC endpoints for Systems Manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the following information to create interface VPC endpoints for AWS Systems Manager. &lt;/p&gt;

&lt;p&gt;Amazon EC2 instances must be registered as managed instances to be managed with AWS Systems Manager. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow these steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Verify that SSM Agent is installed on the instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create an AWS Identity and Access Management (IAM) instance profile for Systems Manager. You can create a new role, or add the needed permissions to an existing role.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cSowJV7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qaajiu5ocy0fydw5dgiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cSowJV7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qaajiu5ocy0fydw5dgiw.png" alt="IAM Permissions" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#attach-iam-role"&gt;Attach the IAM role&lt;/a&gt; to your private EC2 instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to the EC2 console and note the VPC ID and Subnet ID of your private instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now, go to the Networking &amp;amp; Content Delivery section and select VPC → Endpoints.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i-dH7tar--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/igw17axfrum9sk9lkshz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i-dH7tar--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/igw17axfrum9sk9lkshz.png" alt="VPC Endpoint Creation" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;&lt;p&gt;For Service Name, select com.amazonaws.[region].ssm (for example, com.amazonaws.us-west-2.ssm). For a full list of Region codes, see Available Regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For VPC, choose the VPC ID for your instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Subnets, choose a Subnet ID in your VPC. For high availability, choose at least two subnets from different Availability Zones within the Region.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Enable DNS name, select Enable for this endpoint. For more information, see &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#vpce-private-dns"&gt;Private DNS for interface endpoints&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9H4s9QXb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2btdrl9e8acr0yxqiis3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9H4s9QXb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2btdrl9e8acr0yxqiis3.png" alt="Security Group" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="10"&gt;
&lt;li&gt;&lt;p&gt;For the Security group, select an existing security group, or create a new one. The security group must allow inbound HTTPS (port 443) traffic from the resources in your VPC that communicate with the service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you created a new security group, open the &lt;a href="https://console.aws.amazon.com/vpc"&gt;VPC console&lt;/a&gt;, choose Security Groups, and then select the new security group. On the Inbound Rules tab, choose Edit inbound rules. Add a rule with the following details, and then choose Save rules:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For Type, choose HTTPS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Source, choose your VPC CIDR. For a more restrictive configuration, you can allow only the CIDR ranges of the specific subnets used by your EC2 instances (a scripted version of this rule follows the list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
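
&lt;p&gt;A rough boto3 equivalent of the inbound rule above, assuming a placeholder security group ID and VPC CIDR:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

ec2 = boto3.client('ec2')

# Allow inbound HTTPS (443) from the VPC CIDR to the endpoint's security group
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',  # placeholder security group ID
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 443,
        'ToPort': 443,
        'IpRanges': [{'CidrIp': '10.0.0.0/16', 'Description': 'VPC CIDR'}]
    }]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;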

&lt;ol start="12"&gt;
&lt;li&gt;Under Policy, you can select the default “Full Access“ option, or create a “Custom” policy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DISqkMkH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xp4kh5yodzw05nerfboc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DISqkMkH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xp4kh5yodzw05nerfboc.png" alt="Custom Policy Creation" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To learn more about custom endpoint policies, see &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-create-vpc.html#sysman-endpoint-policies"&gt;https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-create-vpc.html#sysman-endpoint-policies&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repeat steps 5 through 12 with the following change:&lt;br&gt;
For Service Name, select &lt;strong&gt;com.amazonaws.[region].ec2messages&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Repeat steps 5 through 12 with the following change:&lt;br&gt;
For Service Name, select &lt;strong&gt;com.amazonaws.[region].ssmmessages&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;You must do this if you want to use &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html"&gt;Session Manager&lt;/a&gt;.&lt;/p&gt;
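
&lt;p&gt;The three interface endpoints can also be created programmatically. A minimal boto3 sketch, assuming us-west-2 and placeholder VPC, subnet, and security group IDs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Placeholder resource IDs for illustration
VPC_ID = 'vpc-0123456789abcdef0'
SUBNET_IDS = ['subnet-0123456789abcdef0', 'subnet-0fedcba9876543210']
SG_ID = 'sg-0123456789abcdef0'

# ssm, ec2messages, and ssmmessages are all required;
# ssmmessages is the one Session Manager uses.
for service in ('ssm', 'ec2messages', 'ssmmessages'):
    ec2.create_vpc_endpoint(
        VpcEndpointType='Interface',
        VpcId=VPC_ID,
        ServiceName=f'com.amazonaws.us-west-2.{service}',
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=[SG_ID],
        PrivateDnsEnabled=True
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;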

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3iD58ndF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fbxqsujltngcgeeno5qv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3iD58ndF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fbxqsujltngcgeeno5qv.png" alt="VPC Endpoints" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="13"&gt;
&lt;li&gt;After the three endpoints are created, your instance appears in Managed Instances and can be managed using Systems Manager.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c12VjePF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h8v1jzn1mueclv8u9urn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c12VjePF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h8v1jzn1mueclv8u9urn.png" alt="SSM Fleet Manager" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Optional:&lt;/strong&gt; For advanced setup, create policies for VPC interface endpoints for AWS Systems Manager.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;Note: If you have more than one subnet in the same Availability Zone, you don't need to create VPC endpoints for the extra subnets. Any other subnets within the same Availability Zone can access and use the interface.&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;SSM Agent requirements for instances&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AWS Systems Manager Agent (SSM Agent) is Amazon software that can be installed and configured on an EC2 instance, an on-premises server, or a virtual machine (VM). SSM Agent makes it possible for the Systems Manager to update, manage, and configure these resources.&lt;/p&gt;

&lt;p&gt;If the Amazon Machine Image (AMI) type you choose in the first procedure doesn't come with SSM Agent preinstalled, manually install the agent on the new instance before it can be used with Systems Manager. If SSM Agent isn't installed on the existing EC2 instance you choose in the second procedure, manually install the agent on the instance before it can be used with Systems Manager.&lt;/p&gt;

&lt;p&gt;SSM Agent is installed by default on the following AMIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Linux&lt;/li&gt;
&lt;li&gt;Amazon Linux 2&lt;/li&gt;
&lt;li&gt;Amazon Linux 2 ECS-Optimized Base AMIs&lt;/li&gt;
&lt;li&gt;macOS 10.14.x (Mojave), 10.15.x (Catalina), and 11.x (Big Sur)&lt;/li&gt;
&lt;li&gt;SUSE Linux Enterprise Server (SLES) 12 and 15&lt;/li&gt;
&lt;li&gt;Ubuntu Server 16.04, 18.04, and 20.04&lt;/li&gt;
&lt;li&gt;Windows Server 2008-2012 R2 AMIs published in November 2016 or later&lt;/li&gt;
&lt;li&gt;Windows Server 2016 and 2019&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;SSM Agent isn't installed on every AMI that is based on Amazon Linux or Amazon Linux 2.&lt;/p&gt;

&lt;p&gt;For information about manually installing SSM Agent on other Linux operating systems, see Installing and configuring SSM Agent on EC2 instances for Linux.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The alternative to using a VPC endpoint is to allow outbound internet access on your managed instances. In this case, the managed instances must also allow HTTPS (port 443) outbound traffic to the following endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ssm.region.amazonaws.com&lt;/li&gt;
&lt;li&gt;ssmmessages.region.amazonaws.com&lt;/li&gt;
&lt;li&gt;ec2messages.region.amazonaws.com&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SSM Agent initiates all connections to the Systems Manager service in the cloud. For this reason, you don't need to configure your firewall to allow inbound traffic to your instances for Systems Manager.&lt;/p&gt;

&lt;p&gt;For more information about calls to these endpoints, see Reference: ec2messages, ssmmessages, and other API operations.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>devops</category>
      <category>linux</category>
    </item>
    <item>
      <title>Serverless Notification System Implementation With Step Functions Workflow</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Thu, 30 Dec 2021 11:25:41 +0000</pubDate>
      <link>https://forem.com/spawar1991/serverless-notification-system-implementation-with-step-functions-workflow-4li7</link>
      <guid>https://forem.com/spawar1991/serverless-notification-system-implementation-with-step-functions-workflow-4li7</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Scenario&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A client is running call center workloads in production, using a third-party vendor system, “Verint“, that uploads recordings (.tar) to an Amazon S3 bucket for backup and DR purposes.&lt;/p&gt;

&lt;p&gt;The third-party Verint system creates the .tar files in the backend and uploads them to the S3 bucket using a multipart upload approach. The issue we found is that there was no way to track whether any chunk of a .tar file had failed to upload to the S3 bucket.&lt;/p&gt;

&lt;p&gt;Due to that, it created an issue from a &lt;strong&gt;compliance point of view&lt;/strong&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Approach&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To identify the root cause and a possible solution for identifying failed backups in Amazon S3, we need to check the S3 server access logs, and then develop and configure email alert notifications whenever a file backup to Amazon S3 fails.&lt;/p&gt;

&lt;p&gt;To learn more about S3 server access logging, see reference [1].&lt;/p&gt;

&lt;p&gt;To start, we need to check the S3 logs to get better visibility into the errors. To analyze S3 server access logs at scale, we used Amazon Athena, a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. &lt;/p&gt;

&lt;p&gt;Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.&lt;/p&gt;

&lt;p&gt;Multiple other processing steps were going to be involved, so we decided to use a workflow orchestration service (AWS Step Functions) [2] to automate the business process.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;strong&gt;The analytic queries in this blog post focus on the following use case:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store file upload logs in Amazon Simple Storage Service (Amazon S3 server-level logging)&lt;/li&gt;
&lt;li&gt;Use Athena to query the Amazon Simple Storage Service (Amazon S3) server access logs&lt;/li&gt;
&lt;li&gt;Use serverless orchestration with Step Functions to automate the notification workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;strong&gt;This complete workflow runs daily using a CloudWatch Events (cron) rule.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
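
&lt;p&gt;As a sketch of that schedule, a CloudWatch Events (EventBridge) rule can start the state machine daily. The rule name, state machine ARN, and IAM role ARN below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

events = boto3.client('events')

# Run once a day at 06:00 UTC
events.put_rule(
    Name='daily-failed-upload-report',
    ScheduleExpression='cron(0 6 * * ? *)',
    State='ENABLED'
)

# Point the rule at the Step Functions state machine.
# The role must allow events.amazonaws.com to call states:StartExecution.
events.put_targets(
    Rule='daily-failed-upload-report',
    Targets=[{
        'Id': 'notification-workflow',
        'Arn': 'arn:aws:states:us-east-1:123456789012:stateMachine:NotificationWorkflow',
        'RoleArn': 'arn:aws:iam::123456789012:role/EventsInvokeStepFunctions'
    }]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;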




&lt;h3&gt;
  
  
  &lt;strong&gt;Step Function Reference Workflow:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I used Step Functions Workflow Studio for the design, which makes it faster and easier to build workflows using a drag-and-drop interface in the AWS console.&lt;/p&gt;

&lt;p&gt;Let me show you how easy it is to create a state machine using Workflow Studio. To get started, go to the Step Functions console and create a state machine. You will see an option to start designing the new state machine visually with Workflow Studio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjc1l0k3f9li6w6lerg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjc1l0k3f9li6w6lerg1.png" alt="Step Function AWS Console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some of the available flow states:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Adds if-then-else logic.&lt;br&gt;
&lt;strong&gt;Parallel:&lt;/strong&gt; Adds parallel branches.&lt;br&gt;
&lt;strong&gt;Map:&lt;/strong&gt; Adds a for-each loop.&lt;br&gt;
&lt;strong&gt;Wait:&lt;/strong&gt; Delays for a specific time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vokjrk0uxvg1qnajdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vokjrk0uxvg1qnajdf.png" alt="Step Function Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above step function workflow is broken down step by step below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; &lt;br&gt;
The “Fetch Call Center Records” Lambda function executes an Athena query to get the daily S3 server access log records from the Athena table. To learn more about querying S3 server access logs and creating the table, see link [3].&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def lambda_handler(event, context):

    Previous_Date = datetime.datetime.strftime(datetime.datetime.now(), '%Y-%m-%d:%H:%M:%S')
    NDate = datetime.datetime.now() - datetime.timedelta(days=1)
    Next_Date = datetime.datetime.strftime(NDate, '%Y-%m-%d:%H:%M:%S')


    #Rendering Environment Variables
    AthenaDB = os.environ['athena_db']
    AthenaTable = os.environ['athena_table']
    AthenaoutputLocation = os.environ['athena_query_output']

    # number of retries
    RETRY_COUNT = 10

    #Initialize Boto3 Athena Client
    client = boto3.client('athena')

    #Query to get the non-200 (failed) S3 request records from the server access logs for the last 24 hours
    query = """SELECT bucket_name,key,httpstatus,requestdatetime,request_uri,errorcode, count(*) as total FROM "{}"."{}" where {} != '{}' and {} {} and {} BETWEEN {} and {} GROUP BY {}, {}, {}, {}, {}, {};""".format(AthenaDB, AthenaTable, 'httpstatus', '200', 'requester', 'IS NOT NULL', "parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')", "parse_datetime('"+str(Next_Date)+"','yyyy-MM-dd:HH:mm:ss')", "parse_datetime('"+str(Previous_Date)+"','yyyy-MM-dd:HH:mm:ss')", 'key', 'httpstatus', 'bucket_name', 'requestdatetime', 'request_uri', 'errorcode')
    # Execute the Athena query to get the failed-upload records
    try:
        # Athena Query Execution
        response = client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={
                'Database': AthenaDB
            },
            ResultConfiguration={
                "EncryptionConfiguration": {
                    "EncryptionOption": "SSE_S3"
                },
                'OutputLocation': AthenaoutputLocation
            }
        )
        if response:
            print("Successfully Executed:"+ response['QueryExecutionId'])
            # get query execution id
            query_execution_id = response['QueryExecutionId']
            print(query_execution_id)

            # get execution status
            for i in range(1, 1 + RETRY_COUNT):

                # get query execution
                query_status = client.get_query_execution(QueryExecutionId=query_execution_id)
                query_execution_status = query_status['QueryExecution']['Status']['State']

                if query_execution_status == 'SUCCEEDED':
                    print("STATUS:" + query_execution_status)
                    break

                if query_execution_status == 'FAILED':
                    raise Exception("STATUS:" + query_execution_status)

                else:
                    print("STATUS:" + query_execution_status)
                    time.sleep(i)
            else:
                client.stop_query_execution(QueryExecutionId=query_execution_id)
                raise Exception('TIME OVER')
            return response
        else:
            return build_internal_error_response("Unexpected error while completing the Athena StartQueryExecution API request")
    except Exception as ex:
        print(ex)
        return build_error_response('Customer error while making API request', str(ex))

def build_internal_error_response(internal_error_message, internal_error_details=None):
    return build_error_response(internal_error_message, internal_error_details, 'InternalError', 'InternalError')

def build_error_response(internal_error_message, internal_error_details=None, customer_error_code=None, customer_error_message=None):
    error_response = {
        'internalErrorMessage': internal_error_message,
        'internalErrorDetails': internal_error_details,
        'customerErrorMessage': customer_error_message,
        'customerErrorCode': customer_error_code
    }
    print(error_response)
    return error_response

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Added a Wait state to give the Athena query a little time to finish and upload its results to the S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Athena names the results file after the QueryExecutionId, so we fetch the QueryExecutionId to identify the S3 object in a later state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3rrar94y0k2zczrd97h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3rrar94y0k2zczrd97h.png" alt="Athena GetQueryExecution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Added a Choice state to branch based on the Succeeded or Failed response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; If it succeeded, we get the results file from S3 based on the QueryExecutionId.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6 &amp;amp; 7:&lt;/strong&gt; We use SES (Simple Email Service) to send the email notifications. Since this is a sandbox SES environment, we have verified a few identities and fetch only the list of verified identities (in case any identities are in a pending or failed status).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk0riezhjhf75cdc1tzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk0riezhjhf75cdc1tzd.png" alt="SES Identities"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxoekzsc3b8oahny2jsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxoekzsc3b8oahny2jsc.png" alt="SES GetIdentity Verification Attributes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8:&lt;/strong&gt; Created another “Process-***-Records” Lambda function to build the SES email and attach the S3 object, which contains the records that failed to upload to S3, to the email.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9:&lt;/strong&gt; If SES fails to send an email, the sysadmin is notified of the error to track down the issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 10:&lt;/strong&gt; If execution succeeds, the workflow ends in the Success state; otherwise it ends in the Failed state.&lt;/p&gt;
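
&lt;p&gt;For reference, the email-with-attachment part of Step 8 can be sketched with boto3 and the standard email module. The bucket, key, and addresses below are placeholders, and the sender must be a verified SES identity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

s3 = boto3.client('s3')
ses = boto3.client('ses')

# Fetch the Athena results file (named after the QueryExecutionId)
obj = s3.get_object(Bucket='athena-results-bucket', Key='query-execution-id.csv')

msg = MIMEMultipart()
msg['Subject'] = 'Daily report: failed S3 uploads'
msg['From'] = 'alerts@example.com'     # must be an SES-verified identity
msg['To'] = 'sysadmin@example.com'
msg.attach(MIMEText('Attached are the records that failed to upload to S3.'))

attachment = MIMEApplication(obj['Body'].read())
attachment.add_header('Content-Disposition', 'attachment', filename='failed-uploads.csv')
msg.attach(attachment)

ses.send_raw_email(
    Source=msg['From'],
    Destinations=[msg['To']],
    RawMessage={'Data': msg.as_string()}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;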

&lt;h3&gt;
  
  
  &lt;strong&gt;References:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;[1] &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerLogs.html" rel="noopener noreferrer"&gt;Logging requests using server access logging - Amazon Simple Storage Service&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;[2] &lt;a href="https://aws.amazon.com/step-functions" rel="noopener noreferrer"&gt;https://aws.amazon.com/step-functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-s3-access-logs-to-identify-requests.html" rel="noopener noreferrer"&gt;Using Amazon S3 access logs to identify requests - Amazon Simple Storage Service&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/" rel="noopener noreferrer"&gt;Analyze my Amazon S3 server access logs using Athena&lt;/a&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>lambda</category>
      <category>athena</category>
    </item>
    <item>
      <title>Easily transfer large amounts of data from one Amazon S3 bucket to another bucket</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Fri, 24 Dec 2021 15:36:30 +0000</pubDate>
      <link>https://forem.com/spawar1991/easily-transfer-large-amounts-of-data-from-one-amazon-s3-bucket-to-another-bucket-55oc</link>
      <guid>https://forem.com/spawar1991/easily-transfer-large-amounts-of-data-from-one-amazon-s3-bucket-to-another-bucket-55oc</guid>
      <description>&lt;p&gt;Recently, while working on a project, I came across the task of moving terabytes (1 TB or more) of data from one Amazon S3 bucket to another S3 bucket.&lt;/p&gt;

&lt;p&gt;First of all, you can't copy such a large number of objects using the AWS S3 console. It isn't a convenient approach, and copying that data manually would take months.&lt;/p&gt;

&lt;p&gt;For this particular use case, I chose the “Parallel uploads” option using the AWS Command Line Interface (AWS CLI).&lt;/p&gt;

&lt;p&gt;Depending on your use case, you can perform the data transfer between buckets using one of the following options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run parallel uploads using the AWS Command Line Interface (AWS CLI)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use an AWS SDK&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use cross-Region replication or same-Region replication&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Amazon S3 batch operations&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use S3DistCp with Amazon EMR&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use AWS DataSync&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Resolution:&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Run parallel uploads using the AWS CLI
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As a best practice, be sure that you're using the most recent version of the AWS CLI. For more information, see Installing the AWS CLI.&lt;/p&gt;

&lt;p&gt;You can split the transfer into multiple mutually exclusive operations to improve the transfer time by multi-threading. For example, you can run multiple, parallel instances of aws s3 cp, aws s3 mv, or aws s3 sync using the AWS CLI. &lt;/p&gt;

&lt;p&gt;You can create more upload threads while using the --exclude and --include parameters for each instance of the AWS CLI. These parameters filter operations by file name.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The --exclude and --include parameters are processed on the client side. Because of this, the resources of your local machine might affect the performance of the operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, to copy a large amount of data from one bucket to another where the file names begin with a common prefix, you can run the following commands on two instances of the AWS CLI. First, run this command to copy the files with names that begin with “logs2019-09-16”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://samplebucket-logs/ s3://sampledestbucket-logs/test --recursive --exclude "*" --include "logs2019-09-16*" --profile profile1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.&lt;/p&gt;

&lt;p&gt;Then, run this command to copy the files with names that begin with different dates, for example 2021-04-02 and 2021-04-03:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://samplebucket-logs/ s3://sampledestbucket-logs/logs-audit-april-2021/ --recursive --exclude "*" --include "logs2021-04-02*" --include "logs2021-04-03*" --profile profile1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, you can customize the following AWS CLI configurations to speed up the data transfer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;multipart_chunksize:&lt;/strong&gt; This value sets the size of each part that the AWS CLI uploads in a multipart upload for an individual file. This setting allows you to break down a larger file (for example, 300 MB) into smaller parts for quicker upload speeds.&lt;br&gt;
Note: A multipart upload requires that a single file is uploaded in not more than 10,000 distinct parts. You must be sure that the chunk size that you set balances the part file size and the number of parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;max_concurrent_requests:&lt;/strong&gt; This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10. You can increase it to a higher value like 50.&lt;br&gt;
Note: Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.&lt;/p&gt;
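
&lt;p&gt;The same two settings have SDK equivalents. A minimal boto3 sketch using TransferConfig, with placeholder bucket and key names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# 64 MB parts, up to 50 parallel requests; these mirror the
# multipart_chunksize and max_concurrent_requests CLI settings
config = TransferConfig(
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=50
)

s3.copy(
    CopySource={'Bucket': 'samplebucket-logs', 'Key': 'logs2021-04-02.tar'},
    Bucket='sampledestbucket-logs',
    Key='logs2021-04-02.tar',
    Config=config
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;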

&lt;blockquote&gt;
&lt;p&gt;Read more about --exclude and --include filters and how to use them: &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters"&gt;https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Use an AWS SDK&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Consider building a custom application using an AWS SDK to perform the data transfer for a very large number of objects. While the AWS CLI can perform the copy operation, a custom application might be more efficient at performing a transfer at the scale of hundreds of millions of objects.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Use cross-Region replication or same-Region replication&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;After you set up cross-Region replication (CRR) or same-Region replication (SRR) on the source bucket, Amazon S3 automatically and asynchronously replicates new objects from the source bucket to the destination bucket. You can choose to filter which objects are replicated using a prefix or tag. For more information on configuring replication and specifying a filter, see the Replication configuration overview.&lt;/p&gt;
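
&lt;p&gt;As a sketch of such a configuration, the boto3 call below enables a prefix-filtered replication rule on the source bucket. The role ARN and bucket names are placeholders, and both buckets must have versioning enabled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

s3 = boto3.client('s3')

# Both the source and destination buckets must have versioning enabled
s3.put_bucket_replication(
    Bucket='samplebucket-logs',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::123456789012:role/s3-replication-role',  # placeholder
        'Rules': [{
            'ID': 'replicate-log-objects',
            'Priority': 1,
            'Filter': {'Prefix': 'logs/'},  # replicate only this prefix
            'Status': 'Enabled',
            'DeleteMarkerReplication': {'Status': 'Disabled'},
            'Destination': {'Bucket': 'arn:aws:s3:::sampledestbucket-logs'}
        }]
    }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;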

&lt;p&gt;After replication is configured, only new objects are replicated to the destination bucket. Existing objects aren't replicated to the destination bucket. To replicate existing objects, you can run the following cp command after setting up replication on the source bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 cp s3://samplebucket-logs s3://sampledestbucket-logs --recursive --storage-class STANDARD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command copies objects in the source bucket back into the source bucket, which triggers replication to the destination bucket.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: It's a best practice to test the cp command in a non-production environment. Doing so allows you to configure the parameters for your exact use case.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Use Amazon S3 batch operations&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;You can use Amazon S3 batch operations to copy multiple objects with a single request. When you create a batch operation job, you specify which objects to perform the operation on using an Amazon S3 inventory report. Or, you can use a CSV manifest file to specify a batch job. Then, Amazon S3 batch operations call the API to perform the operation.&lt;/p&gt;

&lt;p&gt;After the batch operation job is complete, you get a notification and you can choose to receive a completion report about the job.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Use S3DistCp with Amazon EMR&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The S3DistCp operation on Amazon EMR can perform parallel copying of large volumes of objects across Amazon S3 buckets. S3DistCp first copies the files from the source bucket to the worker nodes in an Amazon EMR cluster. Then, the operation writes the files from the worker nodes to the destination bucket. For more guidance on using S3DistCp, see Seven tips for using S3DistCp on Amazon EMR to move data efficiently between HDFS and Amazon S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Because this option requires you to use Amazon EMR, be sure to review Amazon EMR pricing.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Use AWS DataSync&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To move large amounts of data from one Amazon S3 bucket to another bucket, perform the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Open the AWS DataSync console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a new location for Amazon S3.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select your S3 bucket as the source location.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the source location configuration settings. Make sure to specify the AWS Identity and Access Management (IAM) role that will be used to access your source S3 bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select your S3 bucket as the destination location.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the destination location configuration settings. Make sure to specify the AWS Identity and Access Management (IAM) role that will be used to access your destination S3 bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure settings for your task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Review the configuration details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose Create task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start your task.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;References:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/"&gt;https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>s3</category>
      <category>aws</category>
      <category>storage</category>
      <category>datasync</category>
    </item>
    <item>
      <title>Configuration As Code Using Amazon EC2 Systems Manager</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Fri, 24 Dec 2021 15:19:15 +0000</pubDate>
      <link>https://forem.com/spawar1991/configuration-as-code-using-amazon-ec2-systems-manager-3iej</link>
      <guid>https://forem.com/spawar1991/configuration-as-code-using-amazon-ec2-systems-manager-3iej</guid>
      <description>&lt;p&gt;Amazon EC2 Systems Manager (SSM) lets you configure, manage and automate your AWS and on-premises resources at scale. You can perform safe and secure operations without SSH access or bastion hosts using Systems Manager Run Command, mitigate configuration drift using Systems Manager State Manager, and create an access-controlled environment with full auditing. &lt;/p&gt;

&lt;p&gt;With SSM Documents, you can author your configurations as code and enable centralized management across accounts, enforcing best practices. Systems Manager provides a number of public documents for common management scenarios, or you can create your own document for deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Configuration as Code?
&lt;/h2&gt;

&lt;p&gt;Configuration as code is the practice of managing configuration files in a repository. Config files establish the parameters and settings for applications, operating systems, etc. By managing your config files alongside your code, you can help streamline your release pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Using Configuration as Code?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maintaining configuration changes as code allows to edit, update and create from the central location using a consistent deployment strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you write configuration as code, you can implement other operations like testing, scanning, and linting. Having config files reviewed and tested before they are committed ensures that changes follow your team’s standards. If you have a complex microservices architecture, this can keep your configurations stable and consistent. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traceability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By storing configuration as code in the repository, we get the benefit of tracking changes in code and config files. If a bug does slip in, you have the ability to trace the source of the problem. You can diff the versioned config files to see what went wrong and fix it quickly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Configuration As Code acts as a Single Source of Truth for your build pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Other things you can do :
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Execute various types of scripts&lt;/em&gt; written in Python, Ruby, or PowerShell. You can also run configurations such as Ansible playbooks. You can run pretty much anything on your instances as long as the software (e.g., Python 3.8 or Ansible) is installed on your instance and recognized by the shell on Linux and PowerShell on Windows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Download scripts&lt;/em&gt; stored in private or public GitHub repositories, or on Amazon S3 onto your instances for execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Run multiple files&lt;/em&gt; by downloading a complete GitHub directory or an S3 bucket.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Case:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Find the AWS-RunRemoteScript document for execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the AWS SSM console, in the navigation pane on the left, under Node Management, choose Run Command. Choose Run a Command, and then select the &lt;em&gt;AWS-RunRemoteScript&lt;/em&gt; document and the instances you want to execute this document on (either a list of instances or a tag query).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reference the Python script located on GitHub&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Enter the parameters for the &lt;em&gt;AWS-RunRemoteScript&lt;/em&gt; document to reference the Python script.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Source Type: Location of the script – GitHub, S3. In this case, choose GitHub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source Info: Provides location information for accessing the content. In this example, since the repository is private, you need to provide an access token from GitHub, along with the owner, repository, and the path to the Python script. We’ll download the script directory, which includes an example.py script file.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Run Python Script from private GitHub repository
&lt;/h2&gt;

&lt;p&gt;Now, I’ll show you how to execute scripts from private GitHub repositories. Let’s assume that the custom Python script in this example is stored in a private GitHub repository. To access this script, you need to create a personal access token on GitHub and store it in Amazon EC2 Systems Manager Parameter Store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create your GitHub personal access token
&lt;/h3&gt;

&lt;p&gt;Create a personal access token for your private GitHub repo to give Systems Manager access to the script. Personal API tokens provide a way for external systems to access information from your private GitHub repository. These tokens provide limited access to a subset of repository data as well as the ability to revoke access when needed. Create a personal access token following GitHub’s documentation, and then save the token value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Store the token in Parameter Store
&lt;/h3&gt;

&lt;p&gt;After creating the personal access token, go to Parameter Store on the SSM console. On the Parameter Store page (Under Application Management), create a parameter and add the token you created on GitHub here, in the Value text box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5olkxaflbzbmyc4f31gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5olkxaflbzbmyc4f31gs.png" alt="AWS SSM Parameter Store"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Reference the Python script located on GitHub
&lt;/h3&gt;

&lt;p&gt;Along with owner, repository, and path, we will add “tokenInfo” which refers to the example-token secure string parameter that we just created. The reference is made using the ssm-secure prefix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi0ahyt0t04moc5zvlzn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi0ahyt0t04moc5zvlzn.png" alt="SSM GitHub Source Config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
    "owner": "spawar1991",&lt;br&gt;
    "repository": "AWS-SSM-Demos",&lt;br&gt;
    "getOptions": "branch:dev",&lt;br&gt;
    "path": "scripts/python",&lt;br&gt;
    "tokenInfo": "{{ssm-secure:&amp;lt;Parameter-Store-Parameter-Name&amp;gt;}}"&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Select the targets where you want to execute the script and click on the run command.&lt;/p&gt;
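
&lt;p&gt;The same Run Command invocation can also be scripted. A boto3 sketch, assuming the repository layout above and a placeholder instance ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

ssm = boto3.client('ssm')

source_info = {
    "owner": "spawar1991",
    "repository": "AWS-SSM-Demos",
    "getOptions": "branch:dev",
    "path": "scripts/python",
    "tokenInfo": "{{ssm-secure:example-token}}"
}

response = ssm.send_command(
    InstanceIds=['i-0123456789abcdef0'],  # placeholder target instance
    DocumentName='AWS-RunRemoteScript',
    Parameters={
        'sourceType': ['GitHub'],
        'sourceInfo': [json.dumps(source_info)],
        'commandLine': ['python3 example.py']
    }
)
print(response['Command']['CommandId'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;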

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxx066e7d46qqjccybs7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxx066e7d46qqjccybs7.png" alt="Create Run Command Operation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1515ibb54uq7s0pji1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1515ibb54uq7s0pji1l.png" alt="Create Run Command Operation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to write command output to the S3 bucket or Cloudwatch Logs, under the Output options section, you can mention the log group.&lt;/p&gt;

&lt;p&gt;You can also view the Run Command output in the SSM console. Go to Systems Manager → Node Management → Run Command.&lt;/p&gt;

&lt;p&gt;Click on the “Command history” tab, open the last run command, and check the Targets &amp;amp; outputs section.&lt;/p&gt;

&lt;p&gt;Now, click on the Instance ID column and you’ll be able to see the execution steps with output and error sections. You can expand each section to see the output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yltvl3xcffkaoxmg09r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yltvl3xcffkaoxmg09r.png" alt="Run Command History Output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this post, I showed you how AWS Systems Manager is a management platform that lets you use your existing tools to manage your AWS resources and environments. I showed you how to use Systems Manager to run a Python script on your EC2 instances from a public or private GitHub repository. Using the AWS-RunRemoteScript public document and the aws:runShellScript plugin, you can run scripts in languages such as Python or Ruby, or even PowerShell scripts and modules.  &lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>devops</category>
      <category>aws</category>
      <category>github</category>
    </item>
    <item>
      <title>Google Cloud Run Combines Serverless with Containers</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Wed, 01 Sep 2021 13:34:26 +0000</pubDate>
      <link>https://forem.com/spawar1991/google-cloud-run-combines-serverless-with-containers-5433</link>
      <guid>https://forem.com/spawar1991/google-cloud-run-combines-serverless-with-containers-5433</guid>
      <description>&lt;p&gt;When it comes to managed Kubernetes services, Google Kubernetes Engine (GKE) is a great choice if you are looking for a container orchestration platform that offers advanced scalability and configuration flexibility. GKE gives you complete control over every aspect of container orchestration, from networking to storage, to how you set up observability—in addition to supporting stateful application use cases. &lt;/p&gt;

&lt;p&gt;However, if your application does not need that level of cluster configuration and monitoring, then a fully managed Cloud Run might be the right solution for you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cloud Run is a fully-managed compute environment for deploying and scaling serverless containerized microservices.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hClPF7Km--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vd95mnkj0b6xmthksmfx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hClPF7Km--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vd95mnkj0b6xmthksmfx.png" alt="image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fully managed Cloud Run is an ideal serverless platform for stateless containerized microservices that don’t require Kubernetes features like namespaces, co-location of containers in pods (sidecars), or node allocation and management.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;You must be thinking, Why Cloud Run?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Run is a fully managed compute environment for deploying and scaling serverless HTTP containers without worrying about provisioning machines, configuring clusters, or autoscaling.&lt;/p&gt;

&lt;p&gt;The managed serverless compute platform Cloud Run provides a number of features and benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy deployment of microservices.&lt;/strong&gt; A containerized microservice can be deployed with a single command without requiring any additional service-specific configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simple and unified developer experience.&lt;/strong&gt; Each microservice is implemented as a Docker image, Cloud Run’s unit of deployment. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalable serverless execution.&lt;/strong&gt; A microservice deployed into managed Cloud Run scales automatically based on the number of incoming requests, without having to configure or manage a full-fledged Kubernetes cluster. Managed Cloud Run scales to zero if there are no requests, i.e., uses no resources. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support for code written in any language.&lt;/strong&gt; Cloud Run is based on containers, so you can write code in any language, using any binary and framework.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No vendor lock-in&lt;/strong&gt; - Because Cloud Run takes standard OCI containers and implements the standard Knative Serving API, you can easily port over your applications to on-premises or any other cloud environment. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Split traffic&lt;/strong&gt; - Cloud Run enables you to split traffic between multiple revisions, so you can perform gradual rollouts such as canary deployments or blue/green deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic redundancy&lt;/strong&gt; - Cloud Run offers automatic redundancy so you don’t have to worry about creating multiple instances for high availability&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud Run is available in two configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fully managed Google Cloud Service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud Run for Anthos (this option deploys Cloud Run into an Anthos GKE cluster).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud Run is a layer that Google built on top of Knative to simplify deploying serverless applications on the Google Cloud Platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Google is one of the first public cloud providers to deliver a commercial service based on the open-source Knative project. Like the way it offered a managed Kubernetes service before any other provider, Google moved fast in exposing Knative through Cloud Run to developers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Knative has a set of building blocks for building a serverless platform on Kubernetes. But dealing with it directly doesn’t make developers efficient or productive. While it acts as the meta-platform running on the core Kubernetes infrastructure, the developer tooling and workflow are left to the platform providers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xiBcClU2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w9lhtcyf3b1hg2qlpu4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xiBcClU2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w9lhtcyf3b1hg2qlpu4e.png" alt="image" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does Cloud Run work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Run service can be invoked in the following ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTPS:&lt;/strong&gt; You can send HTTPS requests to trigger a Cloud Run-hosted service. Note that all Cloud Run services have a stable HTTPS URL. Some use cases include: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Custom RESTful web API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Private microservice&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HTTP middleware or reverse proxy for your web applications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prepackaged web application&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;gRPC:&lt;/strong&gt; You can use gRPC to connect Cloud Run services with other services—for example, to provide simple, high-performance communication between internal microservices. gRPC is a good option when you: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Want to communicate between internal microservices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support high data loads (gRPC uses protocol buffers, which are up to seven times faster than REST calls)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Need only a simple service definition and don't want to write a full client library&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use streaming gRPCs in your gRPC server to build more responsive applications and APIs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;WebSockets:&lt;/strong&gt; WebSockets applications are supported on Cloud Run with no additional configuration required. Potential use cases include any application that requires a streaming service, such as a chat application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger from Pub/Sub:&lt;/strong&gt; You can use Pub/Sub to push messages to the endpoint of your Cloud Run service, where the messages are subsequently delivered to containers as HTTP requests (a minimal handler sketch follows the list below). Possible use cases include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Transforming data after receiving an event upon a file upload to a Cloud Storage bucket&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing your Google Cloud operations suite logs with Cloud Run by exporting them to Pub/Sub&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publishing and processing your own custom events from your Cloud Run services&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
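
&lt;p&gt;To make the push model concrete, here is a minimal sketch of a Cloud Run service that handles a Pub/Sub push message, assuming a Python Flask app:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import base64
import os

from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def receive_pubsub():
    # Pub/Sub delivers the message as a JSON envelope over HTTP POST
    envelope = request.get_json()
    if not envelope or 'message' not in envelope:
        return 'Bad Request: invalid Pub/Sub envelope', 400

    message = envelope['message']
    data = base64.b64decode(message.get('data', '')).decode('utf-8')
    print(f'Received Pub/Sub message: {data}')

    # A 2xx response acknowledges the message; anything else triggers a retry
    return '', 204

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;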

&lt;p&gt;&lt;strong&gt;Running services on a schedule:&lt;/strong&gt; You can use Cloud Scheduler to securely trigger a Cloud Run service on a schedule. This is similar to using cron jobs. &lt;/p&gt;

&lt;p&gt;Possible use cases include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Performing backups on a regular basis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performing recurrent administration tasks, such as regenerating a sitemap or deleting old data, content, configurations, synchronizations, or revisions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generating bills or other documents&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Executing asynchronous tasks:&lt;/strong&gt; You can use Cloud Tasks to securely enqueue a task to be asynchronously processed by a Cloud Run service. &lt;/p&gt;

&lt;p&gt;Typical use cases include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Handling requests through unexpected production incidents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smoothing traffic spikes by delaying work that is not user-facing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reducing user response time by delegating slow background operations, such as database updates or batch processing, to be handled by another service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Limiting the call rate to backend services like databases and third-party APIs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Events from Eventarc:&lt;/strong&gt; You can trigger Cloud Run with events from more than 60 Google Cloud sources. For example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Use a Cloud Storage event (via Cloud Audit Logs) to trigger a data processing pipeline &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use a BigQuery event (via Cloud Audit Logs) to initiate downstream processing in Cloud Run each time a job is completed&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
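
&lt;p&gt;Eventarc delivers events to your service as HTTP POST requests in CloudEvents format, so a handler only needs to read the ce-* headers and the request body. A minimal sketch (assuming Python with Flask):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os

from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_event():
    # CloudEvents metadata arrives in ce-* headers; the payload is in
    # the request body.
    event_type = request.headers.get("ce-type", "unknown")
    subject = request.headers.get("ce-subject", "")
    print(f"Received {event_type} event for {subject}")
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
&lt;/code&gt;&lt;/pre&gt;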

&lt;h3&gt;
  
  
  &lt;strong&gt;How is Cloud Run different from Cloud Functions?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Run and Cloud Functions are both fully managed services that run on Google Cloud’s serverless infrastructure, auto-scale, and handle HTTP requests or events. They do, however, have some important differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cloud Functions lets you deploy snippets of code (functions) written in a limited set of programming languages, while Cloud Run lets you deploy container images using the programming language of your choice. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud Run also supports the use of any tool or system library from your application; Cloud Functions does not let you use custom executables. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud Run offers a longer request timeout of up to 60 minutes, while Cloud Functions request timeouts can be set as high as 9 minutes. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud Functions sends only one request at a time to each function instance, while by default Cloud Run is configured to send multiple concurrent requests to each container instance. This helps improve latency and reduce costs if you're expecting large request volumes. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you enjoyed this article, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://services.google.com/fh/files/misc/whitepaper_serverless_at_scale_2020.pdf"&gt;Whitepaper: Serverless At Scale &lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/3-cool-cloud-run-features-that-developers-love-and-that-you-will-too"&gt;3 cool Cloud Run features that developers love—and that you will too | Google Cloud Blog &lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/run"&gt;Cloud Run: Container to production in seconds  |  Google Cloud&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=DRyn--7cZWs"&gt;Serverless at Google (Cloud Next '19) &lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>containers</category>
      <category>serverless</category>
      <category>googlecloud</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Intro to Deployment Strategies: Blue-Green, Canary, and More</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Thu, 26 Aug 2021 17:36:46 +0000</pubDate>
      <link>https://forem.com/spawar1991/intro-to-deployment-strategies-blue-green-canary-and-more-16f3</link>
      <guid>https://forem.com/spawar1991/intro-to-deployment-strategies-blue-green-canary-and-more-16f3</guid>
      <description>&lt;p&gt;Whether we mean to or not, software deployments look different across organizations, teams, and applications. This can make pushing the deployment button feel like playing a game of craps: you roll the dice and try to stay alive. Luckily, there are a few ways to limit the variance in success. This blog post will discuss the different strategies and practices that can help you succeed with your production deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Strategies to Consider
&lt;/h2&gt;

&lt;p&gt;Deployment strategies are practices used to change or upgrade a running instance of an application. The following sections will explain six deployment strategies. Let’s start with discussing the basic deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Basic Deployment
&lt;/h3&gt;

&lt;p&gt;In a basic deployment, all nodes within a target environment are updated at the same time with the new service or artifact version. Because of this, basic deployments are not outage-proof and slow down rollback processes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HoTbJlP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cqkx6fe2f8izukigcmov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HoTbJlP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cqkx6fe2f8izukigcmov.png" alt="image" width="704" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
The benefits of this strategy are that it is simple, fast, and cheap. Use this strategy if 1) your application service is not business, mission, or revenue-critical, or 2) your deployment is to a lower environment, during off-hours, or with a service that is not in use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
Of all the deployment strategies shared, it is the riskiest and does not fall into best practices. Basic deployments are not outage-proof and do not provide for easy rollbacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Multi-Service Deployment
&lt;/h3&gt;

&lt;p&gt;In a multi-service deployment, all nodes within a target environment are updated with multiple new services simultaneously. This strategy is used for application services that have service or version dependencies, or if you’re deploying off-hours to resources that are not in use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FV3Um7E6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5yscqnn1p11hfrdr968.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FV3Um7E6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5yscqnn1p11hfrdr968.png" alt="image" width="707" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
Multi-service deployments are simple, fast, cheap, and not as risky as a basic deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
Multi-service deployments are slow to roll back and not outage-proof. Using this deployment strategy also leads to difficulty in managing, testing, and verifying all the service dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rolling Deployment
&lt;/h3&gt;

&lt;p&gt;A rolling deployment is a deployment strategy that updates running instances of an application with the new release. All nodes in a target environment are incrementally updated with the new service or artifact version in N batches. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5WUd738C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1a8wkng43xv1zxrp81lv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5WUd738C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1a8wkng43xv1zxrp81lv.png" alt="image" width="703" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
The benefits of a rolling deployment are that it is relatively simple to implement, simple to roll back, and less risky than a basic deployment. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
Since nodes are updated in batches, rolling deployments require services to support both new and old versions of an artifact. Verification of an application deployment at every incremental change also makes this deployment slow.&lt;/p&gt;
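
&lt;p&gt;To make the batching concrete, here is a tool-agnostic sketch in Python; the deploy and is_healthy callables are hypothetical placeholders for your platform's actual deploy and health-check operations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def rolling_deploy(nodes, new_version, batch_size, deploy, is_healthy):
    """Update nodes in batches, verifying each batch before continuing."""
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            deploy(node, new_version)
        # Old and new versions serve traffic side by side here, which is
        # why rolling deployments require version compatibility.
        if not all(is_healthy(node) for node in batch):
            raise RuntimeError(f"Batch at node {start} unhealthy; roll back")
&lt;/code&gt;&lt;/pre&gt;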

&lt;h3&gt;
  
  
  Blue-Green Deployment
&lt;/h3&gt;

&lt;p&gt;Blue-green deployment is a deployment strategy that utilizes two identical environments, a “blue” (aka staging) and a “green” (aka production) environment, running different versions of an application or service. Quality assurance and user acceptance testing are typically done within the blue environment, which hosts the new version or changes. Once the new changes have been tested and accepted, user traffic is shifted from the green environment to the blue environment, which then becomes production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qie1B2_q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s5f70y16mdnxxguj671e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qie1B2_q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s5f70y16mdnxxguj671e.png" alt="image" width="704" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
Blue-green deployment is simple, fast, well-understood, and easy to implement. Rollback is also straightforward: you can simply flip traffic back to the old environment if issues arise. Blue-green deployments are therefore less risky than many other deployment strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
Cost is a drawback to blue-green deployments. Replicating a production environment can be complex and expensive, especially when working with microservices. Quality assurance and user acceptance testing may not identify all of the anomalies or regressions either, and so shifting all user traffic at once can present risks. An outage or issue could also have a wide-scale business impact before a rollback is triggered, and depending on the implementation, in-flight user transactions may be lost when the shift in traffic is made.&lt;/p&gt;
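
&lt;p&gt;Mechanically, the release reduces to repointing a router between two environments. A toy sketch in Python (the Router class and the test hook are hypothetical stand-ins for a real load balancer and test suite):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class Router:
    """Toy stand-in for a load balancer's backend pointer."""

    def __init__(self, live_env):
        self.live_env = live_env

    def switch_to(self, env):
        self.live_env = env

def blue_green_release(router, idle_env, run_acceptance_tests):
    # Test the new version in the idle environment, then flip traffic.
    if not run_acceptance_tests(idle_env):
        raise RuntimeError("Acceptance tests failed; traffic not shifted")
    previous = router.live_env
    router.switch_to(idle_env)
    # Keeping the old environment warm enables near-instant rollback.
    return previous
&lt;/code&gt;&lt;/pre&gt;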

&lt;h3&gt;
  
  
  Canary Deployment
&lt;/h3&gt;

&lt;p&gt;A canary deployment is a deployment strategy that releases an application or service incrementally to a subset of users. All infrastructure in a target environment is updated in small phases (e.g., 2%, 25%, 75%, 100%). Because of this control, a canary release is the least risky of all the deployment strategies.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GGxInEg---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52hl7dzxm3tt2wyuqie0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GGxInEg---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52hl7dzxm3tt2wyuqie0.png" alt="image" width="705" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
Canary deployments allow organizations to test in production with real users and use cases and compare different service versions side by side. It’s cheaper than a blue-green deployment because it does not require two production environments. And finally, it is fast and safe to trigger a rollback to a previous version of an application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
Drawbacks to canary deployments involve testing in production and the implementations needed. Scripting a canary release can be complex: manual verification or testing can take time, and the required monitoring and instrumentation for testing in production may involve additional research.&lt;/p&gt;
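
&lt;p&gt;The phased rollout itself is simple to express; the complexity lives in the hooks. In this Python sketch, set_canary_weight and canary_is_healthy are hypothetical placeholders for your traffic-splitting and monitoring integrations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

def canary_rollout(set_canary_weight, canary_is_healthy,
                   phases=(2, 25, 75, 100), soak_seconds=300):
    """Shift traffic to the canary in phases, checking health each time."""
    for percent in phases:
        set_canary_weight(percent)
        time.sleep(soak_seconds)  # let real traffic exercise the canary
        if not canary_is_healthy():
            set_canary_weight(0)  # fast, safe rollback to the stable version
            raise RuntimeError(f"Canary unhealthy at {percent}%; rolled back")
&lt;/code&gt;&lt;/pre&gt;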

&lt;h3&gt;
  
  
  A/B Testing
&lt;/h3&gt;

&lt;p&gt;In A/B testing, different versions of the same service run simultaneously as “experiments” in the same environment for a period of time. Experiments are controlled by feature-flag toggles, A/B testing tools, or distinct service deployments. It is the experiment owner’s responsibility to define how user traffic is routed to each experiment and version of an application. Commonly, user traffic is routed based on specific rules or user demographics to perform measurements and comparisons between service versions. Target environments can then be updated with the optimal service version. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4gRHqqBz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9bkayw1v18vqrcu8tfpx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4gRHqqBz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9bkayw1v18vqrcu8tfpx.png" alt="image" width="704" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest difference between A/B testing and other deployment strategies is that A/B testing is primarily focused on experimentation and exploration. While other deployment strategies deploy many versions of a service to an environment with the immediate goal of updating all nodes with a specific version, A/B testing is about testing multiple ideas rather than deploying one specific tested idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
A/B testing is a standard, easy, and cheap method for testing new features in production. And luckily, there are many tools that exist today to help enable A/B testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
The drawbacks to A/B testing involve the experimental nature of its use case. Experiments and tests can sometimes break the application, service, or user experience. Finally, scripting or automating A/B tests can also be complex.&lt;/p&gt;
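
&lt;p&gt;One common building block is deterministic user bucketing, sketched below in Python (the experiment name and traffic split are illustrative); hashing keeps each user's assignment stable across requests without storing any state:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

def assign_variant(user_id, experiment, percent_in_b=50):
    """Deterministically route a user to variant A or B."""
    # Hash the user ID together with the experiment name so the same
    # user can land in different buckets across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "B" if bucket &amp;lt; percent_in_b else "A"
&lt;/code&gt;&lt;/pre&gt;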

&lt;h3&gt;
  
  
  Which Deployment Strategy Should I Use?
&lt;/h3&gt;

&lt;p&gt;Now that we know the different deployment techniques, a common question is: which deployment strategy should I use? The answer depends on the type of application you have and your target environment.&lt;/p&gt;

&lt;p&gt;Based on conversations with Harness customers, most teams use blue-green or canary deployments for mission-critical web applications. Customers see little to no business impact when migrating from a blue-green deployment strategy to a canary deployment strategy. It’s also common for teams to build their own strategy by combining the strategies shared in this blog post. For example, some customers do multi-service canary deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Eliminating After-Hours Deployments
&lt;/h3&gt;

&lt;p&gt;Software delivery is challenging. Anyone who has a deployment horror story can attest to this. One way to eliminate toil and spend time and effort where it really matters is to leverage deployment strategies and practices that help with operationalizing our services.&lt;/p&gt;

&lt;p&gt;Some practices or standards to consider implementing include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service-specific deployment checklists&lt;/li&gt;
&lt;li&gt;Continuous Integration (CI) and Continuous Delivery (CD)&lt;/li&gt;
&lt;li&gt;Well-defined and understood environments&lt;/li&gt;
&lt;li&gt;Build automation tooling&lt;/li&gt;
&lt;li&gt;Configuration management tools&lt;/li&gt;
&lt;li&gt;Communication channels like Slack&lt;/li&gt;
&lt;li&gt;An on-call or incident response playbook&lt;/li&gt;
&lt;li&gt;Automated rollbacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good portion of these practices can help with server or service downtime, software bugs, continuous feedback, and new application deployments. Aside from creating a foundation for better software delivery, there are also opportunities to leverage automation alongside our metrics and monitoring tools through the practice of Continuous Verification (CV).&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous Verification
&lt;/h3&gt;

&lt;p&gt;Continuous Verification uses data from your monitoring and operational tool stacks to take action based on the performance and quality of application deployments. With our customers who use Harness today, CV helps with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deployment Verification – a verify step inside the deployment pipeline that supports auto-rollback and manual rollback as failure strategies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;24×7 Service Guard – always-on change impact analysis that measures overall service health and correlates it with deployments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There can be many questions about operationalizing deployments across different tools, dependencies, and environments. Automating some of these challenges away is the next generation of scaling and simplifying software delivery.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>deployment</category>
      <category>automation</category>
      <category>software</category>
    </item>
    <item>
      <title>Want To Learn MLOps?</title>
      <dc:creator>Swapnil Pawar</dc:creator>
      <pubDate>Mon, 16 Aug 2021 06:44:50 +0000</pubDate>
      <link>https://forem.com/spawar1991/want-to-learn-mlops-4gj9</link>
      <guid>https://forem.com/spawar1991/want-to-learn-mlops-4gj9</guid>
      <description>&lt;p&gt;&lt;strong&gt;MLOps&lt;/strong&gt; is not a piece of cake. Especially in today’s changing environment. There are many challenges—construction, integrating, testing, releasing, deployment, and infrastructure management. You need to follow good practices and know-how to adjust to the challenges.&lt;/p&gt;

&lt;p&gt;Being an emerging field, MLOps is rapidly gaining momentum amongst Data Scientists, ML Engineers, and AI enthusiasts. Following this trend, the Continuous Delivery Foundation SIG MLOps differentiates ML model management from traditional software engineering and suggests the following MLOps capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MLOps aims to unify the release cycle for machine learning and software application releases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLOps enables automated testing of machine learning artifacts (e.g. data validation, ML model testing, and ML model integration testing).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLOps enables the application of agile principles to machine learning projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLOps treats machine learning models and the datasets used to build them as first-class citizens within CI/CD systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLOps reduces technical debt across machine learning models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLOps must be a language-, framework-, platform-, and infrastructure-agnostic practice.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if you don’t learn and develop your knowledge, you’ll fall out of the loop. The right resources can help you follow the best practices, discover helpful tips, and learn about the latest trends.&lt;/p&gt;

&lt;p&gt;You don’t have to look far. Here’s your list of the best go-to resources about MLOps—books, articles, podcasts, and more. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let’s dive in!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://www.oreilly.com/library/view/introducing-mlops/9781492083283/"&gt;Introducing MLOps&lt;/a&gt; from O’Reilly&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---oOAjEmA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/itygwd2ccvvi4s6g32sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---oOAjEmA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/itygwd2ccvvi4s6g32sa.png" alt="Alt Text" width="250" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Introducing MLOps: How to Scale Machine Learning in the Enterprise is a book written by Mark Treveil and the Dataiku Team (collective authors). It introduces the key concepts of MLOps, shows how to maintain and improve ML models over time, and tackles the challenges of MLOps.&lt;/p&gt;

&lt;p&gt;The book is divided into three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An introduction to the topic of MLOps, how and why it has developed as a discipline, who needs to be involved to execute MLOps successfully, and what components are required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A walk through the machine learning model life cycle, with chapters on developing models, preparing for production, deploying to production, monitoring, and governance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tangible examples of how MLOps looks in companies today, so readers can understand the setup and implications in practice.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;[2] &lt;a href="https://www.oreilly.com/library/view/what-is-mlops/9781492093626/"&gt;What Is MLOps?&lt;/a&gt; from O’Reilly&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SvE1qXH---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tmkyrzpi5ejspftvvak7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SvE1qXH---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tmkyrzpi5ejspftvvak7.png" alt="Alt Text" width="250" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Is MLOps? Generating Long-Term Value from Data Science &amp;amp; Machine Learning by Mark Treveil and Lynn Heidmann is a thorough report for business leaders who want to understand and learn about MLOps as a process for generating long-term value while reducing the risk associated with data science, ML, and AI projects.&lt;/p&gt;

&lt;p&gt;Here’s what the report includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Detailed components of ML model building, including how business insights can provide value to the technical team&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring and iteration steps in the AI project lifecycle, and the role business plays in both processes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How components of a modern AI governance strategy are intertwined with MLOps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guidelines for aligning people, defining processes, and assembling the technology necessary to get started with MLOps&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;[3] &lt;a href="https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning"&gt;Google Cloud&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MLOps: Continuous delivery and automation pipelines in machine learning&lt;/strong&gt; is a document from Google that “discusses techniques for implementing and automating continuous integration (CI), continuous delivery (CD), and continuous training (CT) for machine learning (ML) systems.”&lt;/p&gt;

&lt;p&gt;If you’re new to MLOps, this document can be a great source of knowledge, as it touches on some basic concepts. But if you’re an MLOps veteran, you’ll also find it helpful for refreshing and solidifying your knowledge. It can also help you reliably build and operate ML systems at scale.&lt;/p&gt;




&lt;p&gt;[4] Awesome MLOps and production machine learning GitHub lists&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://github.com/sindresorhus/awesome"&gt;Awesome list&lt;/a&gt; is a thematic curated catalog of resources, hosted in the form of a GitHub repository containing only a README file.&lt;/p&gt;

&lt;p&gt;In our case, two very useful lists are the &lt;a href="https://github.com/visenger/awesome-mlops"&gt;Awesome MLOps&lt;/a&gt; and the &lt;a href="https://github.com/EthicalML/awesome-production-machine-learning"&gt;Awesome Production Machine Learning&lt;/a&gt;. While the former focuses on learning resources, the latter complements it with an emphasis on tooling.&lt;/p&gt;

&lt;p&gt;These lists are useful when you already have a comprehensive view of the MLOps field and you would like to specialize in a given subdomain, such as model serving and monitoring.&lt;/p&gt;




&lt;p&gt;[5] &lt;a href="https://mlsys.stanford.edu/"&gt;Stanford MLSys Seminar Series&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Stanford MLSys Seminar Series is, as the name suggests, a series of seminars focused on machine learning and ML systems—tools and all the technology used for programming machine learning models.&lt;/p&gt;




&lt;p&gt;[6] &lt;a href="https://github.com/visenger/awesome-mlops"&gt;Awesome MLOps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an awesome list of references for MLOps (Machine Learning Operations) from ml-ops.org.&lt;/p&gt;

&lt;p&gt;It’s a list of links to numerous resources, from books and articles to communities and much more. In a word, it has everything you could possibly read about MLOps. The table of contents includes, among others: MLOps Papers, Talks About MLOps, Existing ML Systems, Machine Learning, Software Engineering, Product Management for ML/AI, The Economics of ML/AI, Model Governance, Ethics, and Responsible AI.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://towardsdatascience.com/what-is-mlops-everything-you-must-know-to-get-started-523f2d0b8bd8"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://builtin.com/machine-learning/mlops"&gt;Road To MLOps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neptune.ai/blog/category/mlops"&gt;MLOps - neptune.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ml-ops.org/"&gt;ml-ops.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ml-ops.org/content/references.html"&gt;ml-ops Books&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://fall2019.fullstackdeeplearning.com/"&gt;Full Stack Deep Learning &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/the-3-best-free-online-resources-to-learn-mlops-54816904f485"&gt;3-best-free-online-resources-to-learn-mlops&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neptune.ai/blog/learn-mlops-books-articles-podcasts"&gt;Where Can You Learn About MLOps? What Are the Best Books, Articles, or Podcasts to Learn MLOps? - neptune.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-in/services/machine-learning/mlops/#resources"&gt;Machine Learning Operations – MLOps | Microsoft Azure&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://analyticsindiamag.com/7-best-resources-to-learn-mlops-in-2021/"&gt;7 Best Resources To Learn MLOps In 2021&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/seattledataguy/mlops-and-machine-learning-roadmap-o7p"&gt;@SeattledataGuy&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>operations</category>
      <category>dataengineering</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
