<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sachin Agarwal</title>
    <description>The latest articles on Forem by Sachin Agarwal (@sachinkagarwal).</description>
    <link>https://forem.com/sachinkagarwal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F75245%2F1cb96f98-9105-4ebd-b1b5-8d669af3f318.jpg</url>
      <title>Forem: Sachin Agarwal</title>
      <link>https://forem.com/sachinkagarwal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sachinkagarwal"/>
    <language>en</language>
    <item>
      <title>Cloud - aws, gcp, azure, openstack; automation (Terraform, Salt, Fabric, etc.) and performance testing/optimization.</title>
      <dc:creator>Sachin Agarwal</dc:creator>
      <pubDate>Sat, 25 Aug 2018 21:23:40 +0000</pubDate>
      <link>https://forem.com/sachinkagarwal/-cloud---aws-gcp-azure-openstack-automation-terraform-salt-fabric-etc-and-performance-testingoptimization-2gpl</link>
      <guid>https://forem.com/sachinkagarwal/-cloud---aws-gcp-azure-openstack-automation-terraform-salt-fabric-etc-and-performance-testingoptimization-2gpl</guid>
      <description></description>
    </item>
    <item>
      <title>Choosing Cloud Providers and Virtual Machines: The Easy Way</title>
      <dc:creator>Sachin Agarwal</dc:creator>
      <pubDate>Mon, 30 Jul 2018 14:22:10 +0000</pubDate>
      <link>https://forem.com/sachinkagarwal/choosing-cloud-providers-and-virtual-machines-the-easy-way-9f6</link>
      <guid>https://forem.com/sachinkagarwal/choosing-cloud-providers-and-virtual-machines-the-easy-way-9f6</guid>
      <description>&lt;p&gt;&lt;em&gt;The tool may not render correctly on lower-resolution mobile device screens; for the best experience please use a desktop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://tools.bigbitbus.com/optimizer/" rel="noopener noreferrer"&gt;VM and cloud provider optimization tool&lt;/a&gt; displays the relative CPU performance of different virtual machines across different providers. The current tool covers many VMs of up to 16 vCPUs from Amazon AWS, Google GCP, Microsoft Azure and Digital Ocean. CPU performance is reported based on our VM CPU benchmarking tests. Other VM characteristics, such as the number of vCPUs, RAM, and cost, have been extracted via provider APIs or (web-)scraped from cloud provider documentation; these are periodically updated to reflect the latest data available from providers.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;The user interface has a series of input controls on the left-hand side and a stacked-bar chart in the center of the screen. The stacked bars show the CPU utilization of the corresponding VM (in pink) and the amount of "unutilized" CPU (in green). Information about the VM, such as the number of vCPUs, RAM and cost, is printed across each bar. When the user changes the input controls, the back-end is queried and the lowest-cost VMs that satisfy the constraints are rendered on the stacked-bar plot.&lt;/p&gt;

&lt;p&gt;The key feature of the tool is that the CPU utilization of each VM depicted on the stacked-bar chart is representative of the "same" workload being applied to each VM. For example, if the CPU utilization (pink bar) is 50% for VM_1 and 25% for VM_2, then applying the &lt;em&gt;same workload&lt;/em&gt; to VM_1 and VM_2 will produce CPU utilizations of 50% and 25% respectively. This underlying property lets us compare different VMs' CPU characteristics.&lt;/p&gt;
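&lt;p&gt;&lt;em&gt;The scaling property described above can be captured in a few lines. This is a hypothetical sketch of the idea, not the tool's actual implementation; the function name and capacity scores below are made up for illustration.&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical illustration of the utilization-scaling idea described above.
# "Capacity" stands in for a relative CPU benchmark score (higher = faster);
# these names and numbers are assumptions, not the tool's actual code.

def equivalent_utilization(baseline_util, baseline_capacity, target_capacity):
    """Estimate the CPU utilization a target VM would show when running the
    same workload that drives the baseline VM to baseline_util percent."""
    # The absolute work done is baseline_util * baseline_capacity;
    # the target VM absorbs that same work with its own capacity.
    return baseline_util * baseline_capacity / target_capacity

# If VM_1 (capacity 100) runs the workload at 50% utilization, a VM_2 that
# benchmarks twice as fast (capacity 200) would sit near 25% utilization.
print(equivalent_utilization(50.0, 100.0, 200.0))  # 25.0
```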

&lt;p&gt;
&lt;b&gt;Fig.1: The BigBitBus Cloud Provider and VM Optimizer Tool User Interface &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz1mzps5ak6o902wvd01f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz1mzps5ak6o902wvd01f.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;Input Controls on the User Interface&lt;/h3&gt;

&lt;p&gt;The user can query the performance and catalogue data we have collected by changing the settings on the left-side panels. The different settings are explained below:&lt;/p&gt;

&lt;h4&gt;Cloud Provider and Baseline Machine&lt;/h4&gt;

&lt;p&gt;Choose the baseline virtual machine. The cloud provider dropdown is used to filter available VMs by provider; the baseline machine can then be selected from the presented choices. We are constantly working to increase the size of this catalog. The top-most bar on the chart always corresponds to the baseline VM.&lt;/p&gt;

&lt;h4&gt;CPU Utilization&lt;/h4&gt;

&lt;p&gt;There are two sliders in this box. The CPU utilization slider represents the CPU utilization of the baseline VM. For example, if your monitoring dashboards indicate that the baseline VM has a peak CPU utilization of 70% then you should set this slider to 70%.&lt;/p&gt;

&lt;p&gt;The second slider sets the maximum CPU utilization of the target VMs (the alternatives to the baseline) that the tool will return. For example, if you want the CPU utilization of the target VMs to stay below 50%, set this slider to 50%. If you are willing to push the CPU utilization higher, set this slider to a high percentage and the tool will find smaller (and usually cheaper) VMs that run "hotter" when servicing the applied workload.&lt;/p&gt;

&lt;h4&gt;Target Clouds&lt;/h4&gt;

&lt;p&gt;The Target clouds check-boxes let you select which cloud providers' VMs are returned by the tool. For example, if you only want to consider AWS VMs, select the AWS check-box and unselect all other cloud provider check-boxes.&lt;/p&gt;

&lt;h4&gt;Refine Search&lt;/h4&gt;

&lt;p&gt;The minimum and maximum number of vCPUs and amount of RAM can be constrained using these range sliders. Only target machines that satisfy these constraints will be displayed.&lt;/p&gt;
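&lt;p&gt;&lt;em&gt;A minimal sketch of the kind of constraint filtering the controls above drive. The catalog entries, field names, and function below are illustrative assumptions, not the tool's actual back-end.&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical catalog of benchmarked VMs; fields and values are made up.
catalog = [
    {"name": "a1", "provider": "aws",   "vcpus": 2, "ram_gb": 4,  "cost": 0.02, "util": 40},
    {"name": "g1", "provider": "gcp",   "vcpus": 4, "ram_gb": 8,  "cost": 0.03, "util": 30},
    {"name": "z1", "provider": "azure", "vcpus": 8, "ram_gb": 16, "cost": 0.08, "util": 20},
]

def find_targets(catalog, providers, max_util, vcpu_range, ram_range, limit=9):
    """Return the cheapest VMs that satisfy every constraint, up to `limit`."""
    matches = [
        vm for vm in catalog
        if vm["provider"] in providers                      # target-cloud check-boxes
        and vm["util"] <= max_util                          # max CPU utilization slider
        and vcpu_range[0] <= vm["vcpus"] <= vcpu_range[1]   # vCPU range slider
        and ram_range[0] <= vm["ram_gb"] <= ram_range[1]    # RAM range slider
    ]
    return sorted(matches, key=lambda vm: vm["cost"])[:limit]

print([vm["name"] for vm in find_targets(catalog, {"aws", "gcp"}, 50, (2, 8), (4, 16))])
# ['a1', 'g1']
```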

&lt;h2&gt;Use-cases&lt;/h2&gt;

&lt;p&gt;We illustrate possible uses of the tool through a few use-cases.&lt;/p&gt;

&lt;h3&gt;Use-case 1: Switching a Cloud Provider&lt;/h3&gt;

&lt;p&gt;A user who wishes to move the baseline VM from one cloud provider to another can use the tool to find the performance and cost of analogous VMs on different cloud providers. Select the baseline VM and choose the target clouds, along with any CPU utilization or vCPU/RAM constraints, and the tool will return up to 9 target VMs that satisfy all the requested constraints at the lowest cost.&lt;/p&gt;

&lt;h3&gt;Use-case 2: Lowering costs by down-sizing a VM&lt;/h3&gt;

&lt;p&gt;If a user finds that a VM is idling (low CPU utilization), they can select this VM type as the baseline VM and set the CPU utilization slider to that low value. Then, from the options displayed in the chart, they can select a smaller VM that is better utilized by the same workload. This is a great way to reduce cloud spend.&lt;/p&gt;

&lt;h3&gt;Use-case 3: Choosing a VM with greater CPU headroom&lt;/h3&gt;

&lt;p&gt;If a user finds that their VMs are running hot (high CPU utilization), the tool can help find appropriate VMs that run cooler. Instead of guessing and switching to a much bigger VM (and ending up with unnecessarily low utilization), the user can set the "maximum CPU utilization" slider to the desired peak "hotness" and the tool will return the lowest-cost VMs that satisfy the constraints.&lt;/p&gt;

&lt;h2&gt;FAQs&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Why did the tool stop responding after a while?&lt;/em&gt; &lt;br&gt;
We throttle requests to protect our servers. If you hit the throttle limit, please wait an hour before using the tool again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Why are other cloud providers not included in the comparison?&lt;/em&gt;&lt;br&gt;
We are working toward integrating more providers' data into our system; please check back as we expand our provider coverage. If you represent a cloud provider, please contact us so we can accelerate onboarding your offerings into our tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Why don't I see data for the entire catalogue of the providers?&lt;/em&gt;&lt;br&gt;
We currently focus on VMs with 16 or fewer vCPUs (since these comprise the vast majority of deployed VMs); we have also excluded high-memory and storage-optimized VMs (since all our testing is currently CPU-based); please check back as we expand our VM coverage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;How accurate is the tool?&lt;/em&gt;&lt;br&gt;
Generic CPU benchmarks, such as the ones that form the basis of this tool, are rarely representative of actual production workloads' performance. The tool's data gives a "rule-of-thumb" or "back-of-envelope" comparison between different VMs, useful for quickly whittling down the myriad VM choices across cloud providers. We encourage users to thoroughly test short-listed VMs against their own workloads so they can switch VMs and providers with confidence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;I found an inconsistency/bug in the tool. How do I report it?&lt;/em&gt;&lt;br&gt;
Fantastic! The tool is new and in beta testing; please help us by emailing any bugs, ideas, comments, or concerns to &lt;em&gt;&lt;a href="mailto:contact@bigbitbus.com"&gt;contact@bigbitbus.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Which cloud provider datacenters/regions were used in our testing?&lt;/em&gt;&lt;br&gt;
We primarily used eastern US cloud regions to perform all performance testing; cost data is also limited to this region. Giving users the ability to select specific data-centers in different cloud provider regions is on our product road map.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;What prices are shown by the tool?&lt;/em&gt;&lt;br&gt;
We show retail costs (no discounts). We are aware that cloud providers offer sustained usage discounts, volume discounts, negotiated customer-specific discounts and other promotions to customers. Allowing users to apply such discounts to the cost numbers shown in the tool is on our product road map.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Does the tool compare other characteristics like IO and network latency?&lt;/em&gt;&lt;br&gt;
This tool compares VMs on the basis of CPU utilization. Building analogous tools for IO and network comparisons is on our product road map.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;What is the &lt;a href="https://tools.bigbitbus.com/comparer/" rel="noopener noreferrer"&gt;VM Comparer&lt;/a&gt; tool link at the top of the page?&lt;/em&gt;&lt;br&gt;
We are working on another tool to compare two VMs across different cloud providers. This tool is still in alpha as we collect better data. Feel free to give it a spin, and please help us by emailing any bugs, ideas, comments, or concerns to &lt;em&gt;&lt;a href="mailto:contact@bigbitbus.com"&gt;contact@bigbitbus.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Back to the &lt;a href="https://tools.bigbitbus.com/optimizer/" rel="noopener noreferrer"&gt;VM and cloud provider optimization tool&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;BigBitBus is on a mission to bring greater transparency in public cloud and managed big data and analytics services.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>performance</category>
    </item>
    <item>
      <title>Public Cloud Object-store Performance is Very Unequal across AWS S3, Google Cloud Storage, and Azure Blob Storage</title>
      <dc:creator>Sachin Agarwal</dc:creator>
      <pubDate>Wed, 06 Jun 2018 13:05:26 +0000</pubDate>
      <link>https://forem.com/sachinkagarwal/public-cloud-object-store-performance-is-very-unequal-across-aws-s3-google-cloud-storage-and-azure-blob-storage-13do</link>
      <guid>https://forem.com/sachinkagarwal/public-cloud-object-store-performance-is-very-unequal-across-aws-s3-google-cloud-storage-and-azure-blob-storage-13do</guid>
      <description>&lt;p&gt;&lt;em&gt;For this article we compared the object-store performance of Amazon Web Services (S3), Google Cloud Storage and Microsoft Azure Blobs in locally redundant configurations (without geo-replication). We found very significant performance differences that can have a direct impact on user applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Object_storage"&gt;Object or blob store&lt;/a&gt; services on the cloud offer content-addressable storage where users can save arbitrary files that can be accessed via a URL over HTTP(S) connections and simple CRUD semantics (GET to download, PUT to upload, etc.). Object storage is convenient and cheap, and this has made it the storage back-end of choice for everything from small configuration files of less than a few kilobytes to huge VM images or backup archives. It is also the most common storage option for persisting raw data files used in big data analyses.&lt;/p&gt;

&lt;p&gt;Lower object-store latency (time to upload and download files) is important in many use cases. For example, the time taken to download a backup copy of a database will be the dominant factor in the recovery time objective for disaster recovery planning. Big data applications such as Apache Spark may seem sluggish if the back-end object-store hosting raw data has a high file-serving latency. Many applications repeatedly and frequently read and write small files to object stores (e.g. image thumbnails); these will benefit from lower-latency small-object performance.&lt;/p&gt;

&lt;p&gt;Our key findings are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Large blob downloads are significantly slower (up to 4x) in Azure as compared to Google cloud storage or AWS S3 large object downloads.&lt;/li&gt;
&lt;li&gt;Small-sized Azure blobs have lower upload latency.&lt;/li&gt;
&lt;li&gt;In general the (relatively newer) Canadian regions have lower latency for object store operations as compared to the older US east regions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Setup&lt;/h2&gt;

&lt;p&gt;We set up locally redundant object-store buckets for AWS S3, Google cloud storage, and Azure blob storage in a cloud region and created one virtual machine (per provider) in the same cloud region. By "locally redundant" we mean that the objects were not geo-replicated to another region; we will analyze geo-replicated objects in another article.&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.1: Test Setup for locally-redundant object-store testing. We report the upload and download latency of the client putting/getting objects to/from the object-store. &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_LqrjBVe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mf327ygf9n11vfihr5p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_LqrjBVe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mf327ygf9n11vfihr5p3.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;A load tester virtual machine was loaded with our custom-built open-source benchmarking program, &lt;a href="https://github.com/bigbitbus/objectbench"&gt;object bench&lt;/a&gt;, which can upload and download different-sized randomly-generated files to and from the object-stores. The tool uses the Python SDKs from each of the providers (so the client implementation is strictly per provider standards). It was set up to serially upload and download different-sized randomly-generated files (ranging from 1kB to 100MB in size). We repeated the experiment 100 times and all our results are averaged over these 100 runs; we also show error bars in our plots.&lt;/p&gt;
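&lt;p&gt;&lt;em&gt;The serial upload/download loop described above can be sketched as follows. This is an illustrative re-creation, not the actual objectbench code: the bucket name, object sizes, and run count are placeholders, and it covers only AWS S3 via boto3 (the real tool exercises all three providers' SDKs).&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative sketch of the serial benchmark loop; not the objectbench code.
# Bucket name, sizes, and run count are placeholder assumptions.
import os
import time

def timed(op, *args, **kwargs):
    """Run op and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = op(*args, **kwargs)
    return time.perf_counter() - start, result

def benchmark_s3(bucket, sizes=(1_000, 100_000, 1_000_000), runs=3):
    """Serially upload and download random payloads, averaging latencies."""
    import boto3  # requires AWS credentials configured in the environment
    s3 = boto3.client("s3")
    latencies = {}
    for size in sizes:
        payload = os.urandom(size)  # randomly-generated file contents
        ups, downs = [], []
        for i in range(runs):
            key = f"bench-{size}-{i}"
            elapsed, _ = timed(s3.put_object, Bucket=bucket, Key=key, Body=payload)
            ups.append(elapsed)
            elapsed, obj = timed(s3.get_object, Bucket=bucket, Key=key)
            obj["Body"].read()  # drain the stream so the download completes
            downs.append(elapsed)
            s3.delete_object(Bucket=bucket, Key=key)
        latencies[size] = (sum(ups) / runs, sum(downs) / runs)
    return latencies
```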

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;We measured latency as seen by an application that uploads and downloads objects from the object-store. We present results for a US east region and a Canadian region for each provider (the exact names differ across providers). By selecting two different regions for each provider we eliminated the possibility of a bad load-testing VM client or a badly configured object store in a specific region. We also unearthed performance differences between regions within the same cloud provider; users looking for the best performance on public cloud object-stores should carefully benchmark performance differences across regions before choosing one. &lt;em&gt;Not all&lt;/em&gt; cloud regions have the same performance.&lt;/p&gt;

&lt;h3&gt;US Region&lt;/h3&gt;

&lt;p&gt;We chose &lt;em&gt;us-east-1&lt;/em&gt;, &lt;em&gt;us-east1&lt;/em&gt; and &lt;em&gt;eastus&lt;/em&gt; regions for AWS, Google cloud and Azure respectively (collectively referred to as USEast in the below plots). The load testing VMs were spun up in one of the zones belonging to these regions for each cloud provider.&lt;/p&gt;

&lt;h4&gt;Small object sizes&lt;/h4&gt;

&lt;p&gt;Figs.2 and 3 show small-object upload and download latencies in US East regions. The Azure blob store offers significantly lower upload latency as compared to AWS S3 or Google Cloud Storage. It's hard to explain the stark difference without knowing the implementation. We have a controversial hypothesis - perhaps uploads (writes) to the Azure blob store are cached in memory (to be persisted on disk later), with the acknowledgement sent immediately to the uploading client.&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.2: Small objects (up to 100kB) upload latency in US East &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cJ6qu6G_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/282keg8wgh9e0pi8pi2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cJ6qu6G_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/282keg8wgh9e0pi8pi2g.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.3: Small objects (up to 100kB) download latency in US East &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U7eBRGoO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/woziab8o8f0w5l0tikvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U7eBRGoO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/woziab8o8f0w5l0tikvt.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;Large object sizes&lt;/h4&gt;

&lt;p&gt;Figs.4 and 5 show large-object upload and download latencies in US East regions. The performance of all three object-stores is very similar for uploads. The strikingly slower Azure download is the highlight here (Fig. 5). We think this is a serious problem in Azure - especially for the backup/restore use-case. The data shows that a 100MB object takes over 4 seconds to download from the Azure blob-store, as compared to ~1 second from Google cloud storage. Serially downloading a 10GB backup set composed of 100 such 100MB objects will take over 400 seconds in Azure as compared to only about 100 seconds in Google cloud. That is a huge hit on the recovery time objective for Azure users.&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.4: Large objects (1MB - 100MB) upload latency in US East&lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DZw7_KAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9bgstzn0dovi5qgdyrx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DZw7_KAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9bgstzn0dovi5qgdyrx2.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.5: Large objects (1MB - 100MB) download latency in US East&lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MMYZJKYP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rur5rkj3ucke8ljgczrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MMYZJKYP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rur5rkj3ucke8ljgczrs.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;Object Deletion&lt;/h4&gt;

&lt;p&gt;Fig.6 shows the deletion latency for different-sized objects. The notable feature here is the consistency in the Google cloud (GCP) numbers.&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.6: Object deletion latency in US East&lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hFPCfWLS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/i9pi0ngex518jat6yd8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hFPCfWLS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/i9pi0ngex518jat6yd8f.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;Canadian Region&lt;/h3&gt;

&lt;p&gt;We repeated all the above experiments in Canadian public cloud regions. Figs.7-11 show the corresponding Canadian region numbers. Notice the different Y-axis scales on some of these graphs; in general the latency numbers are lower in Canadian regions than in US East regions. We hypothesize that this is because of the relative newness and lower utilization of the Canadian regions. The same superior small-object performance and dismal large-blob download performance of Azure blobs was seen in these results as well.&lt;/p&gt;

&lt;p&gt;We chose &lt;em&gt;ca-central-1&lt;/em&gt;, &lt;em&gt;northamerica-northeast1&lt;/em&gt; and &lt;em&gt;canadacentral&lt;/em&gt; regions for AWS, Google cloud and Azure respectively (collectively referred to as Canada in the below plots).&lt;/p&gt;

&lt;h4&gt;Small object sizes&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Fig.7: Small objects (up to 100kB) upload latency in Canada &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4GISMGn_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2795vznfyngmdc8f7tgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4GISMGn_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2795vznfyngmdc8f7tgf.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.8: Small objects (up to 100kB) download latency in Canada &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eaeK3B2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ph7b1dy5prcx24fiiumr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eaeK3B2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ph7b1dy5prcx24fiiumr.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;Large object sizes&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Fig.9: Large objects (1MB - 100MB) upload latency in Canada&lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jm9cNtWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/23ra3dkgxzqanaz9cobh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jm9cNtWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/23ra3dkgxzqanaz9cobh.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Fig.10: Large objects (1MB - 100MB) download latency in Canada&lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KJAzWXhB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r4jdwlxu5wc14ug9bmge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KJAzWXhB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r4jdwlxu5wc14ug9bmge.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;Object Deletion&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Fig.11: Object deletion latency in Canada &lt;/b&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UodV2nAJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cnjthcm4e04swb8mzssb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UodV2nAJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cnjthcm4e04swb8mzssb.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;Outlook&lt;/h2&gt;

&lt;p&gt;The latency metrics reported in this article are critical for many user applications. Our results show a clear disadvantage for large-object operations on the Azure blob store - restoring backups, downloading large media files and VM images, and so on. The Azure service wins for small object sizes - uploads were consistently faster than to the AWS S3 and Google cloud storage object stores. Object deletion time is important for applications that update, save and delete a large number of temporary objects; we were impressed by the consistency of the Google cloud deletion times compared to the other object stores.&lt;/p&gt;

&lt;p&gt;Our aim was to capture performance differences due to different object-store implementations. Given the differences across implementations, we hope the engineering teams behind these services will tune and improve their systems to bring them on par with the best.&lt;/p&gt;

&lt;p&gt;Stay tuned as we investigate geo-replicated object performance, cold-storage object stores and object metadata performance in this series.&lt;/p&gt;


&lt;p&gt;&lt;em&gt;This article was first published at &lt;a href="http://www.bigbitbus.com"&gt;www.bigbitbus.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sachin Agarwal is a computer systems researcher and the founder of BigBitBus.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;BigBitBus is on a mission to bring greater transparency in public cloud and managed big data and analytics services.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>aws</category>
      <category>python</category>
    </item>
  </channel>
</rss>
