<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: vaibhav bedi</title>
    <description>The latest articles on Forem by vaibhav bedi (@vaibhav_bedi_82eeb9670233).</description>
    <link>https://forem.com/vaibhav_bedi_82eeb9670233</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3613058%2Fc67b19f1-b5ad-49d6-961a-a5fea968e76d.png</url>
      <title>Forem: vaibhav bedi</title>
      <link>https://forem.com/vaibhav_bedi_82eeb9670233</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vaibhav_bedi_82eeb9670233"/>
    <language>en</language>
    <item>
      <title>Cloud Automation: Stop Clicking Buttons and Start Shipping Faster</title>
      <dc:creator>vaibhav bedi</dc:creator>
      <pubDate>Sat, 15 Nov 2025 22:19:23 +0000</pubDate>
      <link>https://forem.com/vaibhav_bedi_82eeb9670233/cloud-automation-stop-clicking-buttons-and-start-shipping-faster-2ble</link>
      <guid>https://forem.com/vaibhav_bedi_82eeb9670233/cloud-automation-stop-clicking-buttons-and-start-shipping-faster-2ble</guid>
      <description>&lt;p&gt;If you're still manually clicking through cloud portals to provision resources, you're working too hard. Cloud automation isn't just a nice-to-have anymore - it's the difference between shipping features quickly and spending your Friday nights babysitting deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Manual Cloud Management
&lt;/h2&gt;

&lt;p&gt;Let me paint a familiar picture. You need to spin up a new environment. You log into AWS or Azure, click through a dozen screens, copy settings from production (hopefully correctly), configure networking, set up security groups, provision databases, configure monitoring, and two hours later you're done. Then someone asks you to do it again for staging. And again for the QA environment.&lt;/p&gt;

&lt;p&gt;Manual processes don't scale. They're error-prone, inconsistent, and honestly boring. You became a developer to write code, not to be a professional button-clicker.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Cloud Automation Actually Means
&lt;/h2&gt;

&lt;p&gt;Cloud automation means using code and tools to provision, configure, and manage your cloud infrastructure without human intervention. Instead of clicking through a portal, you write a script or configuration file that describes what you want, and the automation tool makes it happen.&lt;/p&gt;

&lt;p&gt;This applies to everything: virtual machines, databases, storage buckets, networking, security policies, monitoring alerts, and even user permissions. If you can create it manually, you can automate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code: The Foundation
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code (IaC) is where cloud automation starts. You describe your infrastructure in files that can be versioned, reviewed, and reused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform&lt;/strong&gt; is the most popular cross-cloud option. You write HCL configuration files that describe your infrastructure, and Terraform figures out how to create it. It works across AWS, Azure, GCP, and hundreds of other providers. The same skillset works everywhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web_server"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-0c55b159cbfafe1f0"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"web-server"&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CloudFormation&lt;/strong&gt; (AWS), &lt;strong&gt;ARM templates&lt;/strong&gt; (Azure), and &lt;strong&gt;Deployment Manager&lt;/strong&gt; (GCP) are cloud-specific options. They're deeply integrated with their respective clouds but lock you into that ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pulumi&lt;/strong&gt; lets you write infrastructure code in real programming languages like Python, TypeScript, or Go instead of learning a DSL. If you prefer actual code over configuration files, Pulumi might be your thing.&lt;/p&gt;

&lt;p&gt;Pick one tool and get good at it. Which one you choose matters less than the commitment - the principles are the same across all of them. Just use IaC for everything new.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Management: Beyond Provisioning
&lt;/h2&gt;

&lt;p&gt;Provisioning infrastructure is only half the battle. You still need to configure the OS, install software, apply security patches, and manage ongoing changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible&lt;/strong&gt; is straightforward and agentless. You write YAML playbooks that describe the desired state, and Ansible makes it happen over SSH. No agents to install or maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chef&lt;/strong&gt; and &lt;strong&gt;Puppet&lt;/strong&gt; are more traditional configuration management tools. They're powerful but have a steeper learning curve. Both use agents running on managed nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-Init&lt;/strong&gt; is built into most cloud VM images. It handles initial configuration when an instance first boots. Great for basic setup tasks that only run once.&lt;/p&gt;

&lt;p&gt;For containerized workloads, configuration management looks different. You bake configuration into container images or use Kubernetes ConfigMaps and Secrets. The container orchestration platform handles the rest.&lt;/p&gt;
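
&lt;p&gt;A minimal sketch of that pattern - a ConfigMap with illustrative names that pods can consume via &lt;code&gt;envFrom&lt;/code&gt; or a mounted volume:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config        # hypothetical name
data:
  LOG_LEVEL: "info"
  FEATURE_FLAGS: "checkout-v2"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Keeping settings in a ConfigMap instead of the image means you can change configuration without rebuilding and redeploying containers.&lt;/p&gt;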

&lt;h2&gt;
  
  
  CI/CD Pipelines: Automation in Motion
&lt;/h2&gt;

&lt;p&gt;Your infrastructure code is worthless if you're running it manually. CI/CD pipelines automate the entire deployment process from code commit to production.&lt;/p&gt;

&lt;p&gt;A typical pipeline for infrastructure changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developer commits infrastructure code changes&lt;/li&gt;
&lt;li&gt;CI system runs validation and linting&lt;/li&gt;
&lt;li&gt;Automated tests verify the changes work&lt;/li&gt;
&lt;li&gt;Pipeline creates a plan showing what will change&lt;/li&gt;
&lt;li&gt;After approval, pipeline applies changes to staging&lt;/li&gt;
&lt;li&gt;Automated tests verify staging works&lt;/li&gt;
&lt;li&gt;After validation, pipeline deploys to production&lt;/li&gt;
&lt;li&gt;Monitoring confirms everything is healthy&lt;/li&gt;
&lt;/ol&gt;
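
&lt;p&gt;The validation and plan steps above can be sketched as a GitHub Actions workflow - the directory layout here is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: terraform-plan
on:
  pull_request:
    paths: ["infra/**"]        # hypothetical layout

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check
      - run: terraform init -input=false
      - run: terraform validate
      - run: terraform plan -input=false
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The apply stage would be a separate job gated on approval, typically using a protected environment.&lt;/p&gt;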

&lt;p&gt;&lt;strong&gt;GitHub Actions&lt;/strong&gt;, &lt;strong&gt;GitLab CI&lt;/strong&gt;, &lt;strong&gt;Jenkins&lt;/strong&gt;, &lt;strong&gt;CircleCI&lt;/strong&gt;, &lt;strong&gt;Azure DevOps&lt;/strong&gt; - they all work. Pick what integrates with your existing tools.&lt;/p&gt;

&lt;p&gt;The key is that humans write code and approve changes, but machines do the actual deployment work. No manual steps, no forgotten configurations, no "works on my machine" problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-Scaling: Let Demand Drive Resources
&lt;/h2&gt;

&lt;p&gt;Why pay for capacity you're not using? Auto-scaling automatically adjusts resources based on actual demand.&lt;/p&gt;

&lt;p&gt;Cloud providers offer built-in auto-scaling for compute resources. Define minimum and maximum instance counts, set scaling policies based on CPU, memory, or custom metrics, and let the platform handle it.&lt;/p&gt;

&lt;p&gt;Kubernetes takes this further with Horizontal Pod Autoscaling and Cluster Autoscaling. Pods scale based on resource usage or custom metrics. The cluster itself scales nodes up or down based on pod demands.&lt;/p&gt;
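
&lt;p&gt;As a concrete example, a Horizontal Pod Autoscaler targeting average CPU utilization takes only a few lines (the deployment name and thresholds are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;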

&lt;p&gt;Serverless takes auto-scaling to the extreme. Functions scale automatically from zero to thousands of concurrent executions. You literally don't think about capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automated Backup and Disaster Recovery
&lt;/h2&gt;

&lt;p&gt;Hope is not a backup strategy. Automate your backups so you don't have to remember to take them.&lt;/p&gt;

&lt;p&gt;Most cloud services offer automated backup options. Enable them. Set retention policies. Test restores regularly. Automate the restore testing too - if you can't restore, your backups are useless.&lt;/p&gt;

&lt;p&gt;Infrastructure as Code makes disaster recovery easier. Your infrastructure is defined in code, so recreating it in another region or account is just running your automation again. Add data replication and you have a solid DR strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Optimization Through Automation
&lt;/h2&gt;

&lt;p&gt;Cloud bills can spiral out of control. Automation helps keep them in check.&lt;/p&gt;

&lt;p&gt;Schedule non-production environments to shut down outside business hours. A simple script can stop instances at 6 PM and start them at 8 AM on weekdays. That's 118 hours of savings every week.&lt;/p&gt;
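
&lt;p&gt;The weekly figure is easy to sanity-check. A quick sketch, assuming an 8 AM to 6 PM weekday schedule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope check of the off-hours savings claim.
# Assumes instances run 8 AM-6 PM on weekdays and are stopped otherwise.

HOURS_PER_WEEK = 7 * 24  # 168

def weekly_off_hours(start_hour=8, stop_hour=18, business_days=5):
    """Hours per week an instance is stopped under the schedule."""
    on_hours = (stop_hour - start_hour) * business_days
    return HOURS_PER_WEEK - on_hours

print(weekly_off_hours())  # 118 of 168 hours stopped
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Out of 168 hours in a week, only 50 are business hours, so each stopped instance saves 118 instance-hours.&lt;/p&gt;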

&lt;p&gt;Right-sizing scripts analyze actual resource usage and recommend smaller instance types. Run these monthly and adjust accordingly.&lt;/p&gt;

&lt;p&gt;Automated cleanup removes unused resources. Tag everything with creation dates and owners. Scripts can identify resources that haven't been used in 90 days and either delete them or flag them for review.&lt;/p&gt;
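
&lt;p&gt;The filtering logic itself is simple. A minimal sketch, assuming each inventory record carries a last-used timestamp (the field and resource names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch: flag resources that look unused, assuming each
# inventory entry carries a 'last_used' timestamp (e.g. from tags).
from datetime import datetime, timedelta
from operator import ge  # ge(a, b) is a greater-or-equal check

STALE_AFTER = timedelta(days=90)

def stale_resources(resources, now):
    """Return names of resources not used for STALE_AFTER or longer."""
    flagged = []
    for r in resources:
        if ge(now - r["last_used"], STALE_AFTER):
            flagged.append(r["name"])
    return flagged

inventory = [
    {"name": "old-test-vm", "last_used": datetime(2024, 8, 1)},
    {"name": "prod-db", "last_used": datetime(2024, 12, 20)},
]
print(stale_resources(inventory, datetime(2025, 1, 1)))  # ['old-test-vm']
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;In practice you'd feed this from your cloud provider's inventory or tagging API and route the flagged names to a review queue rather than deleting immediately.&lt;/p&gt;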

&lt;p&gt;Reserved instances and savings plans require commitment, but automation can analyze usage patterns and recommend optimal purchases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Automation: Shift Left
&lt;/h2&gt;

&lt;p&gt;Security can't be an afterthought. Build it into your automation from the start.&lt;/p&gt;

&lt;p&gt;Scan infrastructure code for security issues before deployment. Tools like tfsec, Checkov, and Terrascan find problems in Terraform code. They integrate into CI/CD pipelines to block insecure configurations.&lt;/p&gt;

&lt;p&gt;Automate compliance checks. Cloud Custodian, AWS Config Rules, and Azure Policy continuously monitor resources and enforce compliance. Resources that violate policies get automatically remediated or flagged.&lt;/p&gt;

&lt;p&gt;Secret management should be automated. Never hardcode credentials. Use tools like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault. Rotate secrets automatically on a schedule.&lt;/p&gt;
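
&lt;p&gt;At the application level, the pattern looks like this - a minimal sketch in Python, where &lt;code&gt;DB_PASSWORD&lt;/code&gt; is a hypothetical variable name injected at deploy time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: read credentials from the environment instead of
# hardcoding them. DB_PASSWORD is a hypothetical variable that a
# secrets manager (Vault, AWS Secrets Manager, Azure Key Vault) or
# the platform would inject at deploy time.
import os

def get_db_password():
    secret = os.environ.get("DB_PASSWORD")
    if secret is None:
        raise RuntimeError("DB_PASSWORD is not set; refusing to start")
    return secret
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Failing fast when the secret is missing beats silently starting with an empty credential.&lt;/p&gt;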

&lt;h2&gt;
  
  
  Monitoring and Alerting: Close the Loop
&lt;/h2&gt;

&lt;p&gt;Automation without monitoring is flying blind. You need to know when things break.&lt;/p&gt;

&lt;p&gt;Instrument everything. Logs, metrics, and traces should be automatically collected from all resources. Use agents or native integrations - just make sure everything reports somewhere central.&lt;/p&gt;

&lt;p&gt;Automated alerting based on anomalies catches problems you didn't anticipate. Traditional threshold alerts are good, but anomaly detection finds unusual patterns that might indicate issues.&lt;/p&gt;

&lt;p&gt;Auto-remediation takes monitoring further. When certain alerts fire, trigger automated responses. Disk full? Auto-scale storage. Service unresponsive? Restart it automatically. Document and test these remediations carefully - you don't want automation making things worse.&lt;/p&gt;
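
&lt;p&gt;Conceptually, auto-remediation is a dispatch table with a human fallback. A minimal sketch (alert types and actions are illustrative, not a real API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of an auto-remediation dispatcher: known alert types
# map to automated fixes; everything else falls back to paging a human.

def expand_disk(alert):
    return "expanded disk on " + alert["resource"]

def restart_service(alert):
    return "restarted service on " + alert["resource"]

def page_oncall(alert):
    return "paged on-call for " + alert["resource"]

REMEDIATIONS = {
    "disk_full": expand_disk,
    "service_unresponsive": restart_service,
}

def remediate(alert):
    handler = REMEDIATIONS.get(alert["type"], page_oncall)
    return handler(alert)

print(remediate({"type": "disk_full", "resource": "vm-01"}))
print(remediate({"type": "certificate_expiring", "resource": "gw-02"}))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The default-to-human fallback is the safeguard: automation only acts on failure modes you have explicitly tested.&lt;/p&gt;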

&lt;h2&gt;
  
  
  Real-World Example: Full Stack Automation
&lt;/h2&gt;

&lt;p&gt;Here's how this comes together. An e-commerce company I worked with automated their entire deployment pipeline.&lt;/p&gt;

&lt;p&gt;Infrastructure is defined in Terraform. Developers change code, the pipeline runs &lt;code&gt;terraform plan&lt;/code&gt;, shows what will change, and after approval applies it. New application versions trigger Docker builds. The pipeline pushes images to a registry, updates Kubernetes manifests, and deploys to staging.&lt;/p&gt;

&lt;p&gt;Automated tests run against staging. If they pass, the pipeline waits for human approval, then deploys to production using a blue-green deployment. If error rates spike, automated rollback reverts to the previous version.&lt;/p&gt;

&lt;p&gt;Non-production environments shut down at night and on weekends. Cost optimization scripts run weekly and recommend right-sizing. Security scans happen on every commit. Compliance checks run continuously.&lt;/p&gt;

&lt;p&gt;The result? They deploy 20+ times per day with minimal manual intervention. Deployment time dropped from hours to minutes. Incidents decreased because configurations are consistent. New developers become productive faster because everything is documented in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: Don't Boil the Ocean
&lt;/h2&gt;

&lt;p&gt;You don't need to automate everything at once. Start small and build momentum.&lt;/p&gt;

&lt;p&gt;Pick one repetitive task that drives you crazy. Automate that first. Maybe it's creating development environments or deploying a specific application. Get that working, learn from it, then move to the next task.&lt;/p&gt;

&lt;p&gt;Use existing modules and templates. Don't reinvent the wheel. Terraform Registry, AWS Solutions Library, and Azure Quickstart Templates provide battle-tested starting points.&lt;/p&gt;

&lt;p&gt;Document your automation. Future you will thank present you when something breaks at 2 AM and you need to remember how it works.&lt;/p&gt;

&lt;p&gt;Version control everything. Your infrastructure code should live in Git alongside your application code. Use pull requests, code reviews, and all the same practices you use for application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools Don't Matter (Much)
&lt;/h2&gt;

&lt;p&gt;People get religious about tools. Terraform versus CloudFormation, Ansible versus Chef, AWS versus Azure. These debates miss the point.&lt;/p&gt;

&lt;p&gt;The specific tools matter less than the practice of automation itself. Pick tools that work for your team and your cloud provider. Learn them deeply. The principles transfer even if you switch tools later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Over-automation&lt;/strong&gt;: Don't automate tasks that rarely change, and don't automate critical operations without proper safeguards. Start with safe, repeatable tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor error handling&lt;/strong&gt;: Automation fails. Build in proper error handling, logging, and alerting so you know when things go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No testing&lt;/strong&gt;: Test your automation in non-production environments first. Use plan/preview features to see what will change before applying it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring drift&lt;/strong&gt;: Resources changed outside automation create drift. Either prevent manual changes through policies or regularly reconcile drift back to the desired state.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Cloud automation transforms how you work. You spend less time on repetitive tasks and more time on things that matter. Deployments become faster and more reliable. Costs stay under control. Security improves.&lt;/p&gt;

&lt;p&gt;The initial investment in learning automation tools pays off quickly. Yes, writing Terraform takes longer than clicking through a portal the first time. But the tenth time? The hundredth time? Automation wins decisively.&lt;/p&gt;

&lt;p&gt;Start automating today. Pick one task, automate it, and build from there. Your future self will thank you.&lt;/p&gt;

&lt;p&gt;What are you automating in your cloud environment? Share your wins (and failures) in the comments.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>cloud</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Troubleshooting Real-World Network Outages in Microsoft Azure</title>
      <dc:creator>vaibhav bedi</dc:creator>
      <pubDate>Sat, 15 Nov 2025 22:16:52 +0000</pubDate>
      <link>https://forem.com/vaibhav_bedi_82eeb9670233/troubleshooting-real-world-network-outages-in-microsoft-azure-12j2</link>
      <guid>https://forem.com/vaibhav_bedi_82eeb9670233/troubleshooting-real-world-network-outages-in-microsoft-azure-12j2</guid>
      <description>&lt;p&gt;Network outages in Azure can be stressful. One minute everything's running smoothly, the next you're getting alerts that your application is unreachable. I've been through enough of these incidents to know that having a systematic approach makes all the difference between panic and resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 AM Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;Picture this: your monitoring alerts are going off, your application isn't responding, and you need to figure out what's wrong. Fast. Azure's network stack is powerful but complex, with virtual networks, subnets, network security groups, route tables, and service endpoints all playing together. When something breaks, knowing where to look is half the battle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start With the Basics
&lt;/h2&gt;

&lt;p&gt;Before diving into Azure-specific tools, verify the obvious stuff. I know it sounds basic, but I've seen too many incidents where we skipped this and wasted time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can you reach the Azure portal?&lt;/strong&gt; If you can't, the issue might be on your end or a broader Azure service disruption. Check the Azure Status page first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is your service actually down?&lt;/strong&gt; Sometimes monitoring gets it wrong. Try accessing your application from different locations or networks. Your office network might have issues while everything else works fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check the Azure Service Health dashboard.&lt;/strong&gt; Microsoft might already know about the outage. Navigate to Service Health in the portal and check for any incidents in your region.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Security Groups: The Silent Killers
&lt;/h2&gt;

&lt;p&gt;NSGs are probably responsible for more outages than people want to admit. They're easy to misconfigure and the results are immediate.&lt;/p&gt;

&lt;p&gt;Open your NSG in the portal and check the inbound and outbound rules. Look for recently modified rules - someone might have made a change that broke connectivity. Azure keeps an activity log, so you can see who changed what and when.&lt;/p&gt;

&lt;p&gt;Use the &lt;strong&gt;IP Flow Verify&lt;/strong&gt; tool in Network Watcher. This tool tells you whether traffic is allowed or denied between two points, and which NSG rule is responsible. It's saved me hours of manual rule checking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Route Tables and Unexpected Paths
&lt;/h2&gt;

&lt;p&gt;Routes control where traffic goes, and a misconfigured route table can send your traffic into a black hole. This happens more often than you'd think, especially after someone adds a new subnet or makes changes to a firewall appliance.&lt;/p&gt;

&lt;p&gt;Check your route tables through the portal or use Azure CLI. Look for routes with a next hop type of "None" - these explicitly drop traffic. Also watch for routes that point to network virtual appliances that might be down.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Next Hop&lt;/strong&gt; tool in Network Watcher shows you exactly where traffic will go for a given source and destination. Use it to trace your traffic path and find where things go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS Issues Are Network Issues Too
&lt;/h2&gt;

&lt;p&gt;DNS problems look like network outages but they're actually resolution failures. Your application can't reach the database because it can't resolve the hostname to an IP address.&lt;/p&gt;

&lt;p&gt;If you're using Azure Private DNS zones, verify that your VNet is actually linked to the zone. A missing link means your VMs can't resolve private DNS names.&lt;/p&gt;

&lt;p&gt;Check your DNS servers in the VNet configuration. Custom DNS servers are a common source of problems. If you've configured custom DNS, make sure those servers are reachable and functioning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Endpoints and Private Endpoints
&lt;/h2&gt;

&lt;p&gt;These features are great for security but they add complexity. If your application suddenly can't reach Azure Storage or SQL Database, check whether service endpoints or private endpoints are involved.&lt;/p&gt;

&lt;p&gt;Service endpoints route traffic to Azure services through the Microsoft backbone network instead of the internet. They require specific configuration on both the VNet subnet and the Azure service. Missing either side breaks connectivity.&lt;/p&gt;

&lt;p&gt;Private endpoints create a private IP address for an Azure service inside your VNet. If someone deleted or misconfigured a private endpoint, your application loses access. Check the Private Link Center to see all your private endpoints and their status.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Watcher Connection Monitor
&lt;/h2&gt;

&lt;p&gt;This tool continuously monitors connectivity between resources. If you haven't set it up before an outage, you can still use the &lt;strong&gt;Connection Troubleshoot&lt;/strong&gt; feature to test connectivity right now.&lt;/p&gt;

&lt;p&gt;Connection Troubleshoot checks whether a VM can reach another VM, an external endpoint, or an Azure service. It shows you the exact path traffic takes and where it fails. It checks NSGs, routes, and even effective routes to give you a complete picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Application Gateway and Load Balancer Health
&lt;/h2&gt;

&lt;p&gt;Your networking might be fine but your load balancer's backend pool could be unhealthy. Check the backend health in Application Gateway or Load Balancer. If all backends are showing as unhealthy, the issue might be with your health probes, not the actual backends.&lt;/p&gt;

&lt;p&gt;Common health probe failures include incorrect probe paths, wrong ports, or overly aggressive timeout settings. Review your probe configuration and test manually using curl or a browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effective Security Rules
&lt;/h2&gt;

&lt;p&gt;Azure applies NSG rules at both the subnet and NIC level. The combination of these rules determines what traffic is actually allowed. Use the &lt;strong&gt;Effective Security Rules&lt;/strong&gt; view in the portal to see the final set of rules that apply to a specific NIC. This resolves the confusion when you have NSGs at multiple levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Scenario: The Disappearing Database
&lt;/h2&gt;

&lt;p&gt;Here's a recent example. The application suddenly couldn't connect to Azure SQL Database. The portal showed the database was running, there were no Azure service issues, and application logs showed connection timeouts.&lt;/p&gt;

&lt;p&gt;Started with IP Flow Verify - traffic was allowed by NSGs. Checked routes - everything looked normal. Then checked the SQL Database firewall rules. Someone had removed the subnet's service endpoint access during a cleanup task. Adding the subnet back to the firewall rules restored connectivity immediately.&lt;/p&gt;

&lt;p&gt;The lesson? Always check service-specific firewall rules, not just network-level security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagnostic Logs Are Your Friend
&lt;/h2&gt;

&lt;p&gt;Enable diagnostic logging for your network resources. NSG flow logs show you exactly what traffic is being allowed or denied, in near real time. This data goes to a storage account or Log Analytics workspace where you can query it.&lt;/p&gt;

&lt;p&gt;You can use Log Analytics queries to find patterns. Looking for a spike in denied connections? Query your NSG flow logs. Want to see if traffic is actually reaching your subnet? Flow logs have the answer.&lt;/p&gt;
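
&lt;p&gt;For example, a query like this surfaces the NSG rules denying the most traffic - the table and field names assume the Traffic Analytics schema and may differ in your workspace:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;AzureNetworkAnalytics_CL
| where SubType_s == "FlowLog" and FlowStatus_s == "D"
| summarize DeniedFlows = count() by NSGRule_s, bin(TimeGenerated, 5m)
| order by DeniedFlows desc
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;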

&lt;h2&gt;
  
  
  The Recovery Checklist
&lt;/h2&gt;

&lt;p&gt;When you're in the middle of an outage, having a checklist helps you stay systematic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify the outage is real and affects users&lt;/li&gt;
&lt;li&gt;Check Azure Service Health for known issues&lt;/li&gt;
&lt;li&gt;Review recent changes in the Activity Log&lt;/li&gt;
&lt;li&gt;Verify NSG rules using IP Flow Verify&lt;/li&gt;
&lt;li&gt;Check route tables using Next Hop&lt;/li&gt;
&lt;li&gt;Test DNS resolution&lt;/li&gt;
&lt;li&gt;Verify service endpoint and private endpoint configurations&lt;/li&gt;
&lt;li&gt;Check load balancer backend health&lt;/li&gt;
&lt;li&gt;Review service-specific firewall rules&lt;/li&gt;
&lt;li&gt;Check diagnostic logs for patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prevention Is Better Than 3 AM Fixes
&lt;/h2&gt;

&lt;p&gt;Set up Connection Monitor to continuously test critical paths. Enable NSG flow logs. Use Azure Policy to prevent certain risky configurations. Tag your resources properly so you can track relationships.&lt;/p&gt;

&lt;p&gt;Document your network architecture. When things break, you need to quickly understand what connects to what. A simple diagram saves time when you're troubleshooting under pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Azure network troubleshooting gets easier with experience, but having the right tools and approach makes a huge difference. Network Watcher is your best friend. The Activity Log shows you what changed. And sometimes the issue is as simple as a checkbox someone unchecked.&lt;/p&gt;

&lt;p&gt;The next time you face a network outage, take a breath, work through the checklist, and use the tools Azure gives you. You'll find the problem faster than you think.&lt;/p&gt;

&lt;p&gt;What's your worst Azure network outage story? Drop it in the comments - we've all been there.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>azure</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Azure vs OCI Load Balancers &amp; Traffic Routing: What Actually Matters</title>
      <dc:creator>vaibhav bedi</dc:creator>
      <pubDate>Sat, 15 Nov 2025 22:13:16 +0000</pubDate>
      <link>https://forem.com/vaibhav_bedi_82eeb9670233/azure-vs-oci-load-balancers-traffic-routing-what-actually-matters-2eg</link>
      <guid>https://forem.com/vaibhav_bedi_82eeb9670233/azure-vs-oci-load-balancers-traffic-routing-what-actually-matters-2eg</guid>
      <description>&lt;h1&gt;
  
  
  Azure vs OCI Load Balancers &amp;amp; Traffic Routing: What Actually Matters
&lt;/h1&gt;

&lt;p&gt;Load balancers are one of those things everyone uses but nobody really thinks about until something breaks or the bill arrives. I've been working with both Azure and OCI load balancers for different projects, and the approaches are different enough that it's worth getting into the weeds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Product Lineup
&lt;/h2&gt;

&lt;p&gt;Let's start with what you're actually choosing from, because both clouds have multiple load balancing services and the naming isn't always helpful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure gives you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Load Balancer (L4, regional)&lt;/li&gt;
&lt;li&gt;Azure Application Gateway (L7, regional, includes WAF)&lt;/li&gt;
&lt;li&gt;Azure Front Door (L7, global, CDN + routing)&lt;/li&gt;
&lt;li&gt;Traffic Manager (DNS-based global routing)&lt;/li&gt;
&lt;li&gt;Cross-region Load Balancer (preview/GA depending on when you read this)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OCI gives you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load Balancer (L4 and L7 combined, regional)&lt;/li&gt;
&lt;li&gt;Network Load Balancer (L4, ultra-low latency)&lt;/li&gt;
&lt;li&gt;Traffic Management Steering Policies (DNS-based routing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right off the bat, OCI's lineup is simpler. One load balancer does both L4 and L7, which is conceptually cleaner. Azure split these because they evolved separately, and while that gives you more targeted options, it also means more decision paralysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 4: The Basic Building Block
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Azure Load Balancer
&lt;/h3&gt;

&lt;p&gt;Azure's L4 load balancer is solid and boring, which is what you want. You create a load balancer (Standard SKU is the only one that matters anymore), add a frontend IP configuration, create a backend pool, define health probes, and set up load balancing rules. It supports both inbound and outbound scenarios.&lt;/p&gt;

&lt;p&gt;The Standard SKU is zone-redundant by default, which is great. It also supports multiple frontend IPs on a single load balancer, useful for hosting multiple services. HA Ports is a feature that lets you load balance all ports with a single rule, which sounds niche until you need it for NVAs.&lt;/p&gt;

&lt;p&gt;Health probes are straightforward: HTTP, HTTPS, or TCP. You set an interval and threshold, and if a backend fails, it's taken out of rotation. The one thing that trips people up is the default 15-second probe interval - sometimes you need to tune this for slow-starting applications.&lt;/p&gt;

&lt;p&gt;Pricing is per rule plus data processed. The data processing charge is $0.005/GB, which adds up but isn't terrible. Where it gets expensive is when you need multiple load balancers for isolation or different configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  OCI Load Balancer (L4 mode)
&lt;/h3&gt;

&lt;p&gt;OCI's Load Balancer can operate in L4 mode, but honestly, if you want pure L4, you probably want the Network Load Balancer instead. It's newer, faster, and purpose-built.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Network Load Balancer&lt;/strong&gt; is OCI's answer to AWS's Network Load Balancer, and it's genuinely impressive from a performance standpoint. We're talking microsecond-level latency, preservation of source IP by default, and the ability to handle millions of connections per second per instance.&lt;/p&gt;

&lt;p&gt;Unlike Azure's load balancer, the Network Load Balancer is truly transparent - clients see the actual backend IPs. This is huge for protocols that care about source IP, and you don't need to mess with X-Forwarded-For headers or connection draining strategies.&lt;/p&gt;

&lt;p&gt;Configuration is simpler than Azure: backend sets, listeners, and health checks. That's basically it. The health checks are TCP or HTTP/HTTPS, similar to Azure.&lt;/p&gt;

&lt;p&gt;Pricing is per hour plus data processed (called "bandwidth" in OCI pricing docs). The per-GB rate - $0.008/GB versus Azure's $0.005/GB - is higher, but the hourly cost is lower, so which cloud comes out cheaper depends on your traffic patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Winner for L4?
&lt;/h3&gt;

&lt;p&gt;For most use cases, they're comparable. Azure's Load Balancer is more mature with more features and better documentation. OCI's Network Load Balancer is faster and cheaper at scale. If you're doing high-throughput, latency-sensitive work, OCI wins. For everything else, it's a wash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 7: Where It Gets Interesting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Azure Application Gateway
&lt;/h3&gt;

&lt;p&gt;Application Gateway is Azure's L7 load balancer, and it's feature-packed. You get URL-based routing, host-based routing, SSL termination, end-to-end SSL, cookie-based session affinity, WebSocket support, custom health probes, and integration with Azure Web Application Firewall.&lt;/p&gt;

&lt;p&gt;The architecture is straightforward: you create an Application Gateway with a frontend IP, add backend pools, create HTTP settings (which define how Application Gateway talks to backends), and then set up routing rules that tie listeners to backend pools.&lt;/p&gt;

&lt;p&gt;URL-based routing lets you send &lt;code&gt;/api/*&lt;/code&gt; to one backend pool and &lt;code&gt;/images/*&lt;/code&gt; to another, which is genuinely useful for microservices. Host-based routing lets you serve multiple domains from the same gateway. The WAF integration is solid - you get OWASP Core Rule Set protection with minimal config.&lt;/p&gt;

&lt;p&gt;Here's what people don't tell you: Application Gateway is slow to provision. We're talking 20-30 minutes for initial deployment. Updates are also slow. If you're doing infrastructure-as-code with frequent rebuilds, this gets old fast.&lt;/p&gt;

&lt;p&gt;Autoscaling exists but works differently than you'd expect. Application Gateway v2 (the current version) scales based on compute units, which are a function of connection count, throughput, and compute. You set min and max instance counts, and Azure handles the scaling. It works, but it's not as responsive as I'd like - expect 3-5 minutes for scale-out operations.&lt;/p&gt;

&lt;p&gt;The pricing model is complicated: you pay per hour for the gateway itself, per compute unit hour, and for data processing. A small gateway with moderate traffic might cost $200-400/month. A large gateway with autoscaling and WAF can easily hit $2000+/month. Read the pricing page carefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  OCI Load Balancer (L7 mode)
&lt;/h3&gt;

&lt;p&gt;OCI's Load Balancer handles both L4 and L7 traffic, and you choose when you configure your listeners. For L7, it supports path routing, hostname routing, SSL termination, session persistence, and health checks.&lt;/p&gt;

&lt;p&gt;What's nice: you get both capabilities in one service. What's less nice: the configuration model is less intuitive than Application Gateway if you're used to Azure.&lt;/p&gt;

&lt;p&gt;You define backend sets (groups of backends), listeners (frontends that accept traffic), and then routing policies within listeners that determine where traffic goes. Path-based routing uses "route rules" within the listener configuration. It works, but the mental model took me a minute to grasp.&lt;/p&gt;
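&lt;p&gt;To make that mental model concrete, here is an illustrative-only sketch of the pieces and how a route rule picks a backend set. The field names are invented for clarity; they are not the real OCI API shapes:&lt;/p&gt;

```python
# Illustrative mental model only -- these dict keys are made up,
# not the actual OCI Load Balancer API.
lb_config = {
    "backend_sets": {
        "api-backends": {"backends": ["10.0.1.5:8080", "10.0.1.6:8080"],
                         "health_check": {"protocol": "HTTP", "path": "/healthz"}},
        "static-backends": {"backends": ["10.0.2.5:80"],
                            "health_check": {"protocol": "TCP"}},
    },
    "listeners": {
        "https-443": {
            "port": 443,
            "route_rules": [  # first match wins
                {"path_prefix": "/api", "backend_set": "api-backends"},
                {"path_prefix": "/", "backend_set": "static-backends"},
            ],
        }
    },
}

def pick_backend_set(listener, path):
    """Walk the listener's route rules and return the first match."""
    for rule in listener["route_rules"]:
        if path.startswith(rule["path_prefix"]):
            return rule["backend_set"]
```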

&lt;p&gt;The Web Application Firewall is a separate service in OCI, not integrated into the load balancer like Azure. You create a WAF policy and attach it to the load balancer. This separation is cleaner architecturally but means more moving parts.&lt;/p&gt;

&lt;p&gt;Performance is good - I haven't run into latency issues. Provisioning is faster than Azure Application Gateway, usually 5-10 minutes. Updates are also faster.&lt;/p&gt;

&lt;p&gt;Pricing is simpler: you pay per hour based on the shape (bandwidth capacity) you choose, plus data processed. A 10 Mbps load balancer is around $20-30/month, a 400 Mbps is around $150/month, plus the $0.008/GB for processed data. For equivalent capacity, it's often cheaper than Application Gateway.&lt;/p&gt;

&lt;h3&gt;
  
  
  Winner for L7?
&lt;/h3&gt;

&lt;p&gt;Application Gateway has more features and better documentation, especially around complex routing scenarios. If you need tight WAF integration or you're already deep in Azure, it's the obvious choice.&lt;/p&gt;

&lt;p&gt;OCI's Load Balancer is simpler, faster to provision, and cheaper. If you don't need every feature and want something that just works, it's compelling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global Load Balancing and Traffic Management
&lt;/h2&gt;

&lt;p&gt;This is where the platforms diverge significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Front Door
&lt;/h3&gt;

&lt;p&gt;Front Door is Azure's global L7 load balancer with CDN capabilities. You configure backends across multiple regions, and Front Door routes users to the best backend based on latency, health, and your routing preferences.&lt;/p&gt;

&lt;p&gt;The feature set is extensive: URL routing, session affinity, custom domains with SSL, caching, DDoS protection, and WAF. You can do A/B testing by splitting traffic percentages. The Edge locations are Azure's CDN POPs, so global coverage is excellent.&lt;/p&gt;

&lt;p&gt;The killer feature is the intelligent routing. Front Door constantly monitors backend health and latency from its edge locations. If a backend goes down or gets slow, traffic automatically fails over. This happens in seconds, not minutes. For global applications where downtime is expensive, this is huge.&lt;/p&gt;

&lt;p&gt;Caching works well if your content is cacheable. You define caching rules, and Front Door serves cached content from edge locations. This reduces load on your backends and improves latency for end users.&lt;/p&gt;

&lt;p&gt;The gotchas: Front Door is expensive. You pay per routing rule, per custom domain, per GB of data transfer out, and per request. A moderate-traffic site can easily hit $500-1000/month. High-traffic sites pay more. The cost scales roughly linearly with traffic, which can be painful.&lt;/p&gt;

&lt;p&gt;Also, Front Door's configuration model is complex. You've got front-end hosts, backend pools, routing rules, and rules engines. The learning curve is steep.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Traffic Manager
&lt;/h3&gt;

&lt;p&gt;Traffic Manager is DNS-based routing, not a true load balancer. You create a Traffic Manager profile, add endpoints (which can be Azure resources or external IPs), and choose a routing method: priority, weighted, performance, geographic, or multivalue.&lt;/p&gt;

&lt;p&gt;Performance-based routing sends users to the closest endpoint based on DNS resolution latency. Geographic routing sends users to specific endpoints based on their location. Weighted routing lets you split traffic for A/B testing or gradual rollouts.&lt;/p&gt;
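&lt;p&gt;Weighted routing is easy to reason about with a quick simulation. This sketch mimics a 90/10 canary split at the DNS layer; the endpoint names and weights are hypothetical:&lt;/p&gt;

```python
import random

# Simulate a weighted DNS split for a gradual rollout.
# Endpoint names and weights are hypothetical.
endpoints = {"prod-v1": 90, "prod-v2": 10}

def resolve(rng):
    """Pick one endpoint per DNS resolution, proportional to weight."""
    names = list(endpoints)
    weights = [endpoints[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
hits = {name: 0 for name in endpoints}
for _ in range(10_000):
    hits[resolve(rng)] += 1
print(hits)  # roughly a 90/10 split across resolutions
```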

&lt;p&gt;Traffic Manager is cheap - $0.54 per million DNS queries plus $0.36 per monitored Azure endpoint per month. For most workloads, this is under $50/month.&lt;/p&gt;

&lt;p&gt;The limitation is that it's DNS-based, so you're subject to DNS TTL and caching. Failover isn't instant - it depends on clients respecting TTL, which not all do. For critical workloads, you probably want Front Door. For cost-sensitive scenarios where 30-60 second failover is acceptable, Traffic Manager works fine.&lt;/p&gt;
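&lt;p&gt;A useful back-of-the-envelope: worst-case DNS failover time is roughly the health-probe detection window plus the record TTL a client may still be caching. A minimal sketch, with example numbers:&lt;/p&gt;

```python
# Rough worst-case failover time for DNS-based routing.
def worst_case_failover(ttl_s, probe_interval_s, failed_probes):
    detection = probe_interval_s * failed_probes  # monitor declares "down"
    return detection + ttl_s                      # plus stale client caches

# e.g. 30s TTL, 10s probes, 3 failed probes before failover
print(worst_case_failover(ttl_s=30, probe_interval_s=10, failed_probes=3))  # 60
```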

&lt;h3&gt;
  
  
  OCI Traffic Management Steering Policies
&lt;/h3&gt;

&lt;p&gt;OCI's global routing is entirely DNS-based, similar to Traffic Manager. You create a steering policy and attach it to a DNS zone. The policies support failover, load balancing, geolocation steering, and ASN steering (routing based on autonomous system number, which is niche but cool).&lt;/p&gt;

&lt;p&gt;Health checks are built in - you define endpoints and health check configurations, and OCI removes unhealthy endpoints from DNS responses.&lt;/p&gt;

&lt;p&gt;The interface is less polished than Azure's, and the documentation is thinner. But it works, and it's included in your OCI subscription without per-query charges beyond standard DNS pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Winner for Global Routing?
&lt;/h3&gt;

&lt;p&gt;Front Door is the most powerful option by far, but you pay for it. If you need intelligent routing with sub-second failover and don't mind the cost, it's unmatched.&lt;/p&gt;

&lt;p&gt;Traffic Manager and OCI's steering policies are comparable - both DNS-based, both cheap, both limited by DNS behavior. Choose based on which cloud you're already using.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stuff That Actually Matters in Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SSL/TLS Termination
&lt;/h3&gt;

&lt;p&gt;Both platforms handle this well. Azure Application Gateway and Front Door support SNI (Server Name Indication) so you can host multiple SSL sites on one load balancer. You can use Azure Key Vault for certificate storage, which is convenient.&lt;/p&gt;

&lt;p&gt;OCI Load Balancer also supports SNI and lets you store certificates directly in the load balancer or use OCI Vault. The certificate renewal story is less automated than Azure, though. Azure has better integration with Let's Encrypt and automatic cert renewal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Persistence
&lt;/h3&gt;

&lt;p&gt;Azure Application Gateway supports cookie-based session affinity. It inserts a cookie and ensures subsequent requests from the same client go to the same backend. This works fine but breaks if your backends scale down and that specific instance disappears.&lt;/p&gt;

&lt;p&gt;OCI Load Balancer supports application cookie persistence (you specify the cookie name) or load balancer-generated cookie persistence. The flexibility is nice if your app already uses session cookies.&lt;/p&gt;
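&lt;p&gt;The mechanics of load balancer-generated cookie persistence can be sketched in a few lines. This is illustrative only - the cookie scheme and routing logic are made up - and it also shows why stickiness breaks when the pinned backend disappears:&lt;/p&gt;

```python
import hashlib

# Illustrative sticky-session sketch; the cookie scheme is made up.
backends = ["10.0.1.5", "10.0.1.6", "10.0.1.7"]

def route(cookie, client_id):
    """Return (backend, set_cookie_value) for one request."""
    if cookie in backends:                 # valid sticky cookie: honor it
        return cookie, cookie
    # no cookie, or the pinned backend scaled away: pick again
    h = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    chosen = backends[h % len(backends)]
    return chosen, chosen
```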

&lt;p&gt;For serious session management, though, you should be using a distributed cache or database anyway. Sticky sessions at the load balancer are a band-aid.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connection Draining
&lt;/h3&gt;

&lt;p&gt;Azure Application Gateway calls this "connection drain timeout." When you remove a backend or it fails health checks, existing connections get a grace period to finish before being forcibly closed. Default is 30 seconds, max is 500 seconds.&lt;/p&gt;

&lt;p&gt;OCI Load Balancer has the same concept, called "connection drain timeout." Default is 300 seconds, max is 3600 seconds (one hour, which seems excessive but okay).&lt;/p&gt;

&lt;p&gt;Both work as expected. Set this based on how long your backend requests typically take, with a safety margin.&lt;/p&gt;
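&lt;p&gt;One way to pick that safety margin is to derive the drain timeout from observed request durations - roughly the slow-tail percentile times a margin. A minimal sketch:&lt;/p&gt;

```python
# Derive a connection drain timeout from observed request durations.
def drain_timeout(durations_s, percentile=0.99, margin=1.5):
    ranked = sorted(durations_s)
    idx = min(int(len(ranked) * percentile), len(ranked) - 1)
    return ranked[idx] * margin  # slow-tail duration plus headroom

# e.g. a mostly-fast API with a slow tail
samples = [0.2] * 97 + [5.0, 8.0, 20.0]
print(drain_timeout(samples))  # 30.0 -> a 30s drain covers the tail
```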

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Azure wins here, not even close. Application Gateway and Front Door integrate beautifully with Azure Monitor. You get metrics out of the box - request count, response time, healthy/unhealthy host count, backend response time, and more. You can set up alerts, create dashboards, and query logs in Log Analytics.&lt;/p&gt;

&lt;p&gt;Diagnostic logs give you detailed request/response information, including headers, which is invaluable for debugging. Front Door also logs cache hit rates and origin latency.&lt;/p&gt;

&lt;p&gt;OCI Load Balancer integrates with OCI Monitoring and Logging services, but it's more barebones. You get basic metrics - bandwidth, connections, health status - but not as much detail. Access logs exist but require manual parsing. There's no equivalent to Azure's Application Insights integration.&lt;/p&gt;

&lt;p&gt;If you need deep visibility into traffic patterns and performance, Azure's tooling is significantly better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability and SLAs
&lt;/h3&gt;

&lt;p&gt;Azure Application Gateway Standard_v2 has a 99.95% SLA. Front Door has 99.99%. Traffic Manager has 99.99%.&lt;/p&gt;

&lt;p&gt;OCI Load Balancer (both types) has a 99.95% SLA.&lt;/p&gt;

&lt;p&gt;In practice, I've had more issues with Azure Application Gateway than OCI Load Balancer, but that's anecdotal and probably a function of traffic volume and configuration complexity.&lt;/p&gt;
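&lt;p&gt;SLA percentages are easier to compare as allowed downtime, and remember that chained components multiply. A quick sketch of both calculations:&lt;/p&gt;

```python
# Convert an SLA percentage into allowed monthly downtime, and
# compose SLAs for chained components (availabilities multiply).
def monthly_downtime_minutes(sla_percent, minutes=30 * 24 * 60):
    return minutes * (1 - sla_percent / 100)

def composite_sla(*sla_percents):
    avail = 1.0
    for s in sla_percents:
        avail *= s / 100
    return avail * 100

print(round(monthly_downtime_minutes(99.95), 1))  # ~21.6 min/month allowed
print(round(composite_sla(99.99, 99.95), 4))      # ~99.94 for the whole chain
```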

&lt;h2&gt;
  
  
  Real-World Decision Points
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Azure Load Balancer if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're already in Azure and need basic L4 load balancing&lt;/li&gt;
&lt;li&gt;You need HA Ports for network virtual appliances&lt;/li&gt;
&lt;li&gt;You want zone redundancy without thinking about it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose OCI Network Load Balancer if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need ultra-low latency (microseconds matter)&lt;/li&gt;
&lt;li&gt;You're handling very high connection rates&lt;/li&gt;
&lt;li&gt;You want source IP preservation without X-Forwarded-For hacks&lt;/li&gt;
&lt;li&gt;You're cost-conscious and pushing serious traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Azure Application Gateway if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need sophisticated L7 routing (URL paths, hostnames, headers)&lt;/li&gt;
&lt;li&gt;WAF integration is important&lt;/li&gt;
&lt;li&gt;You're comfortable with Azure pricing and the provisioning time&lt;/li&gt;
&lt;li&gt;You want excellent observability and monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose OCI Load Balancer if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want one service that does both L4 and L7&lt;/li&gt;
&lt;li&gt;You prefer simpler pricing and faster provisioning&lt;/li&gt;
&lt;li&gt;You don't need every possible feature, just the core ones done well&lt;/li&gt;
&lt;li&gt;Budget is tight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Azure Front Door if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need global distribution with intelligent routing&lt;/li&gt;
&lt;li&gt;Sub-second failover matters for your business&lt;/li&gt;
&lt;li&gt;You want integrated CDN capabilities&lt;/li&gt;
&lt;li&gt;You can afford premium pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose DNS-based routing (Traffic Manager or OCI) if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need basic global routing&lt;/li&gt;
&lt;li&gt;30-60 second failover is acceptable&lt;/li&gt;
&lt;li&gt;You want to minimize cost&lt;/li&gt;
&lt;li&gt;Your traffic patterns are predictable&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Actually Use
&lt;/h2&gt;

&lt;p&gt;For internal services in Azure, I use Azure Load Balancer. It's simple and cheap enough that I don't overthink it.&lt;/p&gt;

&lt;p&gt;For public-facing APIs in Azure, I use Application Gateway with WAF. The cost hurts, but the security and routing features justify it. I've learned to automate the slow provisioning times with parallel deployments.&lt;/p&gt;

&lt;p&gt;For global applications, I use Front Door despite the cost. The performance and failover capabilities are worth it when downtime directly impacts revenue.&lt;/p&gt;

&lt;p&gt;For OCI, I use Network Load Balancer for backend services and the regular Load Balancer for public-facing apps. The pricing is friendly enough that I don't worry about optimizing too hard.&lt;/p&gt;

&lt;p&gt;The honest truth? Both platforms have good load balancing options. Azure has more features and better observability. OCI is simpler and cheaper. Pick based on what matters more for your specific use case, not based on theoretical maximums you'll never hit.&lt;/p&gt;

&lt;p&gt;And for the love of all that's holy, test your failover scenarios before you need them. I've seen too many "highly available" architectures fall apart because nobody actually tested what happens when a backend dies.&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>networking</category>
      <category>azure</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Microsoft Azure vs OCI Networking: A Deep Dive</title>
      <dc:creator>vaibhav bedi</dc:creator>
      <pubDate>Sat, 15 Nov 2025 22:08:33 +0000</pubDate>
      <link>https://forem.com/vaibhav_bedi_82eeb9670233/microsoft-azure-vs-oci-networking-a-deep-dive-1h13</link>
      <guid>https://forem.com/vaibhav_bedi_82eeb9670233/microsoft-azure-vs-oci-networking-a-deep-dive-1h13</guid>
      <description>&lt;h1&gt;
  
  
  Microsoft Azure vs OCI Networking: A Deep Dive
&lt;/h1&gt;

&lt;p&gt;So you're evaluating cloud providers and you've gotten past the usual suspects. Azure's everywhere, obviously, but Oracle Cloud Infrastructure keeps popping up in conversations, especially when people talk about networking performance and cost. I spent the last few months working with both platforms pretty heavily, and honestly, the networking models are different enough that it's worth digging into the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mental Models Are Different
&lt;/h2&gt;

&lt;p&gt;This is the thing that hit me first. Azure and OCI approach networking from fundamentally different philosophies, and once you understand that, everything else makes more sense.&lt;/p&gt;

&lt;p&gt;Azure feels like it evolved. Because it did. You've got Virtual Networks (VNets), but then you've also got Classic VNets (deprecated but still haunting documentation), then Service Endpoints, then Private Link, then VNet peering, then Virtual WAN. Each feature was added to solve a problem, and while they all work, you're dealing with layers of abstraction that sometimes feel like archaeological strata.&lt;/p&gt;

&lt;p&gt;OCI feels like someone sat down and said "what if we designed this from scratch in 2016, knowing everything we know now?" The result is cleaner but less forgiving if you don't understand the fundamentals. There's a Virtual Cloud Network (VCN), and that's pretty much it. Everything else is routing rules, security lists, and network security groups. It's more Unix-philosophy: do one thing well.&lt;/p&gt;

&lt;h2&gt;
  
  
  VNets vs VCNs: The Foundation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure VNets&lt;/strong&gt; give you a /16 by default, though you can go smaller or bigger. You carve them up into subnets, and subnets are where you actually attach resources. Subnets can span availability zones, which is both convenient and slightly terrifying from a failure domain perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OCI VCNs&lt;/strong&gt; also start with a CIDR block (up to /16), but subnets work differently. A subnet is either regional or pinned to a specific availability domain (OCI's term for AZ), and you explicitly choose whether it's public or private when you create it. This forces you to think about your architecture upfront, which I've learned to appreciate even when it's annoying.&lt;/p&gt;

&lt;p&gt;The addressing flexibility in Azure is better if you're doing complex hybrid scenarios. Azure lets you modify address spaces on existing VNets, add address ranges, even swap them around. OCI is stricter - you define your CIDR blocks upfront and you're mostly stuck with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Peering Story
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure VNet Peering&lt;/strong&gt; is straightforward: you peer two VNets, traffic flows between them at Azure backbone speeds, and you pay for data transfer. You can peer globally, which is legitimately useful. The gotcha is that peering is non-transitive by default, so if VNet A peers with VNet B, and VNet B peers with VNet C, A and C can't talk. You need to set up a hub-and-spoke with NVAs or use Virtual WAN to get around this.&lt;/p&gt;
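&lt;p&gt;The non-transitivity rule is easy to model: direct peerings form an edge list, and two VNets can only talk if a direct edge exists (or a hub routes between them). A toy sketch with hypothetical VNet names:&lt;/p&gt;

```python
# Non-transitive peering as an edge list. VNet names are hypothetical.
peerings = {("hub", "spoke-a"), ("hub", "spoke-b")}

def can_talk_directly(a, b):
    """Peering is bidirectional but NOT transitive."""
    return (a, b) in peerings or (b, a) in peerings

assert can_talk_directly("hub", "spoke-a")
assert not can_talk_directly("spoke-a", "spoke-b")  # needs an NVA/Virtual WAN hop
```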

&lt;p&gt;&lt;strong&gt;OCI uses Local Peering Gateways (LPG)&lt;/strong&gt; for VCNs in the same region and Dynamic Routing Gateways (DRG) for cross-region. The DRG is actually pretty slick - it acts as a regional router that can connect multiple VCNs, on-premises networks via FastConnect, and even VCNs in other regions. Transitivity is built-in if you route through a DRG, which saves a lot of headache.&lt;/p&gt;

&lt;p&gt;One thing that surprised me: OCI's inter-region traffic between VCNs is free if you use their backbone. Azure charges you for cross-region VNet peering. When you're moving serious data between regions, this adds up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internet Connectivity and NAT
&lt;/h2&gt;

&lt;p&gt;Azure gives you a few options. You can assign public IPs directly to resources, use a NAT Gateway for outbound from private subnets, or run traffic through an NVA. The NAT Gateway is fully managed and priced per hour plus data processing. It's fine, works as expected.&lt;/p&gt;

&lt;p&gt;OCI has NAT Gateways too, but also Internet Gateways. The difference matters: an Internet Gateway is for resources with public IPs that need inbound and outbound access. A NAT Gateway is for private resources that only need outbound. You attach these to your VCN's route tables. Coming from AWS, this felt familiar. Coming from Azure, it felt like extra steps, but it gives you more granular control.&lt;/p&gt;

&lt;p&gt;One weird OCI quirk: you can attach a public IP to a resource in a private subnet, but it won't work unless you have the right route table rules. This has bitten me exactly once, and that was enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security: NSGs and All That
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure Network Security Groups&lt;/strong&gt; attach to subnets or individual NICs. Rules are priority-based (lower number = higher priority), and you get default rules you can't delete. The portal UI for managing complex NSG rules is... not great. I end up using ARM templates or Terraform for anything non-trivial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OCI has both Security Lists and Network Security Groups&lt;/strong&gt;. Security Lists are the old way, applied at the subnet level. NSGs are newer, more flexible, and work more like AWS security groups - they're stateful, you attach them at the VNIC level, and they're generally the preferred approach now. The documentation pushes you toward NSGs, and you should listen.&lt;/p&gt;

&lt;p&gt;OCI's default security posture is more locked down. Azure tends to be permissive by default (especially with PaaS services), and you lock things down. OCI makes you explicitly allow traffic. Neither approach is wrong, but you need to know which world you're in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load Balancing
&lt;/h2&gt;

&lt;p&gt;Azure's got Application Gateway (L7, WAF-capable), Load Balancer (L4, regional or global with cross-region), and Front Door (global L7 with CDN). Front Door is genuinely great for global applications, but the pricing can shock you if you're not careful.&lt;/p&gt;

&lt;p&gt;OCI has Load Balancers (L4/L7 in the same service, which is cleaner conceptually) and Network Load Balancers for ultra-low-latency L4. The performance on OCI's Network Load Balancers is wild - single-digit microsecond latency in some scenarios. If you're doing high-frequency trading or real-time gaming, this matters. For most of us, it's overkill.&lt;/p&gt;

&lt;p&gt;Configuration-wise, Azure's load balancers have more knobs to turn. OCI's are simpler but less flexible. Pick your poison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Private Connectivity to Other Services
&lt;/h2&gt;

&lt;p&gt;Azure's story here has evolved into Private Link, which is honestly elegant once you understand it. You create a Private Endpoint in your VNet, it gets a private IP, and you can connect to Azure PaaS services or your own services over the Azure backbone. No public internet, no service endpoints weirdness, just private IPs. The DNS side can be tricky, though. You need private DNS zones, and linking them correctly to your VNets is a common source of frustration.&lt;/p&gt;

&lt;p&gt;OCI uses Service Gateways for private access to Oracle Services (Object Storage, etc.) without traversing the internet. It's more limited in scope than Private Link but works well for what it covers. For your own services, you're generally using private IPs within your VCN and relying on the network fabric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Connectivity
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure ExpressRoute&lt;/strong&gt; is mature, widely available, and works with basically every major carrier. You get private peering for your VNets, Microsoft peering for M365/Dynamics, and options for Global Reach to connect your on-prem locations through Azure's backbone. The pricing is per port plus data transfer, and it gets expensive fast in higher tiers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OCI FastConnect&lt;/strong&gt; is similar conceptually but simpler in practice. You connect to OCI, you get access to your VCNs via DRG, done. The pricing is generally lower than ExpressRoute, especially for higher bandwidths. OCI also has some interesting partnerships with Azure for direct interconnection between the two clouds, which is useful if you're running Oracle databases in OCI but everything else in Azure.&lt;/p&gt;

&lt;p&gt;The Azure-OCI Interconnect deserves its own mention. Microsoft and Oracle partnered to create dedicated connections between their clouds in certain regions. If you're running Oracle databases in OCI and need to connect them to Azure-hosted apps, this is way better than going over the internet. Latency is single-digit milliseconds in supported regions.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS and Service Discovery
&lt;/h2&gt;

&lt;p&gt;Azure DNS is solid. You can host public zones, create private zones for internal resolution, and link private zones to VNets. The integration with Private Link means DNS just works for private endpoints, once you've set up the zones.&lt;/p&gt;

&lt;p&gt;OCI uses a resolver model. Each VCN has a DNS resolver, and you can configure custom resolvers for hybrid scenarios. It works, but it feels less polished than Azure's approach. For complex hybrid DNS scenarios, you'll probably end up running your own DNS infrastructure in both clouds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Azure Network Watcher gives you topology views, packet capture, connection troubleshooting, NSG flow logs, and Traffic Analytics. Flow logs go to Log Analytics, and you can query them with KQL. It's comprehensive but can be overwhelming. The cost of storing flow logs in Log Analytics also sneaks up on you.&lt;/p&gt;

&lt;p&gt;OCI's VCN Flow Logs are simpler. You enable them per subnet, they dump to Object Storage or Logging service, and you parse them yourself. There's less built-in analysis, but the raw logs are cheaper to store. If you're comfortable with log analysis tools, this is fine. If you want dashboards out of the box, Azure's further along.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Cost
&lt;/h2&gt;

&lt;p&gt;This is where things get spicy. OCI's network backbone is genuinely fast. They built it recently with modern hardware, and it shows. For workloads where network latency and throughput matter - databases, especially - OCI often performs better than Azure in benchmarks.&lt;/p&gt;

&lt;p&gt;Azure's network is more geographically distributed, though. If you need presence in 60+ regions, Azure's got you covered. OCI is growing fast but still has fewer regions.&lt;/p&gt;

&lt;p&gt;On cost, OCI is usually cheaper for raw compute and egress. Azure's egress charges are brutal - $0.087/GB for the first 10TB in most regions. OCI charges $0.0085/GB for the first 10TB. That's not a typo. For data-intensive workloads, this difference is massive.&lt;/p&gt;
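&lt;p&gt;To put those egress rates in perspective, here is the bill for pushing 10 TB out of each cloud at the quoted per-GB prices (ignoring free tiers and volume discounts):&lt;/p&gt;

```python
# Egress cost gap at the per-GB rates quoted above.
AZURE_EGRESS_PER_GB = 0.087
OCI_EGRESS_PER_GB = 0.0085

def egress_cost(gb, rate):
    return gb * rate

tb10 = 10 * 1024  # 10 TB expressed in GB
print(f"Azure: ${egress_cost(tb10, AZURE_EGRESS_PER_GB):,.2f}")  # $890.88
print(f"OCI:   ${egress_cost(tb10, OCI_EGRESS_PER_GB):,.2f}")    # $87.04
```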

&lt;h2&gt;
  
  
  The Verdict?
&lt;/h2&gt;

&lt;p&gt;There isn't one, not really. Azure makes sense if you're already in the Microsoft ecosystem, need global coverage, or want mature PaaS services. The networking is complex but powerful, and you can do basically anything if you're willing to learn the 47 different ways to connect things.&lt;/p&gt;

&lt;p&gt;OCI makes sense if you're cost-conscious, running Oracle databases, or need predictable high performance. The networking is simpler but more opinionated. You'll spend less time fighting abstractions and more time understanding IP routing, which might be a plus or minus depending on your perspective.&lt;/p&gt;

&lt;p&gt;For what it's worth, I've stopped thinking about this as an either/or question. A lot of companies are running both, using Azure for general workloads and OCI for Oracle databases or cost-sensitive batch processing. The Azure-OCI interconnect makes this surprisingly practical.&lt;/p&gt;

&lt;p&gt;The real advice? Spin up free tiers in both, build a simple multi-tier app, and see which model clicks for you. Networking is one of those things where hands-on experience beats any article, including this one.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>azure</category>
      <category>architecture</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Network Resilience &amp; Routing Reliability: Lessons from Real-World Cloud Systems</title>
      <dc:creator>vaibhav bedi</dc:creator>
      <pubDate>Sat, 15 Nov 2025 21:48:13 +0000</pubDate>
      <link>https://forem.com/vaibhav_bedi_82eeb9670233/network-resilience-routing-reliability-lessons-from-real-world-cloud-systems-2joe</link>
      <guid>https://forem.com/vaibhav_bedi_82eeb9670233/network-resilience-routing-reliability-lessons-from-real-world-cloud-systems-2joe</guid>
      <description>&lt;p&gt;When you work with large-scale cloud systems long enough, you realize one thing very quickly: the network is always the first thing blamed and the last thing actually understood.&lt;/p&gt;

&lt;p&gt;But here's the truth — networks fail. Links go down. Hardware glitches. Someone pushes a bad config. Routing takes an unexpected path. And when that happens, everything sitting on top — APIs, microservices, storage, ML systems — starts to feel the pain.&lt;/p&gt;

&lt;p&gt;Over the last few years working on cloud networking and traffic reliability, I've seen how much impact a well-designed (or poorly designed) network can have on availability. So I wanted to share some practical thoughts on what network resilience actually means and how routing reliability helps you survive failures without major outages.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what is network resilience really?
&lt;/h2&gt;

&lt;p&gt;Network resilience is simply the ability of your network to keep things running when something inevitably breaks.&lt;/p&gt;

&lt;p&gt;It's not about avoiding failure — no one can do that.&lt;/p&gt;

&lt;p&gt;It's about absorbing failure.&lt;/p&gt;

&lt;p&gt;A resilient network:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has redundant paths&lt;/li&gt;
&lt;li&gt;Detects failures quickly&lt;/li&gt;
&lt;li&gt;Moves traffic automatically&lt;/li&gt;
&lt;li&gt;Doesn't depend on someone debugging a router at 2 AM&lt;/li&gt;
&lt;li&gt;Recovers on its own before customers notice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your network depends on humans reacting to alarms, it's not resilient. It's reactive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing reliability: the underrated hero
&lt;/h2&gt;

&lt;p&gt;Even if you build all the redundancy you want, routing is what decides whether packets actually get where they're supposed to.&lt;/p&gt;

&lt;p&gt;Reliable routing means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic always takes a healthy path&lt;/li&gt;
&lt;li&gt;Failovers happen fast&lt;/li&gt;
&lt;li&gt;You avoid loops, blackholes, asymmetric paths&lt;/li&gt;
&lt;li&gt;Your routing tables don't flap every few minutes&lt;/li&gt;
&lt;li&gt;A single node failure doesn't blow up half the region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud networks run millions of flows per second. A few seconds of routing instability can create a chain reaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What resilient networks look like (based on real systems)
&lt;/h2&gt;

&lt;p&gt;Here are the core patterns you'll see in production cloud networks:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multiple equal-cost paths everywhere
&lt;/h3&gt;

&lt;p&gt;Most modern networks (AWS, OCI, GCP, Azure) use ECMP so traffic can be instantly redistributed if a link dies.&lt;/p&gt;

&lt;p&gt;This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher throughput&lt;/li&gt;
&lt;li&gt;Built-in load balancing&lt;/li&gt;
&lt;li&gt;Immediate failover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When one path fails, traffic shifts without waiting for a human.&lt;/p&gt;
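&lt;p&gt;The core of ECMP is a deterministic hash of the flow 5-tuple over the set of healthy links. A minimal sketch (the hashing scheme here is illustrative, not what any particular vendor uses):&lt;/p&gt;

```python
import hashlib

# ECMP sketch: hash the 5-tuple so one flow always takes the same
# link, while different flows spread across links. When a link dies,
# rehashing over the survivors moves traffic automatically.
def ecmp_next_hop(five_tuple, links):
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]

links = ["link-1", "link-2", "link-3", "link-4"]
flow = ("10.0.0.1", "10.0.1.9", 51514, 443, "tcp")  # src, dst, ports, proto
first = ecmp_next_hop(flow, links)
assert ecmp_next_hop(flow, links) == first   # same flow, same path
survivors = [l for l in links if l != first]
print(ecmp_next_hop(flow, survivors))        # instantly rehashed after failure
```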

&lt;h3&gt;
  
  
  2. Fast, sub-second failure detection
&lt;/h3&gt;

&lt;p&gt;Protocols like BGP/OSPF aren't fast enough out of the box. So you add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BFD (Bidirectional Forwarding Detection)&lt;/li&gt;
&lt;li&gt;Aggressive timers&lt;/li&gt;
&lt;li&gt;Graceful restart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is simple: Detect failure in milliseconds, converge the route in under a second.&lt;/p&gt;
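&lt;p&gt;The arithmetic behind that goal is simple: BFD detection time is the transmit interval times the detect multiplier. With example timer values:&lt;/p&gt;

```python
# BFD detection time = transmit interval x detect multiplier.
def bfd_detection_ms(tx_interval_ms, detect_multiplier):
    return tx_interval_ms * detect_multiplier

# 50ms hellos, declare the session down after 3 misses
print(bfd_detection_ms(50, 3))  # 150 ms, vs multiple seconds for raw BGP timers
```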

&lt;h3&gt;
  
  
  3. Automated traffic engineering
&lt;/h3&gt;

&lt;p&gt;In cloud environments, rerouting traffic is not a manual job.&lt;/p&gt;

&lt;p&gt;Automation watches for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High latency&lt;/li&gt;
&lt;li&gt;Congested links&lt;/li&gt;
&lt;li&gt;Flapping routes&lt;/li&gt;
&lt;li&gt;Degraded circuits&lt;/li&gt;
&lt;li&gt;Fiber cuts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once it sees something off, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes the bad link from rotation&lt;/li&gt;
&lt;li&gt;Recomputes paths&lt;/li&gt;
&lt;li&gt;Updates routing configs&lt;/li&gt;
&lt;li&gt;Validates that the change worked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All without anyone needing to jump on a Zoom bridge.&lt;/p&gt;
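&lt;p&gt;One pass of such a control loop might look like this sketch. The link names, thresholds, and metrics are invented for illustration; a real controller would recompute paths and push routing configs where the comment indicates, and run far richer health checks.&lt;/p&gt;

```python
# Illustrative link metrics - the names and thresholds are assumptions,
# not values from any real controller.
links = {
    "bb-link-1": {"latency_ms": 2.1, "loss_pct": 0.0, "in_rotation": True},
    "bb-link-2": {"latency_ms": 48.0, "loss_pct": 3.2, "in_rotation": True},
    "bb-link-3": {"latency_ms": 2.4, "loss_pct": 0.1, "in_rotation": True},
}

def unhealthy(metrics, max_latency_ms=10.0, max_loss_pct=1.0):
    # High latency or packet loss marks a link as degraded.
    return (metrics["latency_ms"] > max_latency_ms
            or metrics["loss_pct"] > max_loss_pct)

def remediation_cycle(links):
    """One pass of the loop: drain bad links, then validate the result."""
    for name, metrics in links.items():
        if metrics["in_rotation"] and unhealthy(metrics):
            metrics["in_rotation"] = False   # remove from rotation
            # ...recompute paths and push updated routing configs here...
    # Validation guardrail: never drain every link at once.
    assert any(m["in_rotation"] for m in links.values()), "would blackhole traffic"
    return [name for name, m in links.items() if m["in_rotation"]]

print(remediation_cycle(links))   # bb-link-2 gets drained automatically
```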

&lt;h3&gt;
  
  
  4. Safe, layered network architecture
&lt;/h3&gt;

&lt;p&gt;A resilient network is usually built with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leaf-spine fabrics&lt;/li&gt;
&lt;li&gt;Region-to-region backbones&lt;/li&gt;
&lt;li&gt;Independent control planes&lt;/li&gt;
&lt;li&gt;Redundant data paths&lt;/li&gt;
&lt;li&gt;Lots of horizontal scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't rely on any single device to "never fail." Everything has a backup, and the backup also has a backup.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Configuration discipline (arguably the most important)
&lt;/h3&gt;

&lt;p&gt;Most outages are not caused by hardware. They're caused by someone pushing a config that shouldn't have been pushed.&lt;/p&gt;

&lt;p&gt;Strong networks use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated config generation&lt;/li&gt;
&lt;li&gt;Static and dynamic validation&lt;/li&gt;
&lt;li&gt;Canary/gradual rollout&lt;/li&gt;
&lt;li&gt;Automatic rollback&lt;/li&gt;
&lt;li&gt;Change health checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your network team is still editing configs directly on routers… good luck.&lt;/p&gt;
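&lt;p&gt;Here's a hedged sketch of that pipeline: static validation first, then a device-by-device push with health checks and automatic rollback. The &lt;code&gt;validate&lt;/code&gt; and &lt;code&gt;healthy&lt;/code&gt; checks and the device fields are placeholders for whatever your environment actually verifies.&lt;/p&gt;

```python
import copy

def validate(config):
    # Static validation stand-in; a real pipeline would also lint
    # syntax, policy, and intent (these checks are illustrative).
    return config.get("asn") is not None and bool(config.get("networks"))

def healthy(device):
    # Post-change health check stand-in: are all BGP sessions still up?
    return device.get("bgp_sessions_up", 0) >= device.get("bgp_sessions_expected", 0)

def rollout(devices, new_config):
    """Canary-style rollout: push one device at a time, verify health,
    and roll back everything touched if any check fails."""
    if not validate(new_config):
        raise ValueError("config failed static validation; nothing pushed")
    snapshots = {d["name"]: copy.deepcopy(d["config"]) for d in devices}
    for i, device in enumerate(devices):
        device["config"] = new_config
        if not healthy(device):
            for touched in devices[: i + 1]:       # automatic rollback
                touched["config"] = snapshots[touched["name"]]
            return "rolled back"
        # A real pipeline would bake after the canary before widening.
    return "deployed"

devices = [
    {"name": "edge-1", "config": {}, "bgp_sessions_up": 4, "bgp_sessions_expected": 4},
    {"name": "edge-2", "config": {}, "bgp_sessions_up": 4, "bgp_sessions_expected": 4},
]
print(rollout(devices, {"asn": 64512, "networks": ["10.0.0.0/8"]}))  # "deployed"
```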

&lt;h3&gt;
  
  
  6. Proper telemetry &amp;amp; observability
&lt;/h3&gt;

&lt;p&gt;You can't fix what you can't see.&lt;/p&gt;

&lt;p&gt;Good telemetry includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packet drops&lt;/li&gt;
&lt;li&gt;ECN marks&lt;/li&gt;
&lt;li&gt;Route flaps&lt;/li&gt;
&lt;li&gt;Latency distribution (not averages!)&lt;/li&gt;
&lt;li&gt;Flow-level visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your monitoring is good, your MTTR (mean time to recovery) automatically improves.&lt;/p&gt;
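&lt;p&gt;Why distributions instead of averages? A quick illustration with made-up latency samples: a single slow outlier barely moves the median but dominates the tail your users actually hit, and the average hides both.&lt;/p&gt;

```python
import statistics

def percentile(data, q):
    """Nearest-rank percentile; crude, but enough for the illustration."""
    data = sorted(data)
    return data[min(len(data) - 1, round(q / 100 * (len(data) - 1)))]

# Illustrative per-request latencies in ms: mostly fast, one slow outlier.
samples = [2, 2, 2, 3, 2, 3, 2, 2, 3, 190]

print(f"mean = {statistics.mean(samples):.1f} ms")  # 21.1 - looks tolerable
print(f"p50  = {percentile(samples, 50)} ms")       # 2 - most users are fine
print(f"p99  = {percentile(samples, 99)} ms")       # 190 - the tail that pages you
```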

&lt;h2&gt;
  
  
  How a real failover usually plays out
&lt;/h2&gt;

&lt;p&gt;Here's what typically happens when a backbone link goes down:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;BFD detects the drop&lt;/li&gt;
&lt;li&gt;Routing protocol withdraws the route&lt;/li&gt;
&lt;li&gt;ECMP redistributes traffic to remaining good paths&lt;/li&gt;
&lt;li&gt;Traffic engineering notices new congestion hotspots&lt;/li&gt;
&lt;li&gt;Automation picks alternative backbone paths&lt;/li&gt;
&lt;li&gt;Routing configs get updated automatically&lt;/li&gt;
&lt;li&gt;System monitors confirm stability&lt;/li&gt;
&lt;li&gt;Traffic returns to normal&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this usually happens in a few seconds. If humans have to intervene, your design is not resilient enough.&lt;/p&gt;
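&lt;p&gt;The first three steps can be condensed into a few lines of illustrative Python. The routing table is a toy model (prefix mapped to a list of equal-cost next hops), not a real routing stack:&lt;/p&gt;

```python
def handle_link_down(routing_table, failed_link):
    """Steps 1-3 in miniature: detect, withdraw, let ECMP redistribute."""
    events = ["bfd: neighbor down detected"]
    for prefix, next_hops in routing_table.items():
        if failed_link in next_hops:
            next_hops.remove(failed_link)          # route withdrawn
            events.append(f"withdrew {failed_link} from {prefix}")
            if not next_hops:
                events.append(f"ALERT: {prefix} has no paths left, escalate to TE")
    events.append("ecmp: surviving flows rehashed onto remaining paths")
    return events

table = {"10.8.0.0/16": ["bb-1", "bb-2"], "10.9.0.0/16": ["bb-2", "bb-3"]}
for event in handle_link_down(table, "bb-2"):
    print(event)
```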

&lt;h2&gt;
  
  
  How you can apply these ideas to smaller environments
&lt;/h2&gt;

&lt;p&gt;You don't need to be a cloud provider to use these principles. Even a small on-prem or hybrid setup benefits from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redundant paths&lt;/li&gt;
&lt;li&gt;Dynamic routing (avoid static routes unless absolutely needed)&lt;/li&gt;
&lt;li&gt;BFD for fast failure detection&lt;/li&gt;
&lt;li&gt;Automated failover scripts&lt;/li&gt;
&lt;li&gt;Continuous monitoring&lt;/li&gt;
&lt;li&gt;Safe, validated config changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your system can survive link failures without waking someone up at night, you're already ahead.&lt;/p&gt;
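&lt;p&gt;For the "automated failover scripts" item, even something this small is a start: a ping-based health check that picks between a primary and a backup gateway. The addresses are documentation placeholders, the &lt;code&gt;-c&lt;/code&gt;/&lt;code&gt;-W&lt;/code&gt; ping flags are Linux-style, and the health check is injectable so the failover logic can be tested without a network.&lt;/p&gt;

```python
import subprocess

def ping_ok(target_ip, timeout_s=1):
    """One ping with a short timeout; True if the host answered.
    (Linux iputils flags; other platforms spell the timeout differently.)"""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target_ip],
        capture_output=True,
    )
    return result.returncode == 0

def choose_gateway(primary, backup, check=ping_ok):
    """Prefer the primary path; fail over when the check says it's gone."""
    return primary if check(primary) else backup

# Simulate a dead primary without touching the network.
primary_down = lambda ip: False
print(choose_gateway("192.0.2.1", "198.51.100.1", check=primary_down))  # backup wins
```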

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Networks are messy. They fail in unexpected ways. They recover at the worst times. And they surprise you when you least expect it.&lt;/p&gt;

&lt;p&gt;But if you design for failure—not hope for the best—you end up with systems that stay online even when things go wrong.&lt;/p&gt;

&lt;p&gt;That's really what network resilience and routing reliability are all about.&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>networking</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
