<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anjul Sahu</title>
    <description>The latest articles on Forem by Anjul Sahu (@anjuls).</description>
    <link>https://forem.com/anjuls</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353412%2F7868dbdb-029d-4ab4-b332-7c1cfdc1476a.jpg</url>
      <title>Forem: Anjul Sahu</title>
      <link>https://forem.com/anjuls</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anjuls"/>
    <language>en</language>
    <item>
      <title>Heroku to Kubernetes Migration: Clock is ticking</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 11 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</link>
      <guid>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</guid>
      <description>&lt;p&gt;For years, Heroku has been a beloved starting point for countless high-growth companies. It was revolutionary, making the deployment of an idea almost trivial. That focus on the developer experience—on simply pushing code and having it run—is why so many successful Minimum Viable Products (MVPs) and early-stage platforms were born there. It allowed engineering leadership to focus on product-market fit (PMF) instead of infrastructure.&lt;/p&gt;

&lt;p&gt;But a platform that simplifies everything also imposes limits, and for any company that has scaled past the initial bootstrap phase, those limits eventually hit two core metrics: &lt;strong&gt;control&lt;/strong&gt; and &lt;strong&gt;cost&lt;/strong&gt;. What starts as the fastest way to market often becomes a budget bottleneck and a strategic constraint.&lt;/p&gt;

&lt;p&gt;Today, with new structural changes at Heroku, the conversation about migration is no longer a matter of "if" or "when," but "now." For any business running a production-critical, profitable service, moving to Kubernetes is no longer just an optimization—it’s a necessary step to secure the next decade of growth and maintain technical sovereignty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Shift at Heroku
&lt;/h2&gt;

&lt;p&gt;On February 6, 2026, Heroku &lt;a href="https://www.heroku.com/blog/an-update-on-heroku/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; a significant strategic realignment. The platform is now transitioning to what they call a sustaining engineering model.&lt;/p&gt;

&lt;p&gt;What does that actually mean for you as a business? It means a shift in investment priority. Heroku remains a stable, production-ready environment, with continued focus on core areas like security, stability, reliability, and support. For existing credit card-paying customers, the day-to-day operations and services remain unchanged.&lt;/p&gt;

&lt;p&gt;The critical piece of news, however, is that Enterprise Account contracts will no longer be offered to new customers. While existing enterprise contracts will be honored, this decision sends a clear strategic signal: Salesforce, the parent of Heroku, is focusing its future engineering efforts elsewhere—specifically on helping organizations build and deploy enterprise-grade AI in a secure way, rather than focusing on the core, undifferentiated platform features that many growth companies rely on.&lt;/p&gt;

&lt;p&gt;In short, the platform you relied on for your MVP is telling you, quite clearly, that its main focus is changing. For a high-growth business, relying on a platform that has decided to stop innovating in your core area of need is an unacceptable risk. The decision to migrate has now moved from a "good idea" to a strategic imperative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Kubernetes a good choice?
&lt;/h2&gt;

&lt;p&gt;The cloud landscape has matured dramatically since Heroku first took center stage. While Heroku pioneered the developer-first experience, Kubernetes is now an industry standard, and the majority of companies already run it in production. For any company that has achieved PMF, Kubernetes offers benefits that directly address the pain points of a scaled Heroku implementation. You may ask why not use alternatives such as Portainer, Render, or Fly.io. You certainly can, but with each of them you remain a tenant on someone else's platform; Kubernetes is about gaining real control over both the platform and the spending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reclaiming Sovereignty and Control
&lt;/h3&gt;

&lt;p&gt;With Heroku, you are a tenant in a strictly controlled environment. That simplicity is powerful, but it comes at the cost of ultimate control. Kubernetes flips that dynamic. It gives you the blueprint for your entire infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multicloud and Hybrid Strategy:&lt;/strong&gt; Kubernetes is a universal API for infrastructure. It provides the freedom to easily shift workloads between major cloud providers (AWS, GCP, Azure), deploy on-premise, or adopt a hybrid strategy. This ability to change providers is a powerful negotiating tool and a key piece of business continuity planning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Sales Enablement:&lt;/strong&gt; For B2B SaaS, especially those with AI-native features, enterprise customers often require strict data sovereignty. They need to self-host services on their own virtual private clouds or on-premise. Heroku's architecture simply cannot support this. A Kubernetes-based platform enables you to offer a self-deployed version of your SaaS product, unlocking massive new markets in highly regulated or security-conscious industries. The control Kubernetes offers over data residency and compliance is non-negotiable for selling to large enterprise customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Efficiency
&lt;/h3&gt;

&lt;p&gt;The Heroku pricing model is famously straightforward: it’s easy to calculate, but it is expensive as you scale. This is the trade-off for simplicity.&lt;/p&gt;

&lt;p&gt;By moving to Kubernetes, you gain fine-grained control over resource allocation. You can right-size your instances, consolidate workloads, and select the most cost-effective machine types for specific tasks. While the initial setup requires more attention, the long-term cost savings are significant, especially for services with unpredictable or high-volume usage.&lt;/p&gt;

&lt;p&gt;The ecosystem itself has worked to smooth out the initial complexity. Major cloud providers now offer "autopilot" modes in their managed Kubernetes services that handle much of the underlying operational overhead. This means you can gain the cost and control benefits of Kubernetes without the burden of building a large platform engineering team.&lt;/p&gt;

&lt;p&gt;At CloudRaft, we recognize the need to simplify this process. We’ve built an accelerator called TurboRaft that is essentially a proven playbook for the modern Kubernetes platform. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps with ArgoCD:&lt;/strong&gt; For zero-touch, automated, and auditable releases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Secured secret management, automated certificate management, SAST, SBOMs and vulnerability management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Open-source monitoring with options to choose from and alerting to keep costs low while maintaining deep insight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Clear policies enforced for compliance and cost control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to deliver the "Heroku-like" ease of use for developers, but on a platform you own and control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maturation of the Kubernetes ecosystem
&lt;/h3&gt;

&lt;p&gt;A few years ago, managing Kubernetes was a job for seasoned experts. Today, much of that complexity has been absorbed by a robust and mature ecosystem. Open-source tooling, managed cloud services, and a deep community knowledge base have all contributed to making K8s a practical and reliable choice.&lt;/p&gt;

&lt;p&gt;The old argument that "Kubernetes is too complex" is mostly obsolete for a growing company. The market has solved the hardest parts. What’s left is a highly stable platform that provides the operational rigor required to run business-critical services. The Hacker News discussion &lt;a href="https://news.ycombinator.com/item?id=37379078" rel="noopener noreferrer"&gt;thread&lt;/a&gt; on the Heroku news highlights this exact sentiment, with many leaders realizing that the ecosystem is ready for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  A structured approach to migration
&lt;/h2&gt;

&lt;p&gt;No platform migration is easy; it’s a non-trivial engineering effort that must be planned as a business-critical project. Done correctly, it is an opportunity to not just move your app, but to make it stronger and more resilient for the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Assessment and Re-Architecture
&lt;/h3&gt;

&lt;p&gt;This is the most crucial phase. A migration should also be seen as a refactoring opportunity. If your application isn't strictly following cloud-native principles or the &lt;strong&gt;Twelve-Factor App&lt;/strong&gt; methodology, now is the time to correct it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk Identification:&lt;/strong&gt; We begin with a full risk assessment, examining each service in the application. We categorize them by current stability, coupling, and size to create a phased migration plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sizing and Cost Modeling:&lt;/strong&gt; Understanding the true resource needs of each service allows us to create accurate Kubernetes deployment specifications and a detailed cost projection for the new platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Simplifying the Developer Experience
&lt;/h3&gt;

&lt;p&gt;The biggest win of Heroku was the abstraction of infrastructure. We need to replicate that ease of use on Kubernetes. Developers should not need to become Kubernetes experts overnight.&lt;/p&gt;

&lt;p&gt;We convert services into Kubernetes deployments using Helm charts, then we abstract the low-level Kubernetes constructs. The goal is a simplified interface—whether it’s a basic YAML or JSON configuration—that lets developers manage their application settings without worrying about the underlying cluster management. This retains the core developer efficiency that made Heroku so appealing.&lt;/p&gt;
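&lt;p&gt;As a rough sketch of that abstraction layer (the input field names are illustrative, not a real API), a thin wrapper can expand a small developer-facing config into a full Kubernetes Deployment manifest:&lt;/p&gt;

```python
# Sketch: expand a simplified, developer-facing config into a Kubernetes
# Deployment manifest. Field names in the input dict are illustrative.
def render_deployment(app):
    name = app["name"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": app.get("replicas", 1),
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": app["image"],
                    }],
                },
            },
        },
    }

# The whole developer-facing surface: name, image, replica count.
manifest = render_deployment({"name": "web", "image": "myapp:1.4.2", "replicas": 3})
```

&lt;p&gt;Developers edit only the three-field config, much like a Procfile plus a scaling knob; the platform team owns everything the wrapper fills in.&lt;/p&gt;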

&lt;h3&gt;
  
  
  Step 3: The Data Migration Challenge
&lt;/h3&gt;

&lt;p&gt;Applications are often the easy part; the database is where the real complexity lies. A successful migration requires a strategy for moving data with near-zero downtime.&lt;/p&gt;

&lt;p&gt;We strongly recommend self-hosted database solutions on Kubernetes, particularly CloudNativePG for PostgreSQL. Running your own highly available, self-managed database on Kubernetes removes the premium cost of proprietary cloud-managed services while providing superior control over failover and disaster recovery. We’ve found CloudNativePG to be highly reliable and offer &lt;a href="http://www.cloudraft.io/postgresql-consulting" rel="noopener noreferrer"&gt;full consulting and support&lt;/a&gt; to ensure a smooth, near-zero-downtime data migration. Database upgrades and day-to-day management were easy on Heroku; with CloudNativePG and our best practices, you can keep your database on autopilot on your own platform too.&lt;/p&gt;
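&lt;p&gt;For a sense of how compact the setup is, a minimal CloudNativePG Cluster resource looks roughly like the following (shown here as a Python dict; the names and sizes are placeholders, so check the CloudNativePG documentation for your version before relying on it):&lt;/p&gt;

```python
# Rough shape of a minimal CloudNativePG Cluster resource; all values are
# placeholders for illustration, not production settings.
cnpg_cluster = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": "app-db"},
    "spec": {
        "instances": 3,  # one primary plus two replicas for high availability
        "storage": {"size": "50Gi"},
        "bootstrap": {"initdb": {"database": "app", "owner": "app"}},
    },
}
```

&lt;p&gt;From this single resource, the operator manages replication and automated failover for the cluster.&lt;/p&gt;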

&lt;h2&gt;
  
  
  The time to act is now
&lt;/h2&gt;

&lt;p&gt;The shift at Heroku is a clear alarm bell. Ignoring it means accepting escalating costs and a growing strategic risk. You now have a proven, mature, and cost-effective alternative in Kubernetes.&lt;/p&gt;

&lt;p&gt;Success in this migration hinges on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Selecting a Proven Playbook:&lt;/strong&gt; You need a tested, end-to-end framework that accounts for application, database, and operational complexities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Right Team:&lt;/strong&gt; You need a partner who has navigated this journey before and can deliver the platform quickly, abstracting away the unnecessary complexity while leaving you with full control.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where &lt;strong&gt;CloudRaft&lt;/strong&gt; comes in. We offer not just the accelerator, but the &lt;a href="https://dev.to/kubernetes-consulting"&gt;consulting&lt;/a&gt; and operational support to execute the migration and hand over a platform that is ready for enterprise-level growth. Don't wait until the cost pressure or strategic uncertainty becomes a crisis—secure your future with a modern, controlled, and cost-efficient Kubernetes platform today.&lt;/p&gt;

</description>
      <category>heroku</category>
    </item>
    <item>
      <title>Context Graphs for AI Agents: The Complete Implementation Guide</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</link>
      <guid>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</guid>
      <description>&lt;h2&gt;
  
  
  Why Do Context Graphs Matter Now for AI Agents?
&lt;/h2&gt;

&lt;p&gt;In the past few months, AI has shifted from chatbots to agents: autonomous systems that don't just answer questions but make decisions, approve exceptions, route escalations, and execute workflows across enterprise systems. &lt;a href="https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/" rel="noopener noreferrer"&gt;Foundation Capital&lt;/a&gt; recently called this shift AI's "trillion-dollar opportunity," arguing that enterprise value is migrating from traditional systems of record to systems that capture decision traces, the "why" behind every action.&lt;/p&gt;

&lt;p&gt;But here's the problem: agents deployed without proper context infrastructure are failing at scale, with customers reporting "1,000+ AI instances with no way to govern them" and "all kinds of agentic tools that none talk to each other" as stated in &lt;a href="https://metadataweekly.substack.com/p/context-graphs-are-a-trillion-dollar" rel="noopener noreferrer"&gt;Metadata Weekly&lt;/a&gt;. The issue isn't the AI models themselves, it's that agents lack the structured knowledge foundation they need to reason reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Missing Infrastructure: Relationship-Based Context
&lt;/h3&gt;

&lt;p&gt;According to &lt;a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/" rel="noopener noreferrer"&gt;MIT Sloan Management Review&lt;/a&gt;, 47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024. Even when agents don't hallucinate outright, they struggle with multi-step reasoning that requires connecting distant facts across systems. An agent might know a customer filed a complaint and know about a recent product defect and know the refund policy, but fail to connect these relationships to understand why an exception should be granted.&lt;/p&gt;

&lt;p&gt;As Prukalpa Sankar, co-founder of Atlan, frames it in her &lt;a href="https://atlan.com/know/closing-the-context-gap/" rel="noopener noreferrer"&gt;article&lt;/a&gt;: "In 2025, in the dawn of the AI era, context is king." Context Graphs provide this missing infrastructure by organizing information as an interconnected network of entities and relationships, enabling &lt;a href="https://dev.to/ai-solutions"&gt;AI agents&lt;/a&gt; to traverse meaningful connections, reason across multiple facts, and deliver explainable decisions.&lt;/p&gt;

&lt;p&gt;This comprehensive guide explains what Context Graphs are, how they work, and why they're becoming essential infrastructure for enterprise AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Context Graph? Definition, Use Cases &amp;amp; Implementation Guide
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" alt="Context Graph" width="1536" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Context Graphs Work
&lt;/h3&gt;

&lt;p&gt;Context Graphs transform raw data into a semantic network of nodes (entities like people or projects), directed edges (relationships such as "worked_on" or "depends_on"), and properties (key-value details on both). This structure enables AI agents to perform graph traversals, starting from a query node and following relevant edges, for dynamic context assembly and multi-hop reasoning, unlike rigid keyword or vector searches.&lt;/p&gt;

&lt;h4&gt;
  
  
  Core Components:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes:&lt;/strong&gt; Represent real-world entities (e.g. "ProjectX"). Each holds properties like name, type, or timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges:&lt;/strong&gt; Directed connections with types (e.g. → "worked_on" →) and properties (e.g. role: "lead", duration: "6 months"). Directions indicate flow, like cause-effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Properties:&lt;/strong&gt; Metadata attached to nodes/edges (e.g., confidence score on an edge), enabling filtered traversals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Traversal Process:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query Entry:&lt;/strong&gt; Input like "API security projects" matches starting nodes via properties or embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neighbor Expansion:&lt;/strong&gt; Fetch adjacent nodes/edges, prioritizing by relevance (e.g., recency, strength).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Hop Pathfinding:&lt;/strong&gt; Traverse 2-4 hops (e.g. Project → worked_on → Engineer → similar_to → AuthSystem), using algorithms like BFS or HNSW-inspired graphs for efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Assembly:&lt;/strong&gt; Aggregate paths into a subgraph, feeding it to LLMs for grounded reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Log the path for auditing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mirrors vector DB indexing (e.g. HNSW in Pinecone) but emphasizes relational paths over pure similarity.&lt;/p&gt;
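&lt;p&gt;The node/edge/property model and the bounded traversal above can be sketched in a few lines of Python (the entities and relationships are invented for illustration):&lt;/p&gt;

```python
from collections import deque

# Toy context graph: adjacency list of (edge_type, neighbor, edge_properties).
# All entities and relationships are invented for illustration.
graph = {
    "ProjectX":   [("worked_on_by", "Alice", {"role": "lead"})],
    "Alice":      [("also_worked_on", "AuthSystem", {})],
    "AuthSystem": [("depends_on", "OAuth2", {"version": "2.0"})],
    "OAuth2":     [],
}

def traverse(start, max_hops=3):
    """Breadth-first expansion up to max_hops, recording each path taken."""
    paths = []
    queue = deque([(start, [start], 0)])
    seen = {start}
    while queue:
        node, path, hops = queue.popleft()
        if hops == max_hops:
            continue
        for edge_type, neighbor, _props in graph.get(node, []):
            new_path = path + [f"-{edge_type}-", neighbor]
            paths.append(new_path)
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, new_path, hops + 1))
    return paths

subgraph_paths = traverse("ProjectX", max_hops=3)
```

&lt;p&gt;Each returned path doubles as the audit log for step 5: it records exactly which relationships led to each piece of assembled context.&lt;/p&gt;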

&lt;h4&gt;
  
  
  Example in Action:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Traditional Vector Search (e.g., Pinecone nearest-neighbor):&lt;/strong&gt; "API security projects" → Returns docs with similar embeddings (e.g. 3 keyword matches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Graph Traversal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;// Sample Cypher query: projects connected to the "API Security" topic
MATCH (p:Project)-[:RELATED_TO]-&amp;gt;(t:Topic {name: 'API Security'})-[*1..3]-(related)
RETURN *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start:&lt;/strong&gt; Projects tagged "API Security".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 1:&lt;/strong&gt; → worked_on_by → Engineers (properties: skills="OAuth").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 2:&lt;/strong&gt; Engineers → also_worked_on → AuthSystems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 3:&lt;/strong&gt; AuthSystems → depends_on → OAuthProtocols (properties: version="2.0").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Subgraph with projects, team, deps, contributors—plus path visualization for explainability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Characteristics of Context Graphs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship-Centric Design:&lt;/strong&gt; Context Graphs prioritize connections over isolated records. This makes it natural to understand how concepts relate, not just what they contain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Hop Reasoning:&lt;/strong&gt; The graph structure enables AI to connect distant concepts through intermediate relationships, reasoning across multiple steps just as humans do. Example: Connecting "customer complaint" → "product defect" → "supplier issue" → "quality control process" in three hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Context Assembly:&lt;/strong&gt; Rather than retrieving fixed search results, Context Graphs assemble context on the fly by traversing only the relationships relevant to your specific query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in Explainability:&lt;/strong&gt; Every AI decision can be traced back through its relationship path. You can see exactly how the system reached a conclusion, critical for enterprise and regulated environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Intelligence:&lt;/strong&gt; Context Graphs model sequences, dependencies, and cause-and-effect relationships over time, making them ideal for understanding evolving processes and events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Scalability:&lt;/strong&gt; Modern graph databases handle millions of entities while maintaining fast traversal and query performance at scale.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graph vs Knowledge Graph vs Vector Database
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Context Graph&lt;/th&gt;
&lt;th&gt;Knowledge Graph&lt;/th&gt;
&lt;th&gt;Vector Database&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Focus&lt;/td&gt;
&lt;td&gt;Contextual relationships for AI reasoning&lt;/td&gt;
&lt;td&gt;General knowledge representation&lt;/td&gt;
&lt;td&gt;Semantic similarity matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Type&lt;/td&gt;
&lt;td&gt;Multi-hop traversal&lt;/td&gt;
&lt;td&gt;Structured queries&lt;/td&gt;
&lt;td&gt;Nearest neighbor search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Dynamic AI context assembly&lt;/td&gt;
&lt;td&gt;Structured domain knowledge&lt;/td&gt;
&lt;td&gt;Semantic search, &lt;a href="https://dev.to/what-is/retrieval-augmented-generation"&gt;RAG&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explainability&lt;/td&gt;
&lt;td&gt;High (shows relationship paths)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (similarity scores only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Complexity&lt;/td&gt;
&lt;td&gt;Complex multi-step reasoning&lt;/td&gt;
&lt;td&gt;Medium complexity&lt;/td&gt;
&lt;td&gt;Simple similarity queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These technologies complement each other. Many advanced AI systems use Context Graphs for reasoning combined with &lt;a href="https://dev.to/blog/top-5-vector-databases"&gt;vector databases&lt;/a&gt; for semantic search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Context Graph Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Knowledge Management:&lt;/strong&gt; Connect projects, people, decisions, and outcomes across your organization. Instead of finding where files live, trace how work evolved, what decisions shaped results, and who has relevant expertise. This will reduce your knowledge discovery time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Customer Support:&lt;/strong&gt; Go beyond keyword matching. Connect customer history, product configurations, known issues, and documented resolutions to provide contextually accurate answers. This will reduce your ticket resolution time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scientific Research &amp;amp; Discovery:&lt;/strong&gt; Connect millions of research papers, creating networks of studies, methodologies, findings, and citations. Discover unexpected connections between seemingly unrelated fields. You can identify underexplored research areas by analyzing relationship patterns and citation gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance &amp;amp; Risk Management:&lt;/strong&gt; Map relationships between regulations, internal policies, business processes, and controls. When requirements change, trace exactly where those changes affect systems and workflows. This will reduce your compliance audit preparation time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthcare Diagnostics:&lt;/strong&gt; Connect symptoms, medical history, medications, genetic factors, and research findings. Enable diagnostic systems to reason across these relationships and identify conditions that isolated analysis might miss. This will improve diagnostic accuracy by surfacing relevant but non-obvious connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supply Chain Optimization:&lt;/strong&gt; Model your entire supply network (suppliers, components, products, logistics partners), enabling sophisticated scenario analysis and rapid disruption response. For example, when supply issues arise, you can quickly identify alternative suppliers by traversing compatibility, certification, and performance relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal Research &amp;amp; Analysis:&lt;/strong&gt; Map relationships between cases, statutes, legal principles, and precedents. Trace how legal concepts evolved across jurisdictions and time periods. This would reduce legal research time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Recommendations:&lt;/strong&gt; Go beyond "customers who bought this also bought that." Understand topical relationships, creator connections, and contextual relevance to deliver truly personalized recommendations. This would increase engagement through unexpected but relevant discoveries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial Risk Assessment:&lt;/strong&gt; Model relationships between entities, transactions, accounts, and market factors. Detect complex fraud patterns spanning multiple accounts and understand how risks cascade through connected entities. This would detect more fraud patterns than traditional rule-based systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software Development Intelligence:&lt;/strong&gt; Map relationships between functions, modules, dependencies, documentation, and issues. Understand how code changes ripple through your system before making modifications. This would reduce breaking changes through comprehensive impact analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Context Graphs for AI Agents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce AI Hallucinations:&lt;/strong&gt; Ground AI outputs in explicit, verifiable relationships rather than probabilistic pattern matching alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve Reasoning Accuracy:&lt;/strong&gt; When answers require connecting multiple facts across domains, Context Graphs significantly outperform retrieval-only approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable Explainable AI:&lt;/strong&gt; Expose the exact path the AI took through your knowledge graph, making decisions transparent and auditable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale Without Schema Rigidity:&lt;/strong&gt; Add new entity types and relationships without forcing disruptive schema migrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surface Hidden Insights:&lt;/strong&gt; Discover patterns and connections that are nearly impossible to detect in traditional table or document structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintain Context Across Interactions:&lt;/strong&gt; Preserve relationship context throughout multi-turn conversations, enabling more sophisticated AI interactions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Implement Context Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Select Your Graph Database
&lt;/h3&gt;

&lt;p&gt;Choose based on scale, query patterns, and infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Popular Options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j:&lt;/strong&gt; Most mature, enterprise-ready, excellent query language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Neptune:&lt;/strong&gt; Managed AWS service, good for existing AWS infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TigerGraph:&lt;/strong&gt; Best for massive scale and complex analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArangoDB:&lt;/strong&gt; Multi-model database with graph capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FalkorDB:&lt;/strong&gt; Ultra-fast in-memory graph database built on Redis, best for low-latency real-time applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision Factors:&lt;/strong&gt; Query complexity, data volume, team expertise, budget&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Design Your Relationship Schema
&lt;/h3&gt;

&lt;p&gt;The value of a Context Graph depends on modeling the right entities and relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt; Collaborate closely with domain experts who understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What entities matter in your domain&lt;/li&gt;
&lt;li&gt;Which relationships drive important decisions&lt;/li&gt;
&lt;li&gt;How information flows through your processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Schema (Customer Support):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entities:&lt;/strong&gt; Customer, Ticket, Product, Issue, Resolution, Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships:&lt;/strong&gt; reported_by, relates_to, resolved_with, escalated_to, similar_to&lt;/li&gt;
&lt;/ul&gt;
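&lt;p&gt;A minimal sketch of this schema as a property graph (all records are invented for illustration):&lt;/p&gt;

```python
# Sketch: the customer-support schema above as a tiny property graph.
# Node IDs, properties, and edges are invented sample data.
nodes = {
    "cust-1":   {"type": "Customer", "name": "Acme Corp"},
    "ticket-7": {"type": "Ticket", "status": "open"},
    "prod-2":   {"type": "Product", "name": "Gateway"},
    "issue-9":  {"type": "Issue", "summary": "token expiry"},
}
edges = [
    ("ticket-7", "reported_by", "cust-1", {}),
    ("ticket-7", "relates_to", "prod-2", {}),
    ("ticket-7", "similar_to", "issue-9", {"score": 0.87}),
]

def neighbors(node_id, edge_type=None):
    """Outgoing neighbors of a node, optionally filtered by relationship type."""
    return [dst for src, etype, dst, _props in edges
            if src == node_id and (edge_type is None or etype == edge_type)]
```

&lt;p&gt;A query like &lt;code&gt;neighbors("ticket-7", "similar_to")&lt;/code&gt; surfaces known issues similar to the open ticket, which is exactly the kind of hop an agent chains during traversal.&lt;/p&gt;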

&lt;h3&gt;
  
  
  Step 3: Build Entity Extraction
&lt;/h3&gt;

&lt;p&gt;Identify entities in your source data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Unstructured Text:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NLP pipelines&lt;/li&gt;
&lt;li&gt;Fine-tune LLMs for domain-specific entity recognition&lt;/li&gt;
&lt;li&gt;Implement human-in-the-loop validation for critical entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Structured Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Map existing database fields directly to graph entities&lt;/li&gt;
&lt;li&gt;Normalize entity references across systems&lt;/li&gt;
&lt;/ul&gt;
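
&lt;p&gt;For structured sources, the mapping step can be as simple as the sketch below. The field names (&lt;code&gt;crm_id&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;) are invented for the example; the point is normalizing references so the same customer from different systems maps to one node.&lt;/p&gt;

```python
# Illustrative sketch: map rows from an existing relational table to
# graph entities while normalizing entity references across systems.
# Field names ("crm_id", "email") are assumptions for the example.

def normalize_customer_id(row):
    # Prefer a stable unique identifier; fall back to a normalized email
    # so the same customer seen in different systems maps to one node.
    if row.get("crm_id"):
        return f"customer:{row['crm_id']}"
    return f"customer:{row['email'].strip().lower()}"

def rows_to_entities(rows):
    entities = {}
    for row in rows:
        eid = normalize_customer_id(row)
        # Last write wins here; a real pipeline would merge properties.
        entities[eid] = {"type": "Customer", "name": row.get("name")}
    return entities

rows = [
    {"crm_id": "42", "email": "ops@acme.com", "name": "Acme"},
    {"crm_id": None, "email": " Ops@Acme.com ", "name": "Acme (billing)"},
]
entities = rows_to_entities(rows)
```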

&lt;h3&gt;
  
  
  Step 4: Develop Relationship Extraction
&lt;/h3&gt;

&lt;p&gt;Beyond identifying entities, determine how they relate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based:&lt;/strong&gt; Define explicit patterns (if X mentions Y in context Z, create relationship R)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML-based:&lt;/strong&gt; Train models to identify relationship types from text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-based:&lt;/strong&gt; Use large language models for sophisticated relationship inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human validation:&lt;/strong&gt; Review critical relationship paths&lt;/li&gt;
&lt;/ul&gt;
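
&lt;p&gt;The rule-based approach can be sketched with ordinary regular expressions. The patterns and sample text below are invented for illustration; real pipelines would use richer patterns or combine this with the ML- and LLM-based approaches listed above.&lt;/p&gt;

```python
import re

# Rule-based sketch of the "if X mentions Y in context Z, create
# relationship R" pattern. Patterns and sample text are illustrative.

RULES = [
    # (regex with two capture groups, relationship type to create)
    (re.compile(r"(\w+) was resolved by (\w+)"), "resolved_with"),
    (re.compile(r"(\w+) escalated to (\w+)"), "escalated_to"),
]

def extract_relationships(text):
    found = []
    for pattern, rel in RULES:
        for src, dst in pattern.findall(text):
            found.append((src, rel, dst))
    return found

rels = extract_relationships("TICKET1 was resolved by KB7. TICKET1 escalated to Alice")
```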

&lt;h3&gt;
  
  
  Step 5: Enable Real-Time Updates
&lt;/h3&gt;

&lt;p&gt;Context Graphs are living systems requiring continuous updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement event-driven architecture for data changes&lt;/li&gt;
&lt;li&gt;Design incremental update patterns (don't rebuild everything)&lt;/li&gt;
&lt;li&gt;Maintain data lineage for troubleshooting&lt;/li&gt;
&lt;li&gt;Build conflict resolution for concurrent updates&lt;/li&gt;
&lt;/ul&gt;
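
&lt;p&gt;An incremental, event-driven update loop can be sketched as follows. The event shape and system names are assumptions; the idea is that each change event touches only the affected edges rather than rebuilding the graph, while lineage records where every edge came from.&lt;/p&gt;

```python
# Sketch of incremental, event-driven graph updates with lineage.
# Event fields and source-system names are assumptions for the example.

graph = {"edges": set(), "lineage": {}}

def apply_event(event):
    edge = (event["src"], event["rel"], event["dst"])
    if event["op"] == "upsert":
        graph["edges"].add(edge)
        graph["lineage"][edge] = event["source_system"]  # provenance
    elif event["op"] == "delete":
        graph["edges"].discard(edge)
        graph["lineage"].pop(edge, None)

# A small stream of change events: an edge is created, then retracted.
for ev in [
    {"op": "upsert", "src": "t1", "rel": "reported_by", "dst": "c1",
     "source_system": "ticketing"},
    {"op": "delete", "src": "t1", "rel": "reported_by", "dst": "c1",
     "source_system": "ticketing"},
]:
    apply_event(ev)
```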

&lt;h3&gt;
  
  
  Step 6: Optimize Query Performance
&lt;/h3&gt;

&lt;p&gt;Keep multi-hop queries responsive at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index critical properties used in traversals&lt;/li&gt;
&lt;li&gt;Cache frequent query patterns&lt;/li&gt;
&lt;li&gt;Limit traversal depth for expensive queries&lt;/li&gt;
&lt;li&gt;Denormalize selectively for performance-critical paths&lt;/li&gt;
&lt;li&gt;Use query profiling to identify bottlenecks&lt;/li&gt;
&lt;/ul&gt;
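
&lt;p&gt;Limiting traversal depth is the simplest of these controls to illustrate. The sketch below uses a plain adjacency dictionary rather than a real graph database, and caps a breadth-first traversal at a fixed number of hops so one expensive query cannot walk the whole graph.&lt;/p&gt;

```python
from collections import deque

# Depth-limited traversal sketch over an adjacency dict. The dict
# representation is illustrative, not a particular database's API.

def neighbors_within(adj, start, max_hops):
    seen = {start: 0}           # node -> hop distance from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue            # stop expanding at the hop limit
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    seen.pop(start)
    return seen                 # start node excluded from the result

adj = {"a": ["b"], "b": ["c"], "c": ["d"]}
```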

&lt;h3&gt;
  
  
  Step 7: Integrate Graph Analytics
&lt;/h3&gt;

&lt;p&gt;Enhance your Context Graph with advanced algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PageRank:&lt;/strong&gt; Identify influential nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Detection:&lt;/strong&gt; Find clusters of related entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path Finding:&lt;/strong&gt; Discover optimal routes through relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Embeddings:&lt;/strong&gt; Enable similarity calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link Prediction:&lt;/strong&gt; Suggest missing relationships&lt;/li&gt;
&lt;/ul&gt;
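
&lt;p&gt;To make the PageRank idea concrete, here is a tiny power-iteration sketch over an adjacency dictionary. In practice you would use the graph database's built-in algorithms; this toy version just shows why nodes that many others point to come out as "influential".&lt;/p&gt;

```python
# Tiny power-iteration PageRank over an adjacency dict, to illustrate
# "identify influential nodes". Production systems would use the graph
# database's built-in implementation instead of this sketch.

def pagerank(adj, damping=0.85, iterations=50):
    nodes = set(adj) | {n for targets in adj.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in adj.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        # Rank of dangling nodes (no outgoing edges) is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not adj.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

# Two nodes both point at "c", so "c" should rank highest.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": []})
```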

&lt;h2&gt;
  
  
  Implementation Challenges &amp;amp; Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;Practical Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graph Construction Complexity&lt;/td&gt;
&lt;td&gt;Building comprehensive graphs requires sophisticated entity and relationship extraction from unstructured data&lt;/td&gt;
&lt;td&gt;Start with a focused domain where you have high-quality structured data. Expand gradually as you build extraction capabilities.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Design Expertise&lt;/td&gt;
&lt;td&gt;Effective schemas demand deep domain understanding; poor design leads to unusable graphs&lt;/td&gt;
&lt;td&gt;Run workshops with subject matter experts. Build iteratively: start simple, refine based on actual query patterns.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance at Scale&lt;/td&gt;
&lt;td&gt;Graph traversals become expensive for complex multi-hop queries as data grows&lt;/td&gt;
&lt;td&gt;Invest in proper indexing, implement query optimization, use caching strategically, and set traversal depth limits (2-4 hops).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity Resolution&lt;/td&gt;
&lt;td&gt;Identifying that different mentions refer to the same entity is difficult but critical for accuracy&lt;/td&gt;
&lt;td&gt;Implement fuzzy matching, leverage unique identifiers where available, use ML-based entity resolution tools, maintain a golden record system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality Maintenance&lt;/td&gt;
&lt;td&gt;As graphs grow to millions of relationships, maintaining accuracy becomes challenging&lt;/td&gt;
&lt;td&gt;Implement automated validation rules, schedule periodic audits, track data lineage, enable user feedback loops for corrections.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration Complexity&lt;/td&gt;
&lt;td&gt;Incorporating Context Graphs into existing systems requires architectural changes and API design&lt;/td&gt;
&lt;td&gt;Build a graph API layer that existing systems can call. Start with read-only integration, add write capabilities once proven.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill Gap&lt;/td&gt;
&lt;td&gt;Shortage of professionals experienced in graph technologies and query languages like Cypher&lt;/td&gt;
&lt;td&gt;Train existing team members (graph databases are learnable, similar to SQL), hire contractors for initial setup, or partner with CloudRaft for implementation guidance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Management&lt;/td&gt;
&lt;td&gt;Context Graphs add infrastructure costs for databases, extraction pipelines, and real-time analytics&lt;/td&gt;
&lt;td&gt;Start with a high-value use case to demonstrate ROI. Scale infrastructure based on actual usage patterns. Monitor cost per query and optimize expensive operations.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Context Graph Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model relationships that drive decisions:&lt;/strong&gt; Don't create relationships just because you can. Focus on connections that enable valuable reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep entity types focused:&lt;/strong&gt; Avoid creating overly granular entity types. Each entity type should represent a meaningful concept in your domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make relationships meaningful:&lt;/strong&gt; Generic relationships like "related_to" provide little value. Use specific relationship types: "depends_on," "caused_by," "replaces."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance normalization and performance:&lt;/strong&gt; Highly normalized graphs are elegant but can be slow. Denormalize strategically for frequently traversed paths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version your schema:&lt;/strong&gt; Graph schemas evolve. Maintain version history and migration paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limit traversal depth:&lt;/strong&gt; Set maximum hops to prevent runaway queries. Most valuable relationships are within 2-4 hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter early:&lt;/strong&gt; Apply constraints as early as possible in your traversal to reduce the working set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use indexed properties:&lt;/strong&gt; Index properties you filter on frequently. This dramatically improves query performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache common patterns:&lt;/strong&gt; Identify frequently executed query patterns and cache results with appropriate TTLs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
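
&lt;p&gt;Caching common patterns is straightforward to sketch. The TTL cache below is illustrative (a real deployment would likely use Redis or the database's own query cache); the clock is injectable only so the expiry behaviour is easy to test.&lt;/p&gt;

```python
import time

# Minimal TTL cache sketch for frequent query patterns. Keys and
# queries are illustrative; the clock is injectable for testability.

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]            # fresh cache hit
        value = compute()            # miss or expired: recompute
        self.store[key] = (now + self.ttl, value)
        return value
```

&lt;p&gt;Pairing a cache like this with per-pattern TTLs lets you keep hot multi-hop results fast without serving stale data for long.&lt;/p&gt;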

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement validation rules:&lt;/strong&gt; Define constraints on entity properties and relationship validity to maintain quality automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track provenance:&lt;/strong&gt; Know where each entity and relationship came from. This enables troubleshooting and quality assessment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable feedback loops:&lt;/strong&gt; Allow users to report incorrect relationships. Use this feedback to improve extraction pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schedule audits:&lt;/strong&gt; Periodically review graph quality, especially for critical relationship paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graphs + LLMs: A Powerful Combination
&lt;/h2&gt;

&lt;p&gt;Context Graphs and Large Language Models (LLMs) complement each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-Augmented Generation (GAG):&lt;/strong&gt; Retrieve relevant subgraphs from your Context Graph and provide them as structured context to LLMs. This reduces hallucinations and grounds responses in your actual knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-Assisted Graph Construction:&lt;/strong&gt; Use LLMs to extract entities and relationships from unstructured text, building your Context Graph more quickly than rule-based approaches alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainable LLM Reasoning:&lt;/strong&gt; When LLMs generate responses based on graph context, you can trace exactly which relationships influenced the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Retrieval:&lt;/strong&gt; Combine vector search (for semantic similarity) with graph traversal (for relationship reasoning) to get the best of both approaches.&lt;/p&gt;
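
&lt;p&gt;The hybrid pattern can be sketched in a few lines. The vectors, document names, and toy graph below are invented for illustration; a real system would use a vector database and a graph store rather than in-memory dictionaries.&lt;/p&gt;

```python
import math

# Hybrid retrieval sketch: rank candidates by vector similarity, then
# expand the top hit's graph neighborhood for relationship context.
# All vectors and names are made up for the example.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query_vec, doc_vecs, adj):
    # 1) Semantic step: best match by embedding similarity.
    best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
    # 2) Relational step: pull directly connected entities for context.
    context = set(adj.get(best, []))
    return best, context

doc_vecs = {"doc_login_bug": [1.0, 0.1], "doc_billing": [0.0, 1.0]}
adj = {"doc_login_bug": ["Product:SSO", "Resolution:patch-1.2"]}
best, context = hybrid_retrieve([0.9, 0.2], doc_vecs, adj)
```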

&lt;h2&gt;
  
  
  Measuring Context Graph Success
&lt;/h2&gt;

&lt;p&gt;Track these metrics to assess your Context Graph implementation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response time:&lt;/strong&gt; Median and 95th percentile query latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Queries per second at peak usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate:&lt;/strong&gt; Percentage of queries served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity accuracy:&lt;/strong&gt; Percentage of correctly identified entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationship precision:&lt;/strong&gt; Percentage of relationships that are actually valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage:&lt;/strong&gt; Percentage of domain knowledge captured in the graph&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time saved:&lt;/strong&gt; Reduction in research/discovery time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy improvement:&lt;/strong&gt; Better decision quality from enhanced reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reduction:&lt;/strong&gt; Decreased manual effort for knowledge work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User satisfaction:&lt;/strong&gt; NPS or satisfaction scores for graph-powered features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate:&lt;/strong&gt; Reduction in factually incorrect AI outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning accuracy:&lt;/strong&gt; Percentage of multi-hop questions answered correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Percentage of AI decisions with traceable reasoning paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future of Context Graphs
&lt;/h2&gt;

&lt;p&gt;Context Graphs are evolving rapidly:&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Trends
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph + Vector Hybrid Systems:&lt;/strong&gt; Combining semantic vector search with graph reasoning for more sophisticated AI systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Schema Evolution:&lt;/strong&gt; ML systems that automatically suggest new entity types and relationships based on usage patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Graph Analytics:&lt;/strong&gt; Stream processing for graph updates and real-time pattern detection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Modal Graphs:&lt;/strong&gt; Incorporating images, audio, and video as first-class entities with rich relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Graphs:&lt;/strong&gt; Connecting knowledge graphs across organizational boundaries while maintaining privacy and security.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started with Context Graphs
&lt;/h2&gt;

&lt;p&gt;Ready to implement Context Graphs in your AI systems?&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Small, Think Big
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Identify a high-value use case where relationship reasoning matters&lt;/li&gt;
&lt;li&gt;Map your initial schema with domain experts (10-20 entity types is plenty to start)&lt;/li&gt;
&lt;li&gt;Build a proof of concept with a subset of your data&lt;/li&gt;
&lt;li&gt;Measure impact against your baseline approach&lt;/li&gt;
&lt;li&gt;Iterate and expand based on what you learn&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Common Starting Points
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support:&lt;/strong&gt; Connect tickets, customers, products, and resolutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal knowledge:&lt;/strong&gt; Link documents, projects, people, and decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Map regulations, policies, processes, and controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product development:&lt;/strong&gt; Connect features, dependencies, bugs, and releases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Context Graphs represent a fundamental shift in how AI systems understand and reason about information. By capturing not just data, but the rich network of relationships that gives data meaning, they unlock AI capabilities that were previously unattainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More accurate reasoning through multi-hop traversal&lt;/li&gt;
&lt;li&gt;Explainable decisions via traceable relationship paths&lt;/li&gt;
&lt;li&gt;Reduced hallucinations by grounding in verifiable connections&lt;/li&gt;
&lt;li&gt;Scalable knowledge management without rigid schema constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As AI becomes increasingly central to enterprise operations, Context Graphs will evolve from competitive advantage to foundational infrastructure. Organizations that build graph-based AI capabilities now will be well-positioned to lead in an AI-driven future.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt Context Graphs; it's when and where to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expert Help with Context Graph Implementation
&lt;/h2&gt;

&lt;p&gt;Building Context Graphs requires specialized expertise in graph databases, knowledge representation, and AI integration. CloudRaft provides complimentary AI consultations to help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assess feasibility for your specific use cases&lt;/li&gt;
&lt;li&gt;Design optimal schemas for your domain&lt;/li&gt;
&lt;li&gt;Architect scalable infrastructure that grows with your needs&lt;/li&gt;
&lt;li&gt;Integrate with existing AI systems seamlessly&lt;/li&gt;
&lt;li&gt;Train your team on graph technologies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between a Context Graph and a Knowledge Graph?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context Graphs are specialized knowledge graphs optimized for dynamic context assembly in AI systems. While knowledge graphs broadly represent domain knowledge, Context Graphs focus specifically on enabling AI reasoning through relationship traversal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use Context Graphs with vector databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolutely. Many advanced AI systems use both: vector databases for semantic similarity search and Context Graphs for relationship reasoning. This hybrid approach provides the best of both worlds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much data do I need to start?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can start small. Even a few thousand entities with well-modeled relationships can demonstrate value. Focus on quality relationships over quantity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the typical implementation timeline?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a focused proof of concept: 4-8 weeks. For production-ready implementation: 3-6 months. Timeline depends on data complexity, schema design, and integration requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need specialized graph database skills?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While helpful, they're not mandatory. Graph query languages like Cypher (Neo4j) are learnable, similar to SQL. Consider training existing team members or partnering with experts for initial setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do Context Graphs reduce AI hallucinations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By grounding AI responses in explicit, verifiable relationships rather than relying solely on probabilistic pattern matching from training data. The AI can only traverse relationships that actually exist in your graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the ROI of implementing Context Graphs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It varies by use case, but organizations typically see reductions in knowledge discovery time, improvements in AI reasoning accuracy, and less manual research effort. ROI is highest for knowledge-intensive workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Context Graphs work with my existing databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Context Graphs complement existing databases. You can keep transactional data in relational databases and build Context Graphs for relationship reasoning, syncing data between systems.&lt;/p&gt;

</description>
      <category>contextgraph</category>
    </item>
    <item>
      <title>Real-Time Postgres to ClickHouse CDC: Supercharge Analytics with PeerDB</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 27 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</link>
      <guid>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</guid>
      <description>&lt;p&gt;If you are running a heavy SaaS platform, you eventually hit a wall with PostgreSQL. It's fantastic for transactional data (OLTP), but when you try to run complex analytical queries on millions of rows, things slow down.&lt;/p&gt;

&lt;p&gt;We recently tackled this exact problem for a client handling high-volume messaging operations. Their analytics dashboards ran analytical queries directly against an AWS Aurora PostgreSQL setup, and they needed a solution that was fast, reliable, and real-time.&lt;/p&gt;

&lt;p&gt;Here is how we solved it by building a high-performance replication pipeline from Postgres to ClickHouse using PeerDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" alt="Analytics" width="1918" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ClickHouse?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is the superior choice for analytics because it is a purpose-built OLAP database designed for high-performance data processing, unlike PostgreSQL, which is a row-based OLTP system better suited for transactional workloads. Its columnar storage architecture allows it to handle massive datasets with sub-second query latency, where standard Postgres deployments often hit performance walls. By switching to ClickHouse, you gain the ability to ingest millions of rows and execute complex analytical queries almost instantly, solving the performance limitations inherent in using PostgreSQL for analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CDC Landscape: Why We Chose PeerDB
&lt;/h2&gt;

&lt;p&gt;Real-time Change Data Capture (CDC) is the standard for moving data without slowing down your primary database. But how do you implement it? Here are the primary CDC options for replicating data from PostgreSQL to ClickHouse that we considered for our implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PeerDB
&lt;/h3&gt;

&lt;p&gt;PeerDB is a specialised tool designed specifically for PostgreSQL to ClickHouse replication. It is the solution we ultimately chose for this project due to its balance of performance and simplicity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It can run as a Docker container stack (PeerDB Server, UI, etc.) and connects directly to the Postgres logical replication slot.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Performance:&lt;/strong&gt; PeerDB was &lt;a href="https://docs.peerdb.io/why-peerdb" rel="noopener noreferrer"&gt;found&lt;/a&gt; to be significantly more performant than the other solutions we evaluated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialised Features:&lt;/strong&gt; It handles initial snapshots (bulk loads) and real-time streaming (CDC) seamlessly. It also supports specific optimisations, such as dividing tables into multiple "mirrors" to speed up initial loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; It avoids the complexity of managing a full Kafka cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Edition Limits:&lt;/strong&gt; The community edition lacks built-in UI authentication, so you need private network access, a VPN, or an external authentication layer in front of the UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Altinity Sink Connector for ClickHouse
&lt;/h3&gt;

&lt;p&gt;This is a lightweight, single-executable solution often used to avoid the complexity of Kafka. It is developed by Altinity, a major ClickHouse contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It runs as a standalone binary or within a Kafka Connect environment. It connects to Postgres and replicates data to ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Simplicity:&lt;/strong&gt; It eliminates the need for a Kafka Connect cluster or ZooKeeper, running as a single executable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Replication:&lt;/strong&gt; Offers a direct path from Postgres to ClickHouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Schema:&lt;/strong&gt; Can automatically read the Postgres schema and create equivalent ClickHouse tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; We tested this option but rejected it because it did not meet our performance requirements compared to PeerDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Debezium and Kafka
&lt;/h3&gt;

&lt;p&gt;This is the industry-standard approach for general-purpose CDC, involving a chain of distinct complex components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; Postgres → Debezium (Kafka Connect) → Kafka Broker → ClickHouse Sink → ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; The message broker (Kafka) decouples the source from the destination, allowing multiple consumers to read the same stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Extremely robust for guaranteed message delivery and exactly-once processing (if configured correctly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Complexity:&lt;/strong&gt; Requires managing ZooKeeper, Kafka brokers, and schema registries. Avoiding this Kafka Connect framework complexity was an explicit goal of our design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; Significant infrastructure footprint compared to direct replication tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why PeerDB?
&lt;/h3&gt;

&lt;p&gt;We initially tested the Altinity connector but ultimately chose PeerDB, mainly for the following reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; In our testing, PeerDB offered superior performance for our specific workload compared to other connectors we tried.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialisation:&lt;/strong&gt; It is purpose-built for Postgres-to-ClickHouse replication, handling data type mapping and initial snapshots smoothly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We opted for a "Keep It Simple" approach to infrastructure. While Kubernetes (EKS) is great, we deployed this on Amazon EC2 to maintain full control over the infrastructure and cost. If you have a team that can handle EKS for you, then that might be a better option. Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;discuss with our team&lt;/a&gt; to find the right solutions for your workload and team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; AWS Aurora (PostgreSQL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Pipeline:&lt;/strong&gt; PeerDB running via Docker Compose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; A ClickHouse cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability Design
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" alt="High Availability Design" width="1920" height="1080"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Altinity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To ensure we never lost data, we configured a ClickHouse cluster with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 Keeper Nodes:&lt;/strong&gt; Using m6i.large instances. These replace ZooKeeper for coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 ClickHouse Server Nodes:&lt;/strong&gt; Using r6i.2xlarge instances for heavy lifting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication:&lt;/strong&gt; We used ReplicatedMergeTree to ensure data exists on multiple nodes for safety&lt;/li&gt;
&lt;/ul&gt;
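
&lt;p&gt;For reference, replicated tables on such a cluster use the &lt;code&gt;ReplicatedMergeTree&lt;/code&gt; engine, with the Keeper nodes coordinating the replicas. The table, cluster, and column names below are illustrative, not taken from the actual deployment:&lt;/p&gt;

```sql
-- Illustrative DDL: each server keeps a replica of the table,
-- coordinated through the Keeper ensemble. Names are placeholders.
CREATE TABLE events ON CLUSTER 'main'
(
    event_id   UInt64,
    created_at DateTime,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (created_at, event_id);
```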

&lt;h3&gt;
  
  
  ClickHouse Cluster
&lt;/h3&gt;

&lt;p&gt;We automated the deployment using Ansible to configure the hardware-aware settings. A cool feature of our setup is that the configuration automatically calculates memory limits and cache sizes based on the EC2 instance's RAM (e.g., leaving 25% for the OS and giving 75% to ClickHouse). We wrote about this earlier in our &lt;a href="https://www.cloudraft.io/blog/building-enterprise-grade-clickhouse-with-ansible" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing PeerDB
&lt;/h3&gt;

&lt;p&gt;We used Docker Compose to spin up the PeerDB stack. One specific nuance we encountered was configuring the storage abstraction. PeerDB uses MinIO (S3 compatible) for intermediate storage. We had to explicitly set the &lt;code&gt;PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3&lt;/code&gt; environment variable in our &lt;code&gt;docker-compose.yml&lt;/code&gt; to point to our MinIO host IP.&lt;/p&gt;
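
&lt;p&gt;As an illustration, the override looked roughly like this. The service name and the MinIO address are placeholders and may differ from the stock PeerDB compose file; only the environment variable name is the one mentioned above:&lt;/p&gt;

```yaml
# Illustrative docker-compose override (service names and the MinIO
# address are placeholders): point PeerDB's S3 abstraction at MinIO.
services:
  minio:
    image: minio/minio
    command: server /data
  flow-worker:
    environment:
      PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3: "http://10.0.1.15:9000"
```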

&lt;p&gt;With the stack running, set up the peers that connect to the source and the destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the "Mirror"
&lt;/h3&gt;

&lt;p&gt;PeerDB uses a concept called &lt;strong&gt;Mirrors&lt;/strong&gt; to handle the CDC pipeline. We set up the connection by defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Source):&lt;/strong&gt; Our Aurora Postgres instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Destination):&lt;/strong&gt; Our ClickHouse cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mirror:&lt;/strong&gt; The actual replication job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PeerDB supports different modes of streaming: log-based (CDC), cursor-based (timestamp or integer), and XMIN-based. In our implementation, we used log-based (CDC) replication.&lt;/p&gt;

&lt;p&gt;To optimise the initial data load, we didn't just dump everything at once. We divided our tables into multiple "batches" (mirrors) that ran in parallel, staggering their start times so we would not put a high load on the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Gotchas" From the Trenches
&lt;/h2&gt;

&lt;p&gt;No migration is perfect. Here are three issues we faced so you can avoid them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The "Too Many Parts" Error in ClickHouse&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ClickHouse loves big batches of data. If PeerDB syncs records one by one or in tiny groups too quickly, ClickHouse can't merge the data parts fast enough in the background. We saw errors like &lt;code&gt;Too many parts... Merges are processing significantly slower than inserts&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You may need to tune the batch size or frequency to slow down the inserts slightly, allowing ClickHouse's merge process to catch up.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Aurora Failovers Break Things&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If AWS Aurora triggers a failover, the IP/DNS resolution might shift. We found that this can break the peering connection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You have to edit the peer configuration to point to the new primary host and resync the mirror.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Security on Community Edition&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We used the community edition of PeerDB. Be aware that it does not have built-in authentication for the UI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; Do not expose the UI to the public internet. We access it over a private IP/VPN, or add an authentication layer in front using a third-party product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;By successfully moving analytical queries off the primary Postgres instance and into ClickHouse, we achieved the sub-second query performance our client required. PeerDB provided us with a robust, real-time CDC solution without the operational headache of managing a Kafka cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways on the Postgres + ClickHouse + PeerDB Combination:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; You get the best of both worlds: PostgreSQL handles fast, reliable transactional (OLTP) workloads, while ClickHouse takes on complex analytical (OLAP) queries with unmatched speed. This separation prevents slow analytical queries from impacting your core application database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Simplicity:&lt;/strong&gt; PeerDB acts as a purpose-built, high-performance bridge. It removes the need to deploy and manage a complex, multi-component CDC stack like Debezium and Kafka, significantly reducing infrastructure complexity and operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; This architecture allows your analytics layer (ClickHouse) to scale independently from your transactional layer (Postgres), ensuring that as your data volumes grow, you maintain both OLTP stability and OLAP speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; By offloading analytical processing, you can often run a smaller, more cost-effective Postgres instance dedicated to its core function, while leveraging ClickHouse's efficiency for massive-scale querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Are you looking to improve your analytics pipeline? Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;book a call&lt;/a&gt; with us to discuss your case.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>analytics</category>
      <category>clickhouse</category>
      <category>peerdb</category>
    </item>
    <item>
      <title>Why high performance storage is important for AI Cloud Build</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</link>
      <guid>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</guid>
<description>&lt;p&gt;The AI cloud market is experiencing exceptionally rapid growth worldwide, with the latest reports projecting annual growth rates between 28% and 40% over the next five years. According to various analyst reports, the market may reach $647 billion by 2030. The surge in AI Cloud adoption, GPU-as-a-service platforms, and enterprise interest in AI “factories” has created new pressures and opportunities for product engineering and IT leaders. Regardless of which public cloud or private cluster you choose, one key differentiator sets each AI and HPC solution apart: the &lt;strong&gt;performance of storage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While leading clouds often use the same GPUs and servers, the way data flows—between compute, network, storage, and persistent layers—determines everything from training speed to scalability. Understanding storage fundamentals will help you architect or select the right solution. We have previously covered &lt;a href="https://www.cloudraft.io/blog/how-to-build-ai-cloud" rel="noopener noreferrer"&gt;how to build an AI cloud&lt;/a&gt;, and with hands-on experience in this space, we share our perspective on storage in this article.&lt;/p&gt;

&lt;p&gt;Business and technology leaders now recognize that real-world AI breakthroughs require infrastructure with high bandwidth, low latency, and extreme parallelism. As deep learning and data-intensive analytics move from labs to production, GPU clusters run ever-larger models on ever-growing datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Storage Matter in AI Workloads?
&lt;/h2&gt;

&lt;p&gt;Storage plays an important role across the entire AI lifecycle. Let’s look at the three major stages: data preparation, training &amp;amp; tuning, and inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Scalable and performant storage to support transforming data for AI use&lt;/li&gt;
&lt;li&gt;Protecting valuable raw and derived training data sets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing large structured and unstructured datasets in many formats&lt;/li&gt;
&lt;li&gt;Scaling under the pressure of map-reduce like distributed processing often used for transforming data for AI&lt;/li&gt;
&lt;li&gt;Support for file and object access protocols to ease integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training &amp;amp; Tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Providing training data to keep expensive GPUs fully utilized&lt;/li&gt;
&lt;li&gt;Saving and restoring model checkpoints to protect training investments&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep training GPU resources busy&lt;/li&gt;
&lt;li&gt;Minimizing time to save checkpoint data to limit training pauses&lt;/li&gt;
&lt;li&gt;Scaling to meet demands of data parallel training in large clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Safely storing and quickly delivering model artifacts for inference services&lt;/li&gt;
&lt;li&gt;Providing data for batch inferencing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reliably storing expensive to produce model artifact data&lt;/li&gt;
&lt;li&gt;Minimizing model artifact read latency for quick inference deployment&lt;/li&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep inference GPU resources busy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High-Performance Storage Is Critical for Checkpointing in AI Training
&lt;/h3&gt;

&lt;p&gt;Checkpointing is a critical process in large-scale AI training, enabling models to periodically save and restore their state as training progresses. As model and dataset sizes expand into the billions of parameters and petabytes of data, this operation becomes increasingly demanding for storage infrastructure. Efficient checkpointing helps safeguard training progress against inevitable hardware failures and disruptions, while also allowing for fine-tuning, experimentation, and rapid recovery. However, frequent checkpointing can introduce performance overhead due to pauses in computation and intensive reads/writes to persistent storage, especially when distributed clusters grow to thousands of accelerators.&lt;/p&gt;

&lt;p&gt;To address these challenges, modern AI storage architecture leverages strategies such as asynchronous checkpointing—where checkpoints are saved in the background, minimizing idle time—and hierarchical distribution, reducing bottlenecks by having leader nodes manage data transfers within clusters. The result is faster training throughput, lower risk of lost work, and more efficient use of compute resources. Optimizing for checkpoint size, frequency, and concurrent access patterns is vital to ensure high throughput and low latency, making high-performance scalable storage systems an indispensable foundation for reliable, cost-effective AI model training at scale. You can read more about it in this &lt;a href="https://aws.amazon.com/blogs/storage/architecting-scalable-checkpoint-storage-for-large-scale-ml-training-on-aws/" rel="noopener noreferrer"&gt;AWS article&lt;/a&gt;.&lt;/p&gt;
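&lt;p&gt;The asynchronous-checkpointing idea above can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: real frameworks shard the state across ranks and stream it, but the core trick is the same, a fast in-memory snapshot followed by a background write:&lt;/p&gt;

```python
import copy
import threading

def async_checkpoint(state: dict, write_fn):
    """Snapshot the model state in memory (a brief pause), then persist it
    on a background thread so the training loop is not blocked by storage I/O."""
    snapshot = copy.deepcopy(state)               # only the copy blocks training
    t = threading.Thread(target=write_fn, args=(snapshot,))
    t.start()                                     # slow storage write runs aside
    return t                                      # join() before the next checkpoint
```

&lt;p&gt;Training can mutate &lt;code&gt;state&lt;/code&gt; immediately after the call returns; the snapshot taken before the thread started is what gets written, so the checkpoint stays consistent.&lt;/p&gt;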

&lt;h2&gt;
  
  
  What Kind of Storage Is Needed for AI and HPC Workloads?
&lt;/h2&gt;

&lt;p&gt;For AI and HPC workloads, the demands extend well beyond ordinary enterprise storage. Key requirements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel File Systems:&lt;/strong&gt; Multiple servers and GPUs need to access datasets at the same time. Systems such as Lustre, WEKA, VAST Data, CephFS, and DDN Infinia enable concurrent access, avoiding bottlenecks and improving throughput for distributed workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Throughput and Low Latency:&lt;/strong&gt; Training GPT-like models or running simulations generates millions of read/write operations per second. Storage must deliver bandwidth in the tens to hundreds of GB/s and latency below 1ms, so that GPUs remain fed and productive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSIX Compliance:&lt;/strong&gt; Many AI frameworks and HPC applications expect a traditional POSIX interface for seamless operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability and Elasticity:&lt;/strong&gt; Petabyte-scale capacity is the norm. Modern solutions allow you to scale horizontally, adding performance and capacity as demand grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity and Reliability:&lt;/strong&gt; Enterprise-grade AI and HPC workloads need uninterrupted access to their data. Redundancy, fault tolerance, and robust disaster recovery features matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Typical Storage Specifications and Requirements
&lt;/h2&gt;

&lt;p&gt;For a modern AI Cloud, AI factory, or GPU Cloud infrastructure, expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth:&lt;/strong&gt; 15–512 GB/s (or higher for top-tier solutions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOPS:&lt;/strong&gt; From 20,000 (entry) up to 800,000+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Sub-1ms to 2ms for parallel file systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity:&lt;/strong&gt; 100TB to multi-petabyte scale, often with tiering to object storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocols:&lt;/strong&gt; NFSv3/v4.1, SMB, Lustre, S3 (for hybrid and archival storage), HDFS, and native REST APIs&lt;/li&gt;
&lt;/ul&gt;
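&lt;p&gt;A back-of-the-envelope calculation shows why these bandwidth figures matter for checkpointing. The 1 TB checkpoint size below is illustrative, and the two bandwidth values are taken from the ends of the range above:&lt;/p&gt;

```python
def checkpoint_stall_s(checkpoint_gb: float, write_gbps: float) -> float:
    """Seconds of synchronous stall to persist one checkpoint at a given
    sustained write bandwidth (decimal GB and GB/s)."""
    return checkpoint_gb / write_gbps

# Illustrative: a 1 TB checkpoint at 15 GB/s vs 200 GB/s sustained writes.
slow = checkpoint_stall_s(1000, 15)    # ~66.7 s of stall per checkpoint
fast = checkpoint_stall_s(1000, 200)   # 5 s of stall per checkpoint
```

&lt;p&gt;At frequent checkpoint intervals, that difference compounds into hours of idle GPU time per day, which is why sequential write bandwidth is the headline number for training storage.&lt;/p&gt;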

&lt;p&gt;On-premises or hybrid deployments may include NVMe storage, CXL-enabled expansion, and advanced cooling to support high-density GPU clusters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;AI Lifecycle Stage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Considerations&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reading Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Accommodate wide range of read BW requirements and IO access patterns across different AI models &lt;br&gt; - Deliver large amounts of read BW to single GPU servers for most demanding models&lt;/td&gt;
&lt;td&gt;- Use high performance, all-flash storage to meet needs &lt;br&gt; - Leverage RDMA capable storage protocols, when possible, for most demanding requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saving Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential write bandwidth for quickly saving checkpoints&lt;br&gt; - Handle multiple large sequential write streams to separate files, especially in same directory&lt;/td&gt;
&lt;td&gt;- Understand checkpoint implementation details and behaviors for expected AI workloads&lt;br&gt; - Determine time limits for completing checkpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Restoring Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential read bandwidth for quickly restoring checkpoints &lt;br&gt; - Handle multiple large sequential read streams to same checkpoint file&lt;/td&gt;
&lt;td&gt;- Understand how often checkpoint restoration will be required &lt;br&gt; - Determine acceptable time limits for restoration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Servicing GPU Clusters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Meet performance requirements for mixed storage workloads from multiple simultaneous AI jobs &lt;br&gt; - Scale capacity and performance as GPU clusters grow with business needs&lt;/td&gt;
&lt;td&gt;- Consider scale-out storage platforms that can increase performance and capacity while providing shared access to data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: snia.org - John Cardente Talk&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Options for AI Cloud and HPC Workloads
&lt;/h2&gt;

&lt;p&gt;To achieve next-generation AI and HPC results, enterprises and product teams should evaluate both commercial vendors and open source platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Source Parallel File Systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (CephFS):&lt;/strong&gt; Highly flexible, POSIX-compliant, scales from small clusters to exabytes. Used in academic and commercial AI labs for robust file and object storage. Many early stage AI factories are using solutions built on top of Ceph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lustre / DDN Lustre:&lt;/strong&gt; Optimized for large-scale HPC and AI workloads. Used in many supercomputing and enterprise environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBM Spectrum Scale (GPFS):&lt;/strong&gt; High-performing parallel file system, widely used in science and industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Commercial AI and HPC Storage Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Delivers extreme performance for AI storage, marrying parallel file system performance with the economics of NAS and archive. VAST has been very popular and has been adopted by leading AI Cloud players such as &lt;a href="https://www.vastdata.com/customers/coreweave" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt; and Lambda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Highly optimized metadata and file access for AI and multi-tenant clusters; helps overcome bottlenecks experienced in legacy systems. Similar to VAST, WEKA has customers such as Yotta, Cohere, and &lt;a href="http://Together.ai" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Industry leader for research, hybrid file-object storage, and scalable data intelligence for model training and analytics. DDN’s solutions, like Infinia and xFusionAI, focus on both performance and efficiency for GPU workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure Storage, Cloudian, IBM, Dell:&lt;/strong&gt; Also recognized for delivering enterprise-grade AI/HPC storage platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many solutions integrate natively with popular public clouds (AWS S3, Google Cloud Storage, Azure Blob)—enabling hybrid architectures and seamless data movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Examples and Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (Open Source):&lt;/strong&gt; Used by research labs and private cloud teams to build petabyte-scale, resilient storage for AI and HPC clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Enterprise deployments often leverage WEKA for AI factories—a system with hundreds of GPUs running concurrent training jobs—thanks to its elastic scaling and metadata performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Designed to deliver high throughput for both small and large file operations, increasingly chosen for generative AI workloads and data-intensive analytics in fintech, healthcare, and media.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Supports hybrid deployment strategies; offers both parallel file system and object storage in a unified stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems such as Lustre and Spectrum Scale facilitate near-instant recovery, zero-data loss architectures, and compliance for regulated sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying the Best Storage for Your Needs
&lt;/h2&gt;

&lt;p&gt;Because every cloud environment is unique, the first step in selecting the right solution is to establish a baseline through hardware benchmarking. MLCommons' benchmarking tools can be run directly on your hardware to gather reliable performance data.&lt;/p&gt;

&lt;p&gt;The latest MLPerf Storage v2.0 &lt;a href="https://mlcommons.org/benchmarks/storage/" rel="noopener noreferrer"&gt;benchmark results&lt;/a&gt; from MLCommons highlight the increasingly critical role of storage performance in the scalability of AI training systems. With participation nearly doubling compared to the previous v1.0 round, the industry’s rapid innovation is evident—storage solutions now support around twice the number of accelerators as before. The new iteration includes checkpointing benchmarks, which address real-world scenarios faced by large AI clusters, where frequent hardware failures can disrupt training jobs. By simulating such events and evaluating storage recovery speeds, MLPerf Storage v2.0 offers valuable insights into how checkpointing helps ensure uninterrupted performance in sprawling datacenter environments.&lt;/p&gt;

&lt;p&gt;A broad spectrum of storage technologies took part in the benchmark—ranging from local storage, in-storage accelerators, to object stores—reflecting the diversity of approaches in AI infrastructure. Over 200 results were submitted by 26 organizations worldwide, many participating for the first time, which showcases the growing global momentum behind the MLPerf initiative. The benchmarking framework—open-source and rigorously peer-reviewed—provides unbiased, actionable data for system architects, datacenter managers, and software vendors. MLPerf Storage is a go-to resource for designing resilient, high-performance AI training systems in a rapidly evolving technology landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Building Your AI Cloud and HPC Strategy
&lt;/h2&gt;

&lt;p&gt;As the AI Cloud, GPU-as-a-service, and HPC landscape evolves, storage is no longer a background detail—it is the core differentiator for speed, scale, and future innovation. Vendor neutrality empowers you to architect best-of-breed systems, leveraging open-source foundations and integrating commercial solutions where they fit your needs. Every cloud or on-prem cluster will benefit from storage designed for AI and HPC, not just traditional workloads.&lt;/p&gt;

&lt;p&gt;Ready for the next step? If you want to explore options, benchmark solutions, or design an optimized AI/HPC cloud, &lt;a href="https://cal.com/cloudraft/consulting" rel="noopener noreferrer"&gt;book a meeting&lt;/a&gt; with the CloudRaft team. Our experts bring hands-on experience from enterprise projects, migration strategies, and multi-vendor deployments, helping you maximize both infrastructure and business outcomes. Read more about our &lt;a href="https://www.cloudraft.io/ai-cloud-consulting" rel="noopener noreferrer"&gt;offering&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aicloud</category>
      <category>ai</category>
      <category>infrastructure</category>
      <category>storage</category>
    </item>
    <item>
      <title>Expert Guide on Selecting Observability Products</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 13 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</link>
      <guid>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</guid>
      <description>&lt;h2&gt;
  
  
  A Guide to Selecting Observability Tools and Products
&lt;/h2&gt;

&lt;p&gt;In today's digital landscape, businesses are constantly striving to stay ahead of the curve. The ability to deliver exceptional customer experiences, maintain system reliability, and optimize performance has become a crucial differentiator. Enter observability – the linchpin of modern IT operations that empowers organizations to achieve operational excellence, drive cost-efficiency, and continuously enhance their services.&lt;/p&gt;

&lt;p&gt;The rise of cloud-native architectures has revolutionized the way applications are built and deployed. These modern systems leverage dynamic, virtualized infrastructure to provide unparalleled flexibility and automation. By enabling on-demand scaling and global accessibility, cloud-native approaches have become a catalyst for innovation and agility in the business world.&lt;/p&gt;

&lt;p&gt;However, this shift brings new challenges. Unlike traditional monolithic systems, cloud-native applications are composed of numerous microservices distributed across various teams, platforms, and geographic locations. This decentralized nature makes it increasingly complex to monitor and maintain system health effectively.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the essential characteristics of a robust observability solution and provide guidance on selecting the right tools to meet your organization's unique needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evolution in Observability Space
&lt;/h2&gt;

&lt;p&gt;The evolution of observability over the last two decades has been characterized by significant technological advancements and changing industry needs. Let's explore this journey in more detail:&lt;/p&gt;

&lt;p&gt;In the early 2000s, observability faced its first major challenge with the explosion of log data. Organizations struggled with a lack of comprehensive solutions for instrumenting, generating, collecting, and visualizing this information. This gap in the market led to the rise of Splunk, which quickly became a dominant player by offering robust log management capabilities. As the decade progressed, the rapid growth of internet-based services and distributed systems introduced new complexities. This shift necessitated more sophisticated Application Performance Management (APM) solutions, paving the way for industry leaders like DynaTrace, New Relic, and AppDynamics to emerge and address these evolving needs.&lt;/p&gt;

&lt;p&gt;The dawn of the 2010s brought about a paradigm shift with the advent of microservices architecture and cloud computing. These technologies dramatically increased the complexity of IT environments, creating a demand for observability solutions that prioritized developer experience. This wave saw the birth of innovative platforms such as DataDog, Grafana, Sentry, and &lt;a href="https://www.cloudraft.io/prometheus-consulting" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, each offering unique approaches to monitoring and visualizing system performance. As we moved into the latter half of the decade, the industry faced a new challenge: skyrocketing observability costs due to the massive ingestion of Metrics, Events, Logs, and Traces (MELT). While monitoring capabilities had greatly improved, debugging remained a largely manual and time-consuming process, especially in the face of increasingly complex Kubernetes and serverless architectures. Some products like Datadog, Grafana, SigNoz, &lt;a href="https://www.cloudraft.io/blog/cloudraft-kloudmate-partnership" rel="noopener noreferrer"&gt;KloudMate&lt;/a&gt;, Honeycomb, Kloudfuse, &lt;a href="https://www.cloudraft.io/thanos-support" rel="noopener noreferrer"&gt;Thanos&lt;/a&gt;, Coroot, and VictoriaMetrics tackled these new challenges head-on.&lt;/p&gt;

&lt;p&gt;The early to mid-2020s have ushered in a new era of observability, characterized by innovative approaches to data storage and analysis. Industry standards like OpenTelemetry have gained widespread adoption, and products are now aligning with this standard. To optimize costs, observability pipelines are being used to filter and route data to various backends, automatically handling high cardinality data that was often a pain point at scale. We've also seen the adoption of high-performance databases like &lt;a href="https://www.cloudraft.io/clickhouse-consulting" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt; for &lt;a href="https://clickhouse.com/use-cases/logging-and-metrics" rel="noopener noreferrer"&gt;monitoring purposes&lt;/a&gt;, often becoming the backend of choice for observability products. The emergence of eBPF technology has provided deep insights into system performance and inter-entity relationships. Due to the increased adoption of the Rust programming language for its high performance, some observability tools such as Vector and various agents have become lightweight and more efficient, allowing for further scalability. Products like Quickwit (&lt;a href="https://quickwit.io/blog/quickwit-binance-story" rel="noopener noreferrer"&gt;see how Binance is storing 100PB logs&lt;/a&gt;) have introduced cost-effective and scalable solutions for storing logs and metrics directly on object storage. Perhaps most significantly, we're witnessing the integration of artificial intelligence into observability tools, enabling causal analysis and faster problem resolution. This AI-driven approach is helping organizations quickly narrow down issues in their increasingly complex environments, marking a new frontier in the observability landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systems are getting Complex
&lt;/h2&gt;

&lt;p&gt;In the realm of modern, distributed systems, traditional monitoring approaches fall short. These conventional methods rely on predetermined failure scenarios, which prove inadequate when dealing with the intricate, interconnected nature of today's cloud-based architectures. The unpredictability of these complex systems demands a more sophisticated approach to observability.&lt;/p&gt;

&lt;p&gt;Enter the new generation of cloud monitoring tools. These advanced solutions are designed to navigate the labyrinth of distributed systems, drawing connections between seemingly disparate data points without the need for explicit configuration. Their power lies in their ability to uncover hidden issues and correlate information across various contexts, providing a holistic view of system health.&lt;/p&gt;

&lt;p&gt;Consider this scenario: a user reports an error in a mobile application. In a world of microservices, pinpointing the root cause can be like finding a needle in a haystack. However, with these cutting-edge monitoring tools, engineers can swiftly trace the issue back to its origin, even if it's buried deep within one of countless backend services. This capability not only accelerates root cause analysis but also significantly reduces mean time to resolution (MTTR).&lt;/p&gt;

&lt;p&gt;But the benefits don't stop at troubleshooting. These tools can play a crucial role in refining deployment strategies. By providing real-time feedback on new rollouts, they enable more sophisticated deployment techniques such as canary releases or blue-green deployments. This proactive approach allows for automatic rollbacks of problematic changes, mitigating potential issues before they impact end-users.&lt;/p&gt;

&lt;p&gt;As the cloud-native landscape continues to evolve, selecting the right monitoring stack becomes paramount. To maximize the benefits of modern observability, it's crucial to choose a solution that not only meets your current needs but also aligns with your future goals and the ever-changing demands of cloud-based architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Essential Features of Robust Observability Solutions
&lt;/h2&gt;

&lt;p&gt;In today's complex digital landscapes, selecting the right observability tools is crucial. Let's explore the key attributes that make an observability solution truly effective and aligned with observability best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Holistic Monitoring Capabilities
&lt;/h3&gt;

&lt;p&gt;A comprehensive observability platform should adeptly handle the four pillars of telemetry data, collectively known as MELT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: Quantitative indicators of system health, such as CPU utilization&lt;/li&gt;
&lt;li&gt;Events: Significant system occurrences or state changes&lt;/li&gt;
&lt;li&gt;Logs: Detailed records of system activities and operations&lt;/li&gt;
&lt;li&gt;Traces: Request pathways through the system, illuminating performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ideal solution seamlessly integrates these data types, providing a cohesive view of your system's health.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Data Analysis and Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Modern systems often exhibit unpredictable behavior patterns, rendering static alert thresholds ineffective. Advanced observability tools employ machine learning to detect anomalies without explicit configuration, while still allowing for customization. By correlating anomalies across various telemetry types, these systems can perform automated root cause analysis, significantly reducing troubleshooting time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sophisticated Alerting and Incident Management
&lt;/h3&gt;

&lt;p&gt;Real-time alerting is the backbone of effective observability. A top-tier solution should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert on both customizable thresholds and AI-detected anomalies&lt;/li&gt;
&lt;li&gt;Consolidate related alerts into actionable incidents&lt;/li&gt;
&lt;li&gt;Enrich incidents with contextual data, runbooks, and team information&lt;/li&gt;
&lt;li&gt;Intelligently route incidents to appropriate personnel&lt;/li&gt;
&lt;li&gt;Trigger automated remediation workflows when applicable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To combat alert fatigue, the system should employ intelligent alert suppression, prioritization, and escalation mechanisms.&lt;/p&gt;
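&lt;p&gt;To make the consolidation behavior concrete, here is a minimal sketch. The fingerprint scheme (e.g. service plus symptom) and the five-minute window are assumptions for illustration, not any specific product's logic:&lt;/p&gt;

```python
def consolidate(alerts, window_s=300):
    """Collapse raw alerts into incidents: alerts sharing a fingerprint
    within window_s of the incident's last alert join that incident."""
    incidents, open_by_fp = [], {}
    for ts, fp, msg in sorted(alerts):
        inc = open_by_fp.get(fp)
        if inc and ts - inc["last"] <= window_s:
            inc["last"] = ts
            inc["messages"].append(msg)   # suppressed: folded into open incident
        else:
            inc = {"fingerprint": fp, "first": ts, "last": ts, "messages": [msg]}
            incidents.append(inc)         # new actionable incident
            open_by_fp[fp] = inc
    return incidents
```

&lt;p&gt;Real platforms layer prioritization, escalation, and enrichment on top, but grouping by fingerprint and time window is the core mechanism that turns an alert storm into a handful of incidents.&lt;/p&gt;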

&lt;h3&gt;
  
  
  Data-Driven Insights
&lt;/h3&gt;

&lt;p&gt;Analytics derived from telemetry data drive continuous improvement. Key metrics to track include Mean Time to Repair (MTTR), Mean Time to Acknowledge (MTTA), and various Service Level Objectives (SLOs). These insights facilitate post-incident analysis, helping teams prevent future issues and optimize system performance.&lt;/p&gt;
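&lt;p&gt;These metrics are simple to compute once incident timestamps are captured. A minimal sketch, assuming each incident is recorded as (opened, acknowledged, resolved) epoch seconds:&lt;/p&gt;

```python
from statistics import mean

def mtta_minutes(incidents):
    """Mean Time to Acknowledge: average of (acknowledged - opened), in minutes."""
    return mean(ack - opened for opened, ack, resolved in incidents) / 60

def mttr_minutes(incidents):
    """Mean Time to Repair: average of (resolved - opened), in minutes."""
    return mean(resolved - opened for opened, ack, resolved in incidents) / 60
```

&lt;p&gt;Tracking these week over week is what turns raw telemetry into a feedback loop: a rising MTTA points at routing or on-call gaps, a rising MTTR at diagnosis or tooling gaps.&lt;/p&gt;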

&lt;h3&gt;
  
  
  Extensive Integration Ecosystem
&lt;/h3&gt;

&lt;p&gt;A versatile observability solution should seamlessly integrate with your entire tech stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular programming languages and frameworks&lt;/li&gt;
&lt;li&gt;Open-source standards (OpenTelemetry, OpenMetrics, StatsD)&lt;/li&gt;
&lt;li&gt;Container orchestration platforms (Docker, Kubernetes)&lt;/li&gt;
&lt;li&gt;Security tools for vulnerability scanning&lt;/li&gt;
&lt;li&gt;Incident management systems&lt;/li&gt;
&lt;li&gt;CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Major cloud platforms&lt;/li&gt;
&lt;li&gt;Team collaboration tools&lt;/li&gt;
&lt;li&gt;Business intelligence platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Optimization
&lt;/h3&gt;

&lt;p&gt;As applications grow in scale and complexity, managing observability costs becomes challenging. Look for tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify underutilized resources and forecast future needs&lt;/li&gt;
&lt;li&gt;Employ intelligent data sampling and retention policies&lt;/li&gt;
&lt;li&gt;Efficiently handle high-cardinality data&lt;/li&gt;
&lt;li&gt;Utilize cutting-edge technologies like eBPF for improved performance&lt;/li&gt;
&lt;/ul&gt;
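&lt;p&gt;Intelligent sampling often starts with deterministic head sampling, where every span of a trace gets the same keep/drop decision. A minimal sketch; the hash choice is an illustrative assumption, not a particular vendor's algorithm:&lt;/p&gt;

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head sampling: hash the trace id into [0, 1) so all
    spans and all collectors make the same keep/drop decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < sample_rate
```

&lt;p&gt;Because the decision is a pure function of the trace id, sampled traces stay complete end to end, which is what makes the retained data useful for debugging rather than a random shred of spans.&lt;/p&gt;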

&lt;h3&gt;
  
  
  Intuitive User Experience
&lt;/h3&gt;

&lt;p&gt;An observability platform's UI/UX is critical for efficient debugging and insight gathering. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear visualizations of system components and their relationships&lt;/li&gt;
&lt;li&gt;Pre-configured dashboards for common scenarios&lt;/li&gt;
&lt;li&gt;Easy integration with your existing stack&lt;/li&gt;
&lt;li&gt;Comprehensive, user-friendly documentation&lt;/li&gt;
&lt;li&gt;Ability to slice and dice visualizations and fast response time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Simplicity
&lt;/h3&gt;

&lt;p&gt;Scaling observability across an organization can be daunting. Look for platforms that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support "everything-as-code" for standardization and version control&lt;/li&gt;
&lt;li&gt;Integrate smoothly with modern application platforms&lt;/li&gt;
&lt;li&gt;Offer automation-friendly interfaces&lt;/li&gt;
&lt;li&gt;Provide tools for managing observability at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost-Effective Data Management
&lt;/h3&gt;

&lt;p&gt;As data volumes grow, intelligent data lifecycle management becomes crucial. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tiered storage for different data types&lt;/li&gt;
&lt;li&gt;Advanced compression and deduplication techniques&lt;/li&gt;
&lt;li&gt;Intelligent data sampling strategies&lt;/li&gt;
&lt;li&gt;Efficient handling of high-cardinality data&lt;/li&gt;
&lt;/ul&gt;
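&lt;p&gt;A tiering policy can be as simple as routing telemetry by age. A minimal sketch with illustrative thresholds (the 7-day and 90-day cutoffs are assumptions, not recommendations):&lt;/p&gt;

```python
def storage_tier(age_days: float) -> str:
    """Route telemetry by age: hot storage for recent debugging, warm
    object storage for trend queries, then expiry. Thresholds illustrative."""
    if age_days <= 7:
        return "hot"       # full resolution on fast storage
    if age_days <= 90:
        return "warm"      # compressed/downsampled on cheap object storage
    return "expired"       # aged out per retention policy
```

&lt;p&gt;The right cutoffs depend on how far back your debugging and compliance queries actually reach; measuring that query distribution first keeps the policy honest.&lt;/p&gt;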

&lt;h3&gt;
  
  
  Alignment with Industry Standards
&lt;/h3&gt;

&lt;p&gt;Choosing tools that support industry-standard protocols and frameworks (like OpenTelemetry, PromQL, and Grafana) ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier integration with existing systems&lt;/li&gt;
&lt;li&gt;Vendor-independent implementations&lt;/li&gt;
&lt;li&gt;Flexibility to change backends without code modifications&lt;/li&gt;
&lt;/ul&gt;
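&lt;p&gt;To illustrate the backend flexibility, here is a minimal OpenTelemetry Collector configuration sketch (the endpoint is a placeholder): applications emit OTLP, and swapping vendors means editing only the exporter section, with no application code changes.&lt;/p&gt;

```yaml
# Illustrative OpenTelemetry Collector config; the endpoint is a placeholder.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.com/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```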

&lt;h3&gt;
  
  
  Organizational Fit
&lt;/h3&gt;

&lt;p&gt;When selecting an observability solution, consider your organization's unique needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System complexity and scale&lt;/li&gt;
&lt;li&gt;User base characteristics&lt;/li&gt;
&lt;li&gt;Budget constraints&lt;/li&gt;
&lt;li&gt;Team skills and expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prioritize platforms that cover your full stack, tying surface-level symptoms to root causes. Ensure the chosen solution integrates seamlessly with your current tech stack, DevSecOps processes, and team workflows. The ideal observability solution balances comprehensive insight with practical constraints. Ideally, you want one tool, or a small set of tools, effective enough to justify its cost and few enough to avoid constant context switching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Selecting the ideal observability solution is a nuanced process that demands a deep understanding of your organization's unique ecosystem. It's not just about collecting data; it's about gaining actionable insights that drive meaningful improvements in your systems and processes.&lt;/p&gt;

&lt;p&gt;The journey to effective observability requires a careful balance between comprehensive coverage and practical implementation. Your chosen solution should seamlessly integrate with your existing tech stack, enhancing rather than disrupting your current workflows. It's crucial to find a tool that not only provides rich, full-stack visibility but also aligns with your team's skills, your budget constraints, and your overall operational goals.&lt;/p&gt;

&lt;p&gt;Remember, observability is a double-edged sword. When implemented effectively, it can provide unprecedented insights into your systems, enabling proactive problem-solving and continuous improvement. However, if not approached thoughtfully, it can lead to unnecessary complexity, spiraling costs, and a false sense of security. The risk of "running half blind" with suboptimal observability practices is real and can have significant implications for your operations and bottom line.&lt;/p&gt;

&lt;p&gt;In this complex landscape, partnering with experts can make all the difference. CloudRaft, with &lt;a href="https://www.cloudraft.io/observability-consulting" rel="noopener noreferrer"&gt;its deep expertise in observability&lt;/a&gt; and extensive partnerships in the field, stands ready to guide you through this journey. Our experience can help you rapidly adopt and optimize modern observability practices, ensuring you reap the full benefits of these powerful tools without falling into common pitfalls.&lt;/p&gt;

&lt;p&gt;By choosing the right observability solution and implementation approach, you're not just collecting data – you're empowering your team with the insights they need to drive innovation, enhance performance, and deliver exceptional user experiences. In today's fast-paced digital environment, that's not just an advantage – it's a necessity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anjul Sahu&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/anjul" rel="noopener noreferrer"&gt;Anjul&lt;/a&gt; is a leading expert and thought leader in observability. Over the last decade and a half, he has seen every wave of how observability and monitoring have evolved in large-scale organizations such as telcos, banks, and Internet startups. He also advises investors and product companies on current trends in observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Madhukar Mishra&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/madhukar-mishra-b55593b8/" rel="noopener noreferrer"&gt;Madhukar&lt;/a&gt; has over a decade of experience, including building up the platform of a leading e-commerce company in India into one that delivers Internet-scale products. He is interested in large-scale distributed systems and is a thought leader in developer productivity and SRE.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>cloudraft</category>
      <category>opentelemetry</category>
      <category>thanos</category>
    </item>
    <item>
      <title>Secure Coding Best Practices</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 17 Jun 2023 13:19:12 +0000</pubDate>
      <link>https://forem.com/cloudraft/secure-coding-best-practices-2c62</link>
      <guid>https://forem.com/cloudraft/secure-coding-best-practices-2c62</guid>
      <description>&lt;p&gt;Every single day, an extensive array of fresh software vulnerabilities is unearthed by diligent security researchers and analysts. A considerable portion of these vulnerabilities emerges due to the absence of secure coding practices. Exploiting such vulnerabilities can have severe consequences, as they possess the potential to severely impair the financial or physical assets of a business, erode trust, or disrupt critical services.&lt;/p&gt;

&lt;p&gt;For organisations reliant on their software for their operations, it becomes imperative for software developers to embrace secure coding practices. Secure coding entails a collection of practices that software developers adopt to fortify their code against cyberattacks and vulnerabilities. By adhering to coding standards that embody best practices, developers can incorporate safeguards that minimise the risks posed by vulnerabilities in their code.&lt;/p&gt;

&lt;p&gt;In a world brimming with cyber threats, secure coding cannot be viewed as optional if a business intends to maintain its shield of protection.&lt;/p&gt;

&lt;p&gt;In this article, we will explore some anti-patterns and best practices you can include in your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns
&lt;/h2&gt;

&lt;p&gt;Now, let's briefly discuss some common mistakes, or anti-patterns, that lead to insecure code. The following are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insufficient validation of input data or processing inputs without proper encoding or sanitisation.&lt;/li&gt;
&lt;li&gt;Constructing SQL queries by concatenating strings, making the code vulnerable to data leaks or injection attacks.&lt;/li&gt;
&lt;li&gt;Failure to implement robust authentication, such as storing credentials in plain text without proper hashing and encryption.&lt;/li&gt;
&lt;li&gt;Poor design of password recovery mechanisms and infrequent rotation of security keys.&lt;/li&gt;
&lt;li&gt;Software planning and design lacking strong authorisation schemes.&lt;/li&gt;
&lt;li&gt;Granting excessive privileges during development or troubleshooting.&lt;/li&gt;
&lt;li&gt;Exposing sensitive information in debug logging without appropriate redaction.&lt;/li&gt;
&lt;li&gt;Utilising third-party libraries from untrusted sources or neglecting security checks.&lt;/li&gt;
&lt;li&gt;Unsafe handling of memory pointers or allowing pointer access beyond system boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these common mistakes in mind, let's explore practices and tools that can guide developers towards secure coding practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure Coding Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Shift left in software development lifecycle
&lt;/h3&gt;

&lt;p&gt;Historically, the conventional practice involved assigning the software security team to conduct security testing towards the conclusion of a software development project. The team would assess the application and compile a list of issues that require resolution. At this stage, the identified fixes would be prioritised, resulting in some vulnerabilities being addressed while others remained unattended. The reasons for leaving certain vulnerabilities unresolved could range from cost constraints and limited resources to pressing business priorities.&lt;/p&gt;

&lt;p&gt;However, this traditional approach is no longer sustainable. Security considerations must now be incorporated right from the outset—the initial stages—of the software development lifecycle. Security should be taken into account during the design phase itself. Both manual and automated testing should be conducted throughout the application's implementation as part of the Continuous Integration (CI) pipeline, ensuring that developers receive prompt feedback.&lt;/p&gt;

&lt;p&gt;To aid in this endeavour, the utilisation of static code analysis becomes invaluable. This technique enables the scanning of code for security flaws and risks, even while developers are actively writing it within an integrated development environment (IDE). For instance, static application security testing (SAST) tools can analyse the code for security vulnerabilities during the development process, facilitating early identification and mitigation of potential risks.&lt;/p&gt;
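&lt;p&gt;As a sketch of what shift-left looks like in a pipeline, here is an illustrative GitHub Actions job (the tool choice and names are examples, not a product recommendation) that runs a SAST scan on every pull request so developers get feedback before merge:&lt;/p&gt;

```yaml
# Illustrative CI job: fail the pull request if the SAST scan finds issues.
name: sast
on: [pull_request]
jobs:
  semgrep:
    runs-on: ubuntu-latest
    container: semgrep/semgrep
    steps:
      - uses: actions/checkout@v4
      - run: semgrep scan --config auto --error
```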

&lt;h3&gt;
  
  
  Input validation
&lt;/h3&gt;

&lt;p&gt;Ensuring the integrity of input data as it enters a system holds great significance. It is essential to validate the syntactic and semantic accuracy of all incoming data, considering it as untrusted. Employing checks and regular expressions aids in verifying the correctness, size, and syntax of the input.&lt;/p&gt;

&lt;p&gt;Performing these validations on the server side is highly recommended. In the case of web applications, it involves scrutinising various components, including HTTP headers, cookies, GET and POST parameters, as well as file uploads.&lt;/p&gt;

&lt;p&gt;Client-side validation also proves beneficial, contributing to an enhanced user experience by reducing the need for multiple network requests resulting from invalid inputs. This approach minimises back-and-forth communication and enhances efficiency.&lt;/p&gt;
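&lt;p&gt;A minimal server-side validation sketch in Python (the username rule here is an example, not a standard): treat all input as untrusted and check its shape before using it.&lt;/p&gt;

```python
import re

# Example allow-list rule: lowercase letter first, then 2-31 more
# characters from a small safe alphabet.
USERNAME_RE = re.compile(r"^[a-z][a-z0-9_]{2,31}$")

def validate_username(raw: str) -> str:
    """Reject anything that does not match the expected syntax."""
    if not isinstance(raw, str) or not USERNAME_RE.fullmatch(raw):
        raise ValueError("invalid username")
    return raw

print(validate_username("anjul_s"))    # passes the syntactic check
# validate_username("a; DROP TABLE")   # would raise ValueError
```

&lt;p&gt;Allow-listing what is valid is generally safer than block-listing known-bad patterns, which attackers can often work around.&lt;/p&gt;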

&lt;h3&gt;
  
  
  Parameterised queries
&lt;/h3&gt;

&lt;p&gt;During the process of storing and retrieving data, developers frequently engage with datastores. However, if they overlook the utilisation of parametrised queries, it can expose an opportunity for attackers to exploit widely accessible tools and manipulate inputs to extract sensitive information. SQL injection, a highly perilous application risk, exemplifies a common form of such attacks.&lt;/p&gt;

&lt;p&gt;By incorporating placeholders for parameters within the query, the specified parameters are treated as data rather than being considered as part of the SQL command itself. To mitigate these vulnerabilities, it is recommended to employ prepared statements or object-relational mapping (ORM) techniques. These approaches offer effective measures to safeguard against SQL injection and related threats.&lt;/p&gt;
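&lt;p&gt;Here is a small sketch using Python's stdlib sqlite3 driver (table and values are made up): the placeholder binds user input as data, so it is never parsed as SQL.&lt;/p&gt;

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "a@example.com"))

# Hostile input is inert: it is compared as a literal string, not executed.
user_input = "alice'; DROP TABLE users; --"
rows = conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- no match, and the table is untouched
```

&lt;p&gt;The same pattern applies with placeholders in other drivers and with ORMs, which generate parameterised statements for you.&lt;/p&gt;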

&lt;h3&gt;
  
  
  Encoding data
&lt;/h3&gt;

&lt;p&gt;Encoding data plays a vital role in mitigating threats by transforming potentially hazardous special characters into an inert, sanitised form. Output encoding such as HTML entity encoding, for example, offers protection against cross-site scripting (XSS) and other client-side injection attacks, while parameterisation guards against SQL injection.&lt;/p&gt;

&lt;p&gt;To enhance security, it is crucial to specify appropriate character sets, such as UTF-8, and encode data into a standardised character set before further processing. Additionally, employing canonicalisation techniques proves beneficial. For instance, simplifying characters to their basic form helps address issues such as double encoding and obfuscation attacks, thereby bolstering overall security measures.&lt;/p&gt;
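&lt;p&gt;Both ideas can be sketched with the Python stdlib: entity-encode untrusted text before rendering it, and canonicalise by decoding repeatedly until the value is stable, so double-encoded payloads cannot sneak past a single decode. (The decode limit below is an arbitrary illustrative choice.)&lt;/p&gt;

```python
import html
import urllib.parse

# Output encoding: render untrusted text inertly inside HTML.
payload = '<script>alert("xss")</script>'
print(html.escape(payload))
# &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;

def canonicalize(value: str, max_rounds: int = 5) -> str:
    """Percent-decode until the value stops changing, defeating
    double-encoding tricks; give up on absurdly layered input."""
    for _ in range(max_rounds):
        decoded = urllib.parse.unquote(value)
        if decoded == value:
            return value
        value = decoded
    raise ValueError("too many encoding layers")

print(canonicalize("%253Cscript%253E"))  # <script>
```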

&lt;h3&gt;
  
  
  Implement identity and authentication controls
&lt;/h3&gt;

&lt;p&gt;To further enhance security and minimise the risk of breaches, secure coding practices emphasise the importance of verifying a user's identity at the outset and integrating robust authentication controls into the application's code.&lt;/p&gt;

&lt;p&gt;Here are some recommended measures to achieve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employ strong authentication methods, such as multi-factor authentication, to add an additional layer of security.&lt;/li&gt;
&lt;li&gt;Consider incorporating biometric authentication methods, such as fingerprint or facial recognition, especially in mobile applications.&lt;/li&gt;
&lt;li&gt;Ensure secure storage of passwords. Typically, this involves hashing each password with a strong, salted hashing function and storing only the resulting hash in the database.&lt;/li&gt;
&lt;li&gt;Implement a secure password recovery mechanism to facilitate password resets while maintaining security.&lt;/li&gt;
&lt;li&gt;Enable session timeouts and inactivity periods to automatically terminate idle sessions.&lt;/li&gt;
&lt;li&gt;For sensitive operations like modifying account information, enforce re-authentication to validate the user's identity.&lt;/li&gt;
&lt;li&gt;Conduct regular audits of authentication transactions to detect any suspicious activities and maintain a vigilant stance against potential threats.&lt;/li&gt;
&lt;/ul&gt;
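&lt;p&gt;The password-storage point can be sketched with the Python stdlib; this is a minimal illustration (a dedicated password KDF such as bcrypt or Argon2 is generally preferable in production, and the iteration count should follow current guidance):&lt;/p&gt;

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return a random per-user salt and the PBKDF2-SHA256 hash.
    Only (salt, digest) are stored; the password itself never is."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

salt, digest = hash_password("s3cret!")
print(verify_password("s3cret!", salt, digest))  # True
print(verify_password("wrong", salt, digest))    # False
```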

&lt;h3&gt;
  
  
  Implement access controls
&lt;/h3&gt;

&lt;p&gt;Incorporating a well-thought-out authorisation strategy during the initial stages of application development can greatly enhance the overall security posture. Authorisation entails determining the specific resources that an authenticated user can or cannot access.&lt;/p&gt;

&lt;p&gt;Consider the following guidelines to strengthen the authorisation framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish a sequential flow of authentication followed by authorisation. Implement a mechanism where all requests undergo access control checks.&lt;/li&gt;
&lt;li&gt;Adhere to the principle of least privilege, initially denying access to any resource that has not been explicitly configured for access control.&lt;/li&gt;
&lt;li&gt;Enforce time-based limitations on user or system component actions by implementing expiration times, thereby ensuring that actions have defined timeframes for execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these practices, developers can create a robust and effective authorisation system that bolsters the overall security of the application.&lt;/p&gt;
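&lt;p&gt;The deny-by-default principle can be sketched in a few lines (roles and resources below are made-up examples): a request is allowed only if an explicit rule grants the role access to the resource, and everything not configured fails closed.&lt;/p&gt;

```python
# Explicit grants; anything absent from this table is denied.
ACCESS_RULES = {
    ("admin", "billing"): True,
    ("analyst", "reports"): True,
}

def is_allowed(role: str, resource: str) -> bool:
    """Least privilege: the default for an unconfigured pair is deny."""
    return ACCESS_RULES.get((role, resource), False)

print(is_allowed("analyst", "reports"))  # True
print(is_allowed("analyst", "billing"))  # False -- never configured
```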

&lt;h3&gt;
  
  
  Protect sensitive data
&lt;/h3&gt;

&lt;p&gt;In order to comply with legal and regulatory obligations, it is the responsibility of businesses to safeguard customer data. This sensitive data encompasses various categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personally identifiable information (PII)&lt;/li&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Health records&lt;/li&gt;
&lt;li&gt;Web browser data&lt;/li&gt;
&lt;li&gt;Mobile data, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent data leakage, it is crucial to employ robust encryption methods for both data at rest and data in transit. Consider the following practices to enhance data protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilise a well-established, peer-reviewed cryptographic library and functions that have been vetted and approved by your security team.&lt;/li&gt;
&lt;li&gt;Avoid storing encryption keys alongside the encrypted data to prevent unauthorised access.&lt;/li&gt;
&lt;li&gt;Refrain from storing confidential or sensitive data in memory, temporary locations, or log files during processing.&lt;/li&gt;
&lt;li&gt;Implement redaction techniques in log forwarders to remove sensitive information.&lt;/li&gt;
&lt;li&gt;Implement mandatory re-authentication when accessing sensitive data within the application.&lt;/li&gt;
&lt;/ul&gt;
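&lt;p&gt;As an illustration of the redaction point, here is a small sketch that masks obvious secrets before a log line leaves the process; the patterns are examples only and are nowhere near exhaustive:&lt;/p&gt;

```python
import re

# Illustrative patterns: a bare 16-digit card number and an email address.
PATTERNS = [
    (re.compile(r"\b\d{16}\b"), "[CARD REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL REDACTED]"),
]

def redact(line: str) -> str:
    """Apply every redaction pattern to the line before it is emitted."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line

print(redact("payment by a@example.com with 4111111111111111"))
# payment by [EMAIL REDACTED] with [CARD REDACTED]
```

&lt;p&gt;In practice this logic usually lives in the log forwarder or logging library filter, so individual services cannot forget to apply it.&lt;/p&gt;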

&lt;h3&gt;
  
  
  Implement logging and intrusion detection
&lt;/h3&gt;

&lt;p&gt;Even the most meticulously designed system can be susceptible to exploitation by attackers. Therefore, it is advisable to incorporate a monitoring system that can detect and identify unusual events. It is crucial to ensure that sufficient information is logged concerning authentication, authorisation, and resource access events. This logging should include details such as timestamps, the origin of access requests, IP addresses, and information pertaining to the requested resource. It is important to store this information in a secure and protected log. Typically, these logs are transmitted in real time to a centralised system where they are analysed for any anomalies. Prior to logging, apply encoding techniques to the untrusted data to safeguard against log injection attacks.&lt;/p&gt;
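&lt;p&gt;A minimal sketch of that last point: neutralise CR/LF characters in untrusted input before logging it, otherwise an attacker can forge extra, legitimate-looking log lines.&lt;/p&gt;

```python
def safe_for_log(value: str) -> str:
    """Escape newline characters so untrusted input stays on one log line."""
    return value.replace("\r", "\\r").replace("\n", "\\n")

# Without encoding, this input would inject a fake "admin login ok" entry.
attacker = "alice\n2026-02-11 INFO admin login ok"
print(f"login failed for user={safe_for_log(attacker)}")
# The whole value stays on a single, clearly attributable line.
```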

&lt;p&gt;In the event of a security breach, it is essential to have a well-documented playbook in place to promptly terminate system access, mitigating the risk of further data leakage. By following these practices, organisations can enhance their ability to detect and respond to potential intrusions, minimising the impact of security incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage security frameworks and libraries
&lt;/h3&gt;

&lt;p&gt;Avoid unnecessary duplication of effort. Instead, leverage established security frameworks and libraries that have been proven effective. When incorporating such components into your project, ensure they are sourced from reliable and trusted third-party repositories. It is important to regularly assess these libraries for any vulnerabilities or weaknesses and proactively keep them up to date.&lt;/p&gt;

&lt;p&gt;By adopting this approach, you can benefit from the expertise and experience embedded in these established security solutions, saving valuable time and effort while maintaining a strong security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor error and exception handling
&lt;/h3&gt;

&lt;p&gt;In line with the best practices of logging, it is advisable to adopt a centralised approach for handling and monitoring errors and exceptions with tools like Sentry. Effective management of errors and exceptions is crucial as mishandling them can inadvertently expose valuable information to potential attackers, enabling them to gain insights into your application and platform design.&lt;/p&gt;

&lt;p&gt;Consider the following measures to strengthen error and exception handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid logging sensitive information within error messages to prevent inadvertent disclosure.&lt;/li&gt;
&lt;li&gt;Regularly conduct code reviews to identify and address any weaknesses or vulnerabilities in the error handling implementation.&lt;/li&gt;
&lt;li&gt;Utilise negative testing techniques, such as exploratory and penetration testing, fuzzing, and fault injection, to actively identify and rectify potential issues related to error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these practices, you can ensure that error and exception handling is performed securely and with minimal risk of exposing sensitive information to potential attackers.&lt;/p&gt;
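&lt;p&gt;A common shape for this, sketched below with made-up handler names: log the full exception internally, but return only a generic message plus a correlation id to the caller, so stack traces and platform details never reach a potential attacker.&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("payments")

def handle_request(request_id: str, payload: dict) -> dict:
    try:
        amount = int(payload["amount"])  # may raise KeyError/ValueError
        return {"status": "ok", "amount": amount}
    except Exception:
        # Full detail (traceback included) goes to internal logs only.
        log.exception("request %s failed", request_id)
        # The caller sees a generic message and an id for support lookups.
        return {"status": "error", "message": "internal error", "id": request_id}

print(handle_request("req-42", {}))  # generic error, no stack trace leaked
```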

&lt;h2&gt;
  
  
  Benefits of implementing secure coding practices
&lt;/h2&gt;

&lt;p&gt;At this point, the advantages of embracing secure coding practices should be evident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorporating automated checks and code analysis during the development process enhances developer productivity by promptly providing feedback to improve code security. This leads to quicker time-to-market and higher-quality code.&lt;/li&gt;
&lt;li&gt;Cost optimisation within the software development lifecycle is achieved by minimising bugs at the early stages.&lt;/li&gt;
&lt;li&gt;Static application security testing (SAST) tools offer developers of all skill levels guardrails, AppSec governance, and valuable insights through IDE plugins. These tools equip developers with the necessary knowledge and resources to bolster application security.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Throughout our examination of coding flaws that can result in vulnerabilities, we have also explored best practices to enhance the security stance of software. However, in the context of large-scale projects, it can be daunting to implement these practices while ensuring proper governance.&lt;/p&gt;

&lt;p&gt;In the realm of extensive projects, the following considerations can help navigate these challenges effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish clear governance frameworks that outline security requirements, procedures, and responsibilities.&lt;/li&gt;
&lt;li&gt;Develop comprehensive guidelines and standards that align with secure coding practices and provide actionable steps for implementation.&lt;/li&gt;
&lt;li&gt;Foster collaboration and communication among development teams, security experts, and stakeholders to ensure a shared understanding of security goals and the necessary measures to achieve them.&lt;/li&gt;
&lt;li&gt;Prioritise the implementation of security measures by identifying high-risk areas and focusing resources accordingly.&lt;/li&gt;
&lt;li&gt;Regularly assess and review the security posture of the software throughout the development lifecycle, enabling continuous improvement and adjustments as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By adopting these approaches, the process of implementing secure coding practices within large projects becomes more manageable and ensures that proper governance is in place to safeguard against vulnerabilities effectively.&lt;/p&gt;

&lt;p&gt;It is advisable to create and automate workflows using SAST tools and integrate them into CI to enforce these best practices. Feel free to schedule a no-obligation &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;call with us&lt;/a&gt; to discuss DevSecOps strategy; we can help you improve your current practice.&lt;/p&gt;

</description>
      <category>devsecops</category>
      <category>security</category>
      <category>consulting</category>
    </item>
    <item>
      <title>DevOps Roadmap 2022</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Mon, 21 Feb 2022 19:20:28 +0000</pubDate>
      <link>https://forem.com/anjuls/devops-roadmap-2022-38mn</link>
      <guid>https://forem.com/anjuls/devops-roadmap-2022-38mn</guid>
      <description>&lt;p&gt;In the last few weeks, I met some folks in my &lt;a href="https://anjul.dev/mentoring" rel="noopener noreferrer"&gt;mentoring sessions&lt;/a&gt;, who are new to DevOps or in the mid of their career, were interested in knowing what to learn in 2022. DevOps skills are high in demand and there is constant learning required to keep yourself in sync with market demand.&lt;/p&gt;

&lt;p&gt;This post is to share the notes that can help you. Let's see some guidance based on my experience and understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Be fundamentally strong in networking technologies
&lt;/h3&gt;

&lt;p&gt;Understand concepts such as HTTP/2, QUIC (HTTP/3), Layer 4 and Layer 7 protocols, mTLS, proxies, DNS, BGP, how load balancing works, iptables, how the Internet works, IP addressing schemes, and lastly network design. I found &lt;a href="https://jvns.ca/" rel="noopener noreferrer"&gt;Julia Evans's&lt;/a&gt; blog very useful; it is my go-to place when I need to understand something in a simple way. She has covered a wide variety of topics in her blog posts and zines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Master the operating system fundamentals particularly Linux
&lt;/h3&gt;

&lt;p&gt;As most systems (VMs, containers, etc.) run Linux, it is important to know it from top to bottom. Learn scheduling, the systemd interface, the init system, cgroups and namespaces, and performance tuning, and master the command-line utilities: awk, sed, jq, yq, curl, ssh, openssl, and so on. Learn performance troubleshooting from &lt;a href="https://www.brendangregg.com/" rel="noopener noreferrer"&gt;Brendan Gregg's blog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD
&lt;/h3&gt;

&lt;p&gt;If you are still on Jenkins, that is fine, but the world has moved to cloud-native pipelines. Conceptually not much has changed in this space, but you can look into GitHub Actions, Tekton, and similar tools. Also learn how to do releases better: understand deployment strategies such as blue-green and canary. &lt;/p&gt;

&lt;h3&gt;
  
  
  Containerization and Virtualization
&lt;/h3&gt;

&lt;p&gt;Apart from the popular Docker runtime, try containerd, podman, and others. Learn how to containerise applications, how to implement &lt;a href="https://anjul.dev/blog/top-10-things-for-container-security/" rel="noopener noreferrer"&gt;container security&lt;/a&gt;, and how to run and orchestrate VMs on Kubernetes; see the KubeVirt project. &lt;/p&gt;

&lt;h3&gt;
  
  
  Container Orchestration
&lt;/h3&gt;

&lt;p&gt;Kubernetes is now the de facto standard for running containers, and there is a lot of content on the Internet to learn it. Focus on configuration best practices, application design, security, and scheduling. Setting up a cluster is becoming trivial, but the day-2 operational work, such as monitoring, logging, CI/CD, scaling the cluster, cost optimization, and security, is what you will be expected to solve. &lt;/p&gt;

&lt;h3&gt;
  
  
  Observability at Scale
&lt;/h3&gt;

&lt;p&gt;Most engineers are aware of the Prometheus and Grafana stack or something similar. Trends suggest that many organizations are consolidating their Kubernetes clusters and observability stacks, which helps from both the performance and the cost perspective. Learn about advanced Prometheus configurations and architectures and how to scale them. Look into technologies like Thanos, Cortex, VictoriaMetrics, Datadog, and Loki; continuous profiling tools such as Parca and Pyroscope; and distributed tracing with Hypertrace and OpenTelemetry. Service meshes such as Istio are a popular ingredient in cloud-native recipes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Platform team as a Product team
&lt;/h3&gt;

&lt;p&gt;The platform team is becoming more like a centralized product team, one that focuses on its internal customers: developers and testers. The goal is to improve ways of working and bring some order to the teams. Work on the problems the development and QA teams face. You are an enabler for other teams; instead of taking all the work into a central team, coach the dev teams to take up typical DevOps responsibilities. That way you can scale without burning yourself out. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g5yi6g1o4hjqa0i2486.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g5yi6g1o4hjqa0i2486.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;In many small organisations, security has been a second-class citizen, with product features given more priority. But due to increasingly sophisticated attacks and stricter compliance requirements, companies are adopting a shift-left security strategy. End-to-end encryption, strong RBAC, IAM policies, governance and auditing, and implementation of benchmarks and standards such as NIST, CIS, and ISO 27001 are common. Container security, policy as code, cloud governance, and supply chain security are hot topics. &lt;/p&gt;

&lt;h3&gt;
  
  
  Programming
&lt;/h3&gt;

&lt;p&gt;The DevOps or SRE role now takes on the cross-cutting concerns of developers, creating tooling that improves their productivity while enforcing standards. Solid software engineering practice and skill are required to craft high-quality platform components. &lt;/p&gt;

&lt;p&gt;I cannot stress this enough: good organizations look for solid programming experience in platform engineers. It is just as important in site reliability engineering, where you need to be fluent in programming and able to read, understand, and debug code written by others and, if necessary, fix it. &lt;/p&gt;

&lt;p&gt;Python and Golang are the most popular choices. My suggestion is Golang: with its strong concurrency support, strict type checking, broad adoption across organizations, mature toolchains, and the fact that many major cloud-native projects are built with it, it makes sense to learn it over Python. &lt;/p&gt;

&lt;p&gt;A few simple things you can try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write a CLI in your programming language.&lt;/li&gt;
&lt;li&gt;Learn to write a REST API and interact with databases&lt;/li&gt;
&lt;li&gt;Parallelism and Concurrency&lt;/li&gt;
&lt;/ul&gt;
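&lt;p&gt;For the first exercise, here is what a minimal CLI can look like using only Python's standard library (the tool name and flags are made up for illustration):&lt;/p&gt;

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI's arguments; argparse generates --help for free."""
    parser = argparse.ArgumentParser(prog="greet", description="Say hello.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true", help="uppercase output")
    return parser

def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    message = f"Hello, {args.name}!"
    return message.upper() if args.shout else message

print(main(["world"]))             # Hello, world!
print(main(["world", "--shout"]))  # HELLO, WORLD!
```

&lt;p&gt;The Go equivalent with the flag or cobra packages is a natural next step if you follow the Golang suggestion above.&lt;/p&gt;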

&lt;h3&gt;
  
  
  Infrastructure as Code
&lt;/h3&gt;

&lt;p&gt;Terraform is the de facto standard for infrastructure as code. Once you understand the concepts, it is easy to adapt to other tooling, as most of it is based on a similar declarative DSL. &lt;/p&gt;
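&lt;p&gt;The core idea is declaring desired state and letting the provider reconcile it. A tiny illustrative example (bucket name and region are placeholders):&lt;/p&gt;

```hcl
# Declarative infrastructure: "terraform apply" makes reality match this.
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-team-logs"

  tags = {
    ManagedBy = "terraform"
  }
}
```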

&lt;h3&gt;
  
  
  Cloud
&lt;/h3&gt;

&lt;p&gt;Most clouds work in much the same way, so if you know one cloud well, you can easily work with other providers. Focus on how to design applications using cloud-native components in a highly available, resilient, secure, and cost-effective way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Writing
&lt;/h3&gt;

&lt;p&gt;You might be wondering why I am talking about technical writing in a DevOps roadmap. A lot of folks don't give it enough attention, but how you communicate and work with other teams is super important. The future of work is remote, and email and Slack/Teams chats are the primary channels for conveying ideas to others. &lt;/p&gt;

&lt;p&gt;On a regular basis, you might be creating documents such as runbooks, postmortems, RFCs, architectural decision records, and software design docs, to name a few. A clear, easy-to-understand document does wonders: it saves your time and the reader's, and improves overall productivity. I suggest reading &lt;a href="https://blog.pragmaticengineer.com/becoming-a-better-writer-in-tech/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Site Reliability Engineering
&lt;/h3&gt;

&lt;p&gt;The boundary between DevOps and SRE is getting thin; in some organisations, the same person performs both roles. Understand the concepts behind SLIs, SLOs, error budgets, and SRE practices. Each organization does it differently, so I wouldn't suggest copy-pasting someone else's culture into your team. Refer to &lt;a href="https://sre.google/" rel="noopener noreferrer"&gt;Google's SRE culture&lt;/a&gt;.&lt;/p&gt;
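&lt;p&gt;The error-budget arithmetic behind those concepts is simple enough to sketch; assuming a 30-day window:&lt;/p&gt;

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO.
    E.g. a 99.9% target leaves 0.1% of the window as error budget."""
    total_minutes = days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes per 30 days
```

&lt;p&gt;That budget is what the team "spends" on risky releases and incidents; when it is exhausted, reliability work takes priority over features.&lt;/p&gt;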

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Personally, here is what I am excited about this year. This is not a definitive list, as it keeps changing with time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Mesh - Istio, Cilium Sidecarless mesh, Tetrate and Solo's Gloo mesh offering.&lt;/li&gt;
&lt;li&gt;How to improve Developer Productivity? It is a mix of culture, automation and tools. &lt;/li&gt;
&lt;li&gt;SRE Platforms - &lt;a href="https://Honeycomb.io" rel="noopener noreferrer"&gt;honeycomb&lt;/a&gt;, &lt;a href="https://last9.io/" rel="noopener noreferrer"&gt;Last9&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DevPortals - again linked with the motive of improving productivity and bridging the knowledge gap.&lt;/li&gt;
&lt;li&gt;Observability - technologies such as open telemetry, hypertrace, &lt;a href="https://thanos.io" rel="noopener noreferrer"&gt;Thanos&lt;/a&gt;, VictoriaMetrics, &lt;a href="https://vector.dev" rel="noopener noreferrer"&gt;Vector&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Security - supply chain security, code signing, tightening cloud security.&lt;/li&gt;
&lt;li&gt;Golang - improving the current skills.&lt;/li&gt;
&lt;li&gt;Serverless computing and event-driven architectures&lt;/li&gt;
&lt;li&gt;Web3 - understanding the landscape related to DevOps and Infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be curious and keep learning. Continuous bite-sized learning is easy to fit in alongside a full-time job. If you still have any questions, feel free to ping me on &lt;a href="https://www.twitter.com/anjuls" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I also curate cloud-native articles, tutorials and news in my weekly newsletter. Subscribe to &lt;a href="https://anjulsahu.substack.com" rel="noopener noreferrer"&gt;Cloud Native Weekly&lt;/a&gt; to get the latest updates.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
    </item>
    <item>
      <title>Machine Learning Orchestration on Kubernetes using Kubeflow</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 24 Mar 2021 05:22:07 +0000</pubDate>
      <link>https://forem.com/infracloud/machine-learning-orchestration-on-kubernetes-using-kubeflow-22nk</link>
      <guid>https://forem.com/infracloud/machine-learning-orchestration-on-kubernetes-using-kubeflow-22nk</guid>
      <description>&lt;h2&gt;
  
  
  MLOps: From Proof Of Concepts to Industrialization
&lt;/h2&gt;

&lt;p&gt;In recent years, AI and machine learning have seen tremendous growth across industries in various innovative use cases, and they are among the most important strategic trends for business leaders. When diving into the technology, the first step is usually small-scale experimentation on very basic use cases; the next step is to scale up operations. Sophisticated ML models help companies efficiently discover patterns, uncover anomalies, make predictions and decisions, and generate insights, and they are increasingly becoming a key differentiator in the marketplace. Companies recognise the need to move from proof of concepts to engineered solutions, and to move ML models from development to production. However, there is often a lack of consistency in tooling, and the development and deployment process is inefficient. As these technologies mature, we need operational discipline and sophisticated workflows to take advantage of them and operate at scale. This is popularly known as MLOps, ML CI/CD, or ML DevOps. In this article, we explore how this can be achieved with the Kubeflow project, which makes deploying machine learning workflows on Kubernetes simple, portable, and scalable.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLOps in Cloud Native World
&lt;/h3&gt;

&lt;p&gt;There are enterprise ML platforms like Amazon SageMaker, Azure ML, Google Cloud AI, and IBM Watson Studio in public cloud environments. For on-prem and hybrid environments, the most notable open-source platform is Kubeflow. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kubeflow?
&lt;/h2&gt;

&lt;p&gt;Kubeflow is a curated collection of machine learning frameworks and tools. It is a platform for data scientists and ML engineers who want to experiment with their models and design an efficient workflow to develop, test, and deploy at scale. It is a portable, scalable, open-source platform built on top of Kubernetes that abstracts the underlying Kubernetes concepts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EBkRU8i3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ne06co7uq6ycoxrdx2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EBkRU8i3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ne06co7uq6ycoxrdx2m.png" alt="Alt Text" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubeflow Architecture
&lt;/h2&gt;

&lt;p&gt;Kubeflow utilizes various cloud native technologies like Istio, Knative, Argo, and Tekton, and leverages Kubernetes primitives such as deployments, services, and custom resources. Istio and Knative provide capabilities like blue/green deployments, traffic splitting, canary releases, and auto-scaling. Kubeflow abstracts the Kubernetes components by providing a UI, a CLI, and easy workflows that non-Kubernetes users can use. &lt;/p&gt;
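&lt;p&gt;To make the traffic-splitting capability concrete, here is a minimal sketch of an Istio &lt;code&gt;VirtualService&lt;/code&gt; that keeps 90% of requests on a stable subset while a canary receives 10%. The service and subset names are hypothetical.&lt;/p&gt;

```yaml
# Hypothetical canary split: 90% of traffic to subset v1, 10% to subset v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
    - model-server
  http:
    - route:
        - destination:
            host: model-server
            subset: v1
          weight: 90
        - destination:
            host: model-server
            subset: v2
          weight: 10
```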

&lt;p&gt;For the ML capabilities, Kubeflow integrates best-in-class frameworks and tools such as TensorFlow, MXNet, Jupyter Notebooks, PyTorch, and Seldon Core. This integration provides data preparation, training, and serving capabilities. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNORGdRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pf9anu6tqv7imcps0vw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNORGdRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pf9anu6tqv7imcps0vw2.png" alt="Alt Text" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's look at Kubeflow Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central Dashboard&lt;/strong&gt;: The user interface for managing Kubeflow pipelines and interacting with the various components. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter Notebooks&lt;/strong&gt;: Lets you collaborate with other team members and develop models. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: Helps organize workflows by tracking and managing the metadata of artifacts. In this context, metadata means information about executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fairing&lt;/strong&gt;: Lets you run training jobs remotely by embedding them in a notebook or local Python code, and deploy prediction endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Store (Feast)&lt;/strong&gt;: It helps in feature sharing and reuse, serving features at scale, providing consistency between training and serving, point-in-time correctness, maintaining data quality and validation. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML Frameworks&lt;/strong&gt;: A collection of framework integrations, including Chainer (deprecated), MPI, MXNet, PyTorch, and TensorFlow, providing operators to run training jobs on Kubernetes. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Katib&lt;/strong&gt;: Implements automated machine learning through hyperparameter tuning (hyperparameters are the variables that control the model training process) and Neural Architecture Search (NAS), improving the predictive accuracy and performance of the model; it also provides a web UI to interact with Katib. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines&lt;/strong&gt;: Provides end-to-end orchestration and easy-to-reuse solutions that ease experimentation. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools for Serving&lt;/strong&gt;: There are two model serving systems that allow multi-framework model serving: KFServing, and Seldon Core. You can read more &lt;a href="https://www.kubeflow.org/docs/components/serving/overview/"&gt;about tools for serving here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are some of the Kubeflow Use Cases?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid multi-cloud ML platform at scale&lt;/strong&gt;: As Kubeflow is based on Kubernetes, it utilizes all the features and power that Kubernetes provides. This lets you design ML platforms that are portable and use the same APIs whether they run on-prem or in public clouds. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experimentation&lt;/strong&gt;: The easy UI and abstraction help with rapid experimentation and collaboration, and speed up development by providing guided user journeys. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DevOps for ML platforms&lt;/strong&gt;: Kubeflow pipelines help create reproducible workflows, which deliver consistency, save iteration time, and help with debugging, auditability, and compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tuning model hyperparameters during training&lt;/strong&gt;: During model development, hyperparameter tuning is often hard and time-consuming, yet it is critical for model performance and accuracy. Katib can reduce testing time and improve delivery speed by automating hyperparameter tuning. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
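&lt;p&gt;As a concrete illustration of how Katib automates this, below is a minimal sketch of a Katib &lt;code&gt;Experiment&lt;/code&gt; spec. The experiment name, metric name, and parameter range are hypothetical placeholders; adjust them to your training job.&lt;/p&gt;

```yaml
# Random search over the learning rate; Katib keeps the trial with the
# best reported validation accuracy. Names and ranges are placeholders.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-demo
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
```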

&lt;h2&gt;
  
  
  Kubeflow Demo
&lt;/h2&gt;

&lt;p&gt;Let's learn Kubeflow with an example. In this demo, we will try Kubeflow on a local Kind cluster. You should have a modern machine with at least 16 GB of RAM and 8 CPUs to try it locally; otherwise, use a VM in the cloud. We will use Zalando's Fashion-MNIST dataset and &lt;a href="https://github.com/anjuls/fashion-mnist-kfp-lab/blob/master/KF_Fashion_MNIST.ipynb"&gt;this notebook by &lt;em&gt;manceps&lt;/em&gt;&lt;/a&gt; for the demo. &lt;/p&gt;

&lt;p&gt;Due to a known &lt;a href="https://github.com/kubeflow/kfctl/issues/406"&gt;issue&lt;/a&gt;, I had to enable a few feature gates and extra API server arguments to make it work. &lt;br&gt;
Please use the following Kind configuration to create the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kind cluster configuration - kind.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "TokenRequest": true
  "TokenRequestProjection": true
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    metadata:
      name: config
    apiServer:
      extraArgs:
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
        "service-account-issuer": "kubernetes.default.svc"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the Kind cluster and install Kubeflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create Kind cluster
kind create cluster --config kind.yaml


# Deploy Kubeflow on Kind. 

mkdir -p /root/kubeflow/v1.0
cd /root/kubeflow/v1.0
wget https://github.com/kubeflow/kfctl/releases/download/v1.0/kfctl_v1.0-0-g94c35cf_linux.tar.gz

tar -xvf kfctl_v1.0-0-g94c35cf_linux.tar.gz         
export PATH=$PATH:/root/kubeflow/v1.0
export KF_NAME=my-kubeflow
export BASE_DIR=/root/kubeflow/v1.0
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml" 

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -f ${CONFIG_URI}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take 15-20 minutes to bring up all the services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❯ kubectl get pods -n kubeflow
NAME                                                     READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0               1/1     Running   0          19m
admission-webhook-deployment-5cd7dc96f5-4hsqr            1/1     Running   0          18m
application-controller-stateful-set-0                    1/1     Running   0          21m
argo-ui-65df8c7c84-dcm6m                                 1/1     Running   0          18m
cache-deployer-deployment-5f4979f45-6fvg2                2/2     Running   1          3m21s
cache-server-7859fd67f5-982mg                            2/2     Running   0          102s
centraldashboard-67767584dc-f5zhh                        1/1     Running   0          18m
jupyter-web-app-deployment-8486d5ffff-4cb8n              1/1     Running   0          18m
katib-controller-7fcc95676b-brk2q                        1/1     Running   0          18m
katib-db-manager-85db457c64-bb7dp                        1/1     Running   3          18m
katib-mysql-6c7f7fb869-c4qqx                             1/1     Running   0          18m
katib-ui-65dc4cf6f5-qrjpm                                1/1     Running   0          18m
kfserving-controller-manager-0                           2/2     Running   0          18m
kubeflow-pipelines-profile-controller-797fb44db9-hdnxc   1/1     Running   0          18m
metacontroller-0                                         1/1     Running   0          19m
metadata-db-6dd978c5b-wtglv                              1/1     Running   0          18m
metadata-envoy-deployment-67bd5954c-z8qrv                1/1     Running   0          18m
metadata-grpc-deployment-577c67c96f-ts9v6                1/1     Running   6          18m
metadata-writer-756dbdd478-7cbgj                         2/2     Running   0          18m
minio-54d995c97b-85xl6                                   1/1     Running   0          18m
ml-pipeline-7c56db5db9-9mswf                             2/2     Running   0          18s
ml-pipeline-persistenceagent-d984c9585-82qvs             2/2     Running   0          18m
ml-pipeline-scheduledworkflow-5ccf4c9fcc-mjrwz           2/2     Running   0          18m
ml-pipeline-ui-7ddcd74489-jw8gj                          2/2     Running   0          18m
ml-pipeline-viewer-crd-56c68f6c85-tszc4                  2/2     Running   1          18m
ml-pipeline-visualizationserver-5b9bd8f6bf-dj2r6         2/2     Running   0          18m
mpi-operator-d5bfb8489-9jzsf                             1/1     Running   0          4m27s
mxnet-operator-7576d697d6-7wj52                          1/1     Running   0          18m
mysql-74f8f99bc8-fddww                                   2/2     Running   0          18m
notebook-controller-deployment-5bb6bdbd6d-vx8tv          1/1     Running   0          18m
profiles-deployment-56bc5d7dcb-8x7vr                     2/2     Running   0          18m
pytorch-operator-847c8d55d8-zgh2x                        1/1     Running   0          18m
seldon-controller-manager-6bf8b45656-6k8r7               1/1     Running   0          18m
spark-operatorsparkoperator-fdfbfd99-5drsc               1/1     Running   0          19m
spartakus-volunteer-558f8bfd47-h2w62                     1/1     Running   0          18m
tf-job-operator-58477797f8-86z42                         1/1     Running   0          18m
workflow-controller-64fd7cffc5-77g6z                     1/1     Running   0          18m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you can access the Kubeflow dashboard via the &lt;code&gt;http2&lt;/code&gt; NodePort of the Istio ingress gateway (&lt;code&gt;$INGRESS_PORT&lt;/code&gt;), which can be fetched using the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
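&lt;p&gt;Once you have the port, the dashboard URL can be composed from the node address and &lt;code&gt;$INGRESS_PORT&lt;/code&gt;. A minimal sketch, using a hypothetical port value for illustration (with Kind, the node is reachable as localhost only if the port is mapped or forwarded):&lt;/p&gt;

```shell
# Compose the dashboard URL. 31380 is a hypothetical NodePort value;
# yours comes from the kubectl command above.
INGRESS_PORT=31380
NODE_IP=localhost   # or the node's external IP on a cloud VM
echo "http://${NODE_IP}:${INGRESS_PORT}"
```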



&lt;h3&gt;
  
  
  Let's Try an Experiment
&lt;/h3&gt;

&lt;p&gt;We will be using Zalando's Fashion-MNIST dataset to show basic classification using Tensorflow in this experiment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About the Dataset&lt;/strong&gt;&lt;br&gt;
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the exact image size and structure of training and testing splits.&lt;br&gt;
&lt;em&gt;source&lt;/em&gt;: &lt;a href="https://github.com/zalandoresearch/fashion-mnist"&gt;https://github.com/zalandoresearch/fashion-mnist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The whole experiment is sourced from &lt;em&gt;manceps&lt;/em&gt; notebook. Create a Jupyter notebook with the name &lt;code&gt;kf-demo&lt;/code&gt; using &lt;a href="https://github.com/anjuls/fashion-mnist-kfp-lab/blob/master/KF_Fashion_MNIST.ipynb"&gt;this notebook&lt;/a&gt;.   &lt;/p&gt;

&lt;p&gt;You can run the notebook from the dashboard and create the pipeline. Please note that in Kubeflow v1.2, there is an issue causing an &lt;code&gt;RBAC: permission denied&lt;/code&gt; error while connecting to the pipeline. This will be fixed in v1.3; you can read more about the issue &lt;a href="https://github.com/kubeflow/pipelines/issues/4440"&gt;here&lt;/a&gt;. As a workaround, you need to create an Istio &lt;code&gt;ServiceRoleBinding&lt;/code&gt; and an &lt;code&gt;EnvoyFilter&lt;/code&gt; to add an identity header. &lt;a href="https://gist.github.com/anjuls/43c7642ddb8be46c9d6095503a296862"&gt;Refer to this gist&lt;/a&gt; for the &lt;strong&gt;patch&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Kubeflow will orchestrate the various components to create the pipeline and run the ML experiment. You can access the results through the dashboard. Behind the scenes, Kubernetes pods, Argo workflows, etc., are created, which you don't need to worry about. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3FOZy1GM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xcmbgavzi8xps3riq810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3FOZy1GM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xcmbgavzi8xps3riq810.png" alt="Alt Text" width="800" height="100"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pods running the kf-demo notebook and pipeline&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I also noticed that when running the Pipeline in Kind, it complained about the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MountVolume.SetUp failed for volume "docker-sock" : hostPath type check
       failed: /var/run/docker.sock is not a socket file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To resolve this, I had to change the Argo workflow ConfigMap to use &lt;code&gt;pns&lt;/code&gt; instead of &lt;code&gt;docker&lt;/code&gt; as the container runtime executor.&lt;br&gt;&lt;br&gt;
&lt;a href="/assets/img/Blog/kubeflow/configmap.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/kubeflow/configmap.png" alt="ConfigMap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the change, re-run the experiment from the dashboard; it should then pass. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--66Jk4W4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5htvoacjytv8xe2efts4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--66Jk4W4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5htvoacjytv8xe2efts4.png" alt="Alt Text" width="800" height="385"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Experiment Flow&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--70bEBvjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/utphfauefi53y6cz1u8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--70bEBvjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/utphfauefi53y6cz1u8s.png" alt="Alt Text" width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Prediction Result&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you are looking to bring agility and improved management to machine learning operations in your organization, with enterprise-grade features such as RBAC, multi-tenancy and isolation, security, auditability, and collaboration, Kubeflow is an excellent option. It is stable, mature, and curated with best-in-class tools and frameworks, and it can be deployed on any Kubernetes distribution. See the &lt;a href="https://github.com/kubeflow/kubeflow/blob/master/ROADMAP.md"&gt;Kubeflow roadmap&lt;/a&gt; for what's coming in the next version. &lt;/p&gt;

&lt;p&gt;Hope this was helpful to you. Do try Kubeflow and share your experience by connecting with me on &lt;a href="https://twitter.com/anjuls"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021/"&gt;https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021/&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www2.deloitte.com/content/dam/insights/articles/6730_TT-Landing-page/DI_2021-Tech-Trends.pdf"&gt;https://www2.deloitte.com/content/dam/insights/articles/6730_TT-Landing-page/DI_2021-Tech-Trends.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.crn.com/news/cloud/5-emerging-ai-and-machine-learning-trends-to-watch-in-2021?itc=refresh"&gt;https://www.crn.com/news/cloud/5-emerging-ai-and-machine-learning-trends-to-watch-in-2021?itc=refresh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/blog/2019/07/30/deploy-your-machine-learning-models-with-kubernetes/"&gt;https://www.cncf.io/blog/2019/07/30/deploy-your-machine-learning-models-with-kubernetes/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://events19.linuxfoundation.org/wp-content/uploads/2018/02/OpenFinTech-MLonKube10112018-atin-and-sahdev.pdf"&gt;https://events19.linuxfoundation.org/wp-content/uploads/2018/02/OpenFinTech-MLonKube10112018-atin-and-sahdev.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thenewstack.io/how-kubernetes-could-orchestrate-machine-learning-pipelines/"&gt;https://thenewstack.io/how-kubernetes-could-orchestrate-machine-learning-pipelines/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/community/tutorials/kubernetes-ml-ops"&gt;https://cloud.google.com/community/tutorials/kubernetes-ml-ops&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kubernetes</category>
      <category>machinelearning</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Autonomous Log Monitoring and Incident Detection with Zebrium</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 10 Oct 2020 10:32:58 +0000</pubDate>
      <link>https://forem.com/infracloud/autonomous-log-monitoring-and-incident-detection-with-zebrium-2l9o</link>
      <guid>https://forem.com/infracloud/autonomous-log-monitoring-and-incident-detection-with-zebrium-2l9o</guid>
      <description>&lt;h2&gt;
  
  
  &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b20kkMvW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Zebrium-blog-by-Anjul-header-image--1024x224.png" alt="Zebrium blog by Anjul header image" width="800" height="175"&gt;Why do we need Autonomous Log Monitoring?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Everything fails, all the time – Werner Vogels, Amazon&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As Amazon's CTO once said, systems may fail even when they are thoughtfully designed with the utmost care and skill. It is therefore important to detect failures through automation, to reduce the burden on DevOps engineers and SREs. Driven by a high-velocity development lifecycle, developers use extensive prebuilt libraries and products to go to market as fast as they can. The onus is on the SREs to keep the service alive and keep the MTTR (mean time to recover) to a minimum. This becomes a problem when the system is a black box to the SREs and they have to put observability on top of it. Without knowing the internals, and without complete control over the logging information and metrics, they may sometimes run blindfolded until they learn more about the system and its new issues, and until they improve their playbooks or build a solution that prevents failures from recurring. That's the human way of solving problems: learning from mistakes.&lt;/p&gt;

&lt;p&gt;It is a common scenario in a large distributed system: when there is an incident, teams spend a lot of time capturing the right logs, parsing them, and trying to correlate them to find the root cause. Some teams do better: they automate log collection, aggregate logs to a common platform, and then do the hard work of searching the ocean of log data using tools like Elastic or Splunk. That works fine when you understand the log structure and all the components and know what to look for. But as mentioned above, it is really hard to keep the data structure consistent over a long time across all components. Most current log monitoring and collection tools just collect logs to a central place, parse the unstructured data, let you search or filter, and show visualizations or trends. What if the system generates a new type of log data or pattern that you have not automated for or planned for in advance? It becomes a problem.&lt;/p&gt;

&lt;p&gt;That is the point when you really need autonomous machine learning to scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Automation is the key to detect such incidents, anomalies in the system — and proactively try to prevent as much as possible to reduce the chances of failure and improve recovery time&lt;/em&gt;. — Google SRE Handbook&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Typically, when an incident occurs, support engineers manually peek into the ocean of logs and metrics to find interesting errors and warnings, and then start correlating various observations to come up with a root cause. This is a painfully slow process in which a lot of time is wasted. This is where Zebrium's machine learning capabilities help, by automatically correlating issues observed in the logs and metrics of various components to predict the root cause.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yaZ8uyT2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Zebrium%2520machine%2520learning.png%3Fwidth%3D843%26name%3DZebrium%2520machine%2520learning.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yaZ8uyT2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Zebrium%2520machine%2520learning.png%3Fwidth%3D843%26name%3DZebrium%2520machine%2520learning.png" alt="Zebrium machine learning" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Zebrium
&lt;/h2&gt;

&lt;p&gt;The Zebrium autonomous log and metrics monitoring platform uses machine learning to catch software incidents and show IT and cybersecurity teams the root cause. It is designed to be used with any application, and it is known for its ease of use and quick set-up; customers say the system often delivers initial results within hours of being installed. Unlike traditional monitoring and log management tools that require complex configuration and tuning to detect incidents, Zebrium’s approach of using unsupervised machine learning requires no manual configuration or human training. It was named one of the &lt;a href="https://www.forbes.com/sites/louiscolumbus/2020/07/05/gartners-top-25-enterprise-software-startups-to-watch-in-2020/#99d4a9e7822c"&gt;top 25 enterprise software startups to watch in 2020&lt;/a&gt; in the Gartner report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1xBGDjbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/Zebrium%2520-%2520how%2520it%2520works-1.gif%3Fwidth%3D1200%26name%3DZebrium%2520-%2520how%2520it%2520works-1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1xBGDjbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/Zebrium%2520-%2520how%2520it%2520works-1.gif%3Fwidth%3D1200%26name%3DZebrium%2520-%2520how%2520it%2520works-1.gif" alt="Zebrium - how it works-1" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gCPeGCt9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-09-13-15-25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gCPeGCt9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-09-13-15-25.png" alt="Zebrium Dashboard" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zebrium Dashboard&lt;/p&gt;

&lt;p&gt;Zebrium aggregates logs and metrics and makes them searchable using filters through easy navigation and drill-down. It also allows us to build alert rules — but most of the time you won’t have to! It uses unsupervised machine learning to autonomously learn the implicit structure of the log messages. It then cleanly organizes the content of each event type into tables with typed columns – perfect for fast and rich queries, reliable alerts, and high-quality pattern learning and anomaly detection. But most importantly, it uses machine learning to automatically catch problems and to show you root cause without you having to manually build any rules.&lt;/p&gt;
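&lt;p&gt;To get an intuition for what "learning the implicit structure" of log messages means, here is a toy sketch (not Zebrium's actual algorithm): collapse the variable fields of each log line into placeholders so that lines of the same event type share a template, then flag templates that occur rarely.&lt;/p&gt;

```python
import re
from collections import Counter

def event_type(line: str) -> str:
    """Collapse variable fields (hex ids, IPs, numbers) into placeholders,
    keeping the fixed wording that identifies the event type."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "#HEX", line)
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "#IP", line)
    line = re.sub(r"\d+", "#NUM", line)
    return line

def rare_events(lines, threshold=1):
    """Count occurrences of each event type and flag the rare ones,
    a crude stand-in for learned anomaly detection."""
    counts = Counter(event_type(line) for line in lines)
    return [etype for etype, n in counts.items() if n <= threshold]

logs = [
    "GET /api/v1/items 200 in 12ms",
    "GET /api/v1/items 200 in 9ms",
    "GET /api/v1/items 200 in 15ms",
    "disk error on device 0x1f3a: write failed",
]
print(rare_events(logs))  # the one-off disk error stands out
```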

&lt;p&gt;You can learn more about how it works &lt;a href="https://www.zebrium.com/product/how-zebrium-works"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FFx-xW18--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-09-13-55-50.png" alt="" width="800" height="345"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s7IC-qQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-10-15-37-47.png" alt="" width="800" height="318"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QHEAz1tP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-10-15-42-41.png" alt="" width="800" height="301"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;p&gt;Zebrium provides various types of log collectors that can pull logs from Kubernetes, Docker, Linux, ECS, Syslog, AWS Cloudwatch, and any type of application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/zebrium/ze-cloudwatch/raw/master/pkgs/zebrium_cloudwatch-1.0.zip"&gt;&lt;strong&gt;ze-cloudwatch&lt;/strong&gt; Lambda function&lt;/a&gt; – This is the typical pattern to pull logs from AWS Cloudwatch to the Zebrium Platform.&lt;/li&gt;
&lt;li&gt;Container/Docker Log Collector: You can refer to &lt;a href="https://github.com/zebrium/ze-docker-log-collector"&gt;this&lt;/a&gt; for more information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.zebrium.com/docs/setup/kubernetes/"&gt;&lt;strong&gt;Zebrium Kubernetes log collector&lt;/strong&gt;&lt;/a&gt; – By far the easiest way to stream data from the Kubernetes cluster. It takes less than 2 mins to set up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Metrics Collector&lt;/strong&gt; – Zebrium has &lt;a href="https://github.com/zebrium/ze-stats"&gt;created a metrics collector&lt;/a&gt; to pull Kubernetes metrics and push to the platform. It requires 4Gi memory for every 100 nodes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/zebrium/ze-fluentd-plugin"&gt;&lt;strong&gt;Zebrium FluentD collector&lt;/strong&gt;&lt;/a&gt; – An easy way to stream logs from a Linux host.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/zebrium/ze-log-forwarder"&gt;Log Forwarder&lt;/a&gt; – to send the Syslog or any raw log to the platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zebrium CLI (Ze)&lt;/strong&gt; – A flexible way to stream log data or upload log files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zebrium + ELK (ZELK) Stack — &lt;a href="https://www.zebrium.com/product/zelkstack"&gt;see here&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hU8wFm_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/zelkstack.png%3Fwidth%3D438%26name%3Dzelkstack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hU8wFm_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/zelkstack.png%3Fwidth%3D438%26name%3Dzelkstack.png" alt="zelkstack" width="673" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zebrium provides good integration with existing Elastic Stack (ELK Stack) clusters. You can even view the Zebrium incident dashboard inside Kibana by doing the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure an additional output plugin in your Logstash instance to send log events and metrics to Zebrium.&lt;/li&gt;
&lt;li&gt;Zebrium’s Autonomous Incident Detection and Root Cause will send incident details back to Logstash via a webhook input plugin.&lt;/li&gt;
&lt;li&gt;An incident summary, with drill-down into the incident events in Elasticsearch, is available directly from the Zebrium ML-Detected Incidents canvas in Kibana.&lt;/li&gt;
&lt;li&gt;For advanced drill-down and troubleshooting workflows, simply click on the Zebrium link in the Incident canvas.&lt;/li&gt;
&lt;/ol&gt;
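&lt;p&gt;Step 1 above can be sketched with Logstash’s generic &lt;code&gt;http&lt;/code&gt; output plugin. The endpoint URL and token below are placeholders, not Zebrium’s documented settings; the actual integration may use a dedicated plugin or different options.&lt;/p&gt;

```
# Sketch: forward Logstash events to an HTTP log-collector endpoint.
# URL and token are placeholders, not real Zebrium values.
output {
  http {
    url         => "https://cloud.example-zebrium.invalid/log/ingest"
    http_method => "post"
    format      => "json"
    headers     => { "Authorization" => "Token YOUR_ZE_TOKEN" }
  }
}
```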

&lt;h3&gt;
  
  
  Third-Party Integrations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zebrium’s Autonomous Incident &amp;amp; Root Cause Detection&lt;/strong&gt;  works in two modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It can autonomously detect and create incident alerts by applying machine learning to an incoming stream of logs and metrics. The incident alerts can be consumed via custom webhook, Slack, or email.&lt;/li&gt;
&lt;li&gt;Zebrium can also consume an external signal that indicates an incident that HAS occurred, and in response, it will create an incident report consisting of correlated log and metric anomalies, including likely root cause and symptoms surrounding the incident.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A special class of integrations relates to this second mode, including integrations with OpsGenie, PagerDuty, VictorOps, and Slack. Furthermore, Zebrium integration can be extended to any custom application using &lt;a href="https://docs.zebrium.com/docs/webhooks/"&gt;webhooks&lt;/a&gt;.&lt;/p&gt;
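&lt;p&gt;As a hedged sketch of the second mode, the snippet below builds a minimal JSON payload and shows the POST one might send to a custom webhook. The URL, token, and field names are hypothetical, not Zebrium’s documented schema; consult the webhooks documentation for the real format.&lt;/p&gt;

```shell
# Hypothetical sketch: signal an externally detected incident to a custom
# webhook. Endpoint and payload fields are placeholders, not Zebrium's API.
ZE_WEBHOOK_URL="https://example.invalid/zebrium/webhook"   # placeholder
PAYLOAD=$(printf '{"event_type":"incident","title":"%s","severity":"%s"}' \
  "Checkout latency spike" "critical")
echo "$PAYLOAD"
# curl -sS -X POST -H "Content-Type: application/json" \
#   -d "$PAYLOAD" "$ZE_WEBHOOK_URL"
```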

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Logical separation and an optional physical separation of data are possible. Each organization’s data is stored in its own schema with proper access control. For those who need further security (physical separation), a dedicated VPC is used.&lt;/li&gt;
&lt;li&gt;Multifactor authentication and encryption. Data at rest is encrypted with AES-256 encryption.&lt;/li&gt;
&lt;li&gt;Handling of sensitive data – Zebrium provides a way to filter out sensitive data/fields. It also provides a way to clinically remove data if uploaded accidentally.&lt;/li&gt;
&lt;li&gt;The system runs in AWS which has PCI DSS, Fedramp, and other leading industry security certifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What did I like about Zebrium?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Quick and easy onboarding with no manual training or rules setup differentiates this product from others.&lt;/li&gt;
&lt;li&gt;Comes with native collectors to consume logs from Kubernetes clusters, Docker, Linux, and AWS Cloudwatch.&lt;/li&gt;
&lt;li&gt;SaaS-based – provides easy access through the web and webhooks. This could be a problem for the few who want an on-premises setup.&lt;/li&gt;
&lt;li&gt;Integration with Elastic (ELK) is a plus.&lt;/li&gt;
&lt;li&gt;Unsupervised machine learning doesn’t require any manual input to train on the initial data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zebrium.com/blog/zebrium-grafana-awesome"&gt;Grafana integration&lt;/a&gt; is provided to chart Zebrium collected data.&lt;/li&gt;
&lt;li&gt;Easy to understand pricing structure. A $0 plan to try the core features. &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m-Ffyb4W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s.w.org/images/core/emoji/12.0.0-1/72x72/1f642.png" alt="🙂" width="72" height="72"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What can be Improved?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Integration with AWS Cloudwatch is provided, but not with other cloud providers such as Google Cloud and Azure.&lt;/li&gt;
&lt;li&gt;Integration with incident management systems such as ServiceNow, typically deployed in enterprises, is not documented. It may be possible using webhooks, but I haven’t tried it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Machine intelligence is the key to automating and scaling operations in a large enterprise environment: it can reduce operational cost by easing the load on DevOps/SRE teams and reduce MTTR, which can radically transform the business. With the unsupervised learning algorithm used by Zebrium, it becomes easier to find correlations between incidents and failures from log data and metrics without requiring human effort. &lt;strong&gt;Zebrium&lt;/strong&gt; provides simplified onboarding, requiring no configuration changes in the application and no human training, along with an easy-to-navigate UI. It is an appealing next-generation choice in the space of autonomous log and metric management platforms.&lt;/p&gt;

&lt;p&gt;Please try their &lt;a href="https://www.zebrium.com/pricing"&gt;free version&lt;/a&gt; to play around with the autonomous machine learning algorithm on your log data and let us know about your thoughts on autonomous log monitoring.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.infracloud.io/blogs/autonomous-log-monitoring-and-incident-detection-zebrium/"&gt;Autonomous Log Monitoring and Incident Detection with Zebrium&lt;/a&gt; appeared first on &lt;a href="https://www.infracloud.io"&gt;InfraCloud Technologies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>loggingandmonitoring</category>
      <category>machinelearning</category>
      <category>incident</category>
    </item>
    <item>
      <title>Running Oracle Database on Kubernetes and worried about Backup &amp; Recovery?</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 09 Sep 2020 12:06:05 +0000</pubDate>
      <link>https://forem.com/anjuls/running-oracle-database-on-kubernetes-and-worried-about-backup-recovery-2jea</link>
      <guid>https://forem.com/anjuls/running-oracle-database-on-kubernetes-and-worried-about-backup-recovery-2jea</guid>
      <description>&lt;p&gt;I have been using Oracle database for more than a decade and one of the challenging tasks as a DBA was always keeping the configurations in the consistent state across environments and I can't forget those nights when I had to recover the database when someone dropped critical data. &lt;/p&gt;

&lt;p&gt;Times have changed. In the seven years since the introduction of Docker and Kubernetes, improved resiliency and DevOps culture have improved the situation for most stateful applications, but Oracle has always discouraged running its database as a containerized application. &lt;/p&gt;

&lt;p&gt;If you are interested, read this &lt;a href="https://www.infracloud.io/blogs/oracle-database-backup-using-kasten-k10/"&gt;post&lt;/a&gt; where we discuss how to containerize the Oracle database and run it on Kubernetes. It also shows how to stop worrying about backup and recovery by using cloud-native solutions like the Kasten K10 platform, which takes snapshot-based backups of your Kubernetes application and its state (data) and provides application-consistent backups. We have tried this on Oracle 12c through 19c, and it works without any issues.&lt;/p&gt;

&lt;p&gt;Please &lt;a href="https://www.infracloud.io/blogs/oracle-database-backup-using-kasten-k10/"&gt;try&lt;/a&gt; and let me know your experience. &lt;/p&gt;

&lt;p&gt;Anjul&lt;/p&gt;

</description>
      <category>oracle</category>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>database</category>
    </item>
    <item>
      <title>The Ten Commandments of Container Security</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 30 Jul 2020 17:23:26 +0000</pubDate>
      <link>https://forem.com/infracloud/the-ten-commandments-of-container-security-1nnd</link>
      <guid>https://forem.com/infracloud/the-ten-commandments-of-container-security-1nnd</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lCkRas4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/07/container-security-header-image-cloud-native-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lCkRas4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/07/container-security-header-image-cloud-native-1.jpg" alt="container-security-cloud-native-header-image-blog" width="800" height="175"&gt;&lt;/a&gt;A cybersecurity incident can cause severe damage to the reputation of the organization and competitive disadvantage in the market, the imposition of penalties, and unwanted legal issues by end-users. On average, the cost of each data breach is USD 3.92 million as per this &lt;a href="https://www.ibm.com/security/data-breach"&gt;IBM report&lt;/a&gt;. The biggest challenges providing security in organizations are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lack of skills and training in security tools and practices&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lack of visibility and vulnerabilities&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous monitoring of the current state of security&lt;/p&gt;

&lt;p&gt;In a recent survey by Palo Alto Networks, the &lt;a href="https://www.paloaltonetworks.com/state-of-cloud-native-security"&gt;State of Cloud Native Security report&lt;/a&gt;, it was found that 94% of organizations use one or more cloud platforms and around 45% of their compute runs on containers or CaaS. The dominance of containers is increasing, and so are the security threats. The top issues identified as threats in the report are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data exposure and malware&lt;/li&gt;
&lt;li&gt;Application vulnerabilities&lt;/li&gt;
&lt;li&gt;Weak or broken authentication&lt;/li&gt;
&lt;li&gt;Misconfigurations&lt;/li&gt;
&lt;li&gt;Incorrect or over-permissive access&lt;/li&gt;
&lt;li&gt;Insider threats&lt;/li&gt;
&lt;li&gt;Credential leakage&lt;/li&gt;
&lt;li&gt;Insecure endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will go through some of the best practices, we can implement to reduce the security risks in the containerized workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 things to do to secure the Application Containers
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Source base image from trusted repositories
&lt;/h4&gt;

&lt;p&gt;When we create a container image, we often rely on a seed image sourced from popular private or public registries. Be aware that someone can penetrate the image supply chain and drop malicious code, opening the doors to attackers. For example, in 2018, &lt;a href="https://www.wired.com/story/british-airways-hack-details/"&gt;hackers targeted the British Airways web application with malicious JavaScript code&lt;/a&gt; by attacking its software supply chain. A couple of years back, Docker &lt;a href="https://threatpost.com/malicious-docker-containers-earn-crypto-miners-90000/132816/"&gt;identified a few images on Docker Hub&lt;/a&gt; that had cryptominers installed in them.&lt;/p&gt;

&lt;p&gt;Below are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When creating the container image, use a hardened base image sourced from a well-known, trusted publisher. &lt;/li&gt;
&lt;li&gt;Pick images that are published frequently with the latest security fixes and patches. &lt;/li&gt;
&lt;li&gt;Use signed and labeled images (sign with &lt;a href="https://docs.docker.com/notary/getting_started/"&gt;Notary&lt;/a&gt; or similar tools) and verify the authenticity of the image during the pull to stop man-in-the-middle attacks.&lt;/li&gt;
&lt;/ul&gt;
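&lt;p&gt;Image signature verification from the tips above can be enabled with Docker Content Trust; with it on, &lt;code&gt;docker pull&lt;/code&gt; verifies Notary signatures and rejects unsigned tags. The pull itself is commented out below because it requires a Docker host.&lt;/p&gt;

```shell
# Docker Content Trust: once enabled in the environment, docker pull
# verifies Notary signatures and refuses unsigned tags.
export DOCKER_CONTENT_TRUST=1
echo "content trust enabled: DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST"
# docker pull alpine:3.12   # now fails unless the tag is signed
```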

&lt;h4&gt;
  
  
  2. Install verified packages
&lt;/h4&gt;

&lt;p&gt;Just as the base image needs to come from trusted sources, the packages installed on top of it also need to come from verified and trusted sources, for the same reason.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Minimize attack surface in the Image
&lt;/h4&gt;

&lt;p&gt;By attack surface I mean the number of packages and libraries installed in the image. Common sense says that the fewer objects there are, the smaller the chance of a vulnerability. Keep the image size minimal while satisfying the application’s runtime requirements. Preferably, run only a single application in one application container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove unnecessary tools and software from the image: package managers (e.g. yum, apt), network tools and clients, shells, and netcat (which can be used to &lt;a href="https://www.hackingtutorials.org/networking/hacking-netcat-part-2-bind-reverse-shells/"&gt;create a reverse shell&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;Use the multi-stage Dockerfiles to remove software build components out of production images. &lt;/li&gt;
&lt;li&gt;Do not expose unnecessary network ports, sockets or run unwanted services (e.g. SSH daemon) in the container to reduce threats.&lt;/li&gt;
&lt;li&gt;Choose alpine images or scratch images or container optimized OS as compared to full-blown OS images for the base image.&lt;/li&gt;
&lt;/ul&gt;
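&lt;p&gt;The multi-stage tip above can be sketched as follows; all names (the Go toolchain, paths, UID) are illustrative, not from a real project. The build toolchain stays in the first stage, and the final image contains only the compiled binary on a &lt;code&gt;scratch&lt;/code&gt; base.&lt;/p&gt;

```dockerfile
# Build stage: compiler and sources live only here.
FROM golang:1.15 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./...

# Final stage: empty base image, single static binary, non-root user.
FROM scratch
COPY --from=build /app /app
USER 10001
ENTRYPOINT ["/app"]
```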

&lt;h4&gt;
  
  
  4. Do not bake secrets in the image
&lt;/h4&gt;

&lt;p&gt;All secrets should be kept out of the image and the Dockerfile. Secrets include SSL certificates, passwords, tokens, API keys, etc.; they should be stored externally and securely mounted through the container orchestration engine or an external secret manager. Tools like HashiCorp Vault, cloud secret management services such as AWS Secrets Manager, Kubernetes secrets, &lt;a href="https://www.docker.com/blog/docker-secrets-management/"&gt;Docker secrets management&lt;/a&gt;, CyberArk, etc. can improve the security posture.&lt;/p&gt;
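&lt;p&gt;A minimal sketch of the Kubernetes approach, with hypothetical names (&lt;code&gt;db-credentials&lt;/code&gt;, &lt;code&gt;/etc/secrets&lt;/code&gt;): the secret is mounted as a read-only file at runtime instead of being baked into the image.&lt;/p&gt;

```yaml
# Sketch: mount a Kubernetes Secret as a read-only file volume.
# All names here are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    volumeMounts:
    - name: creds
      mountPath: /etc/secrets
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: db-credentials
```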

&lt;h4&gt;
  
  
  5. Use of Secure Private or Public Registries
&lt;/h4&gt;

&lt;p&gt;Enterprises often have their own base images with proprietary software and libraries that they don’t want to distribute publicly. Ensure the image is hosted on a secure, trusted registry to prevent unauthorized access. Use a TLS certificate from a trusted root CA and implement strong authentication to prevent man-in-the-middle (MITM) attacks.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Do not use privileged or root user to run the application in a container
&lt;/h4&gt;

&lt;p&gt;This is the most common misconfiguration in containerized workloads. With the principle of least privilege in mind, create an application user and use it to run the application process inside the container. Why not root? A process running in a container is similar to a process running on the host operating system, except that it carries additional metadata identifying it as part of a container. With the UID and GID of the root user in a container, you can access and modify files written by root on the host machine.&lt;/p&gt;

&lt;p&gt;Note – If you don’t define a USER in the Dockerfile, the container will generally run as the root user.&lt;/p&gt;
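&lt;p&gt;A minimal sketch of a non-root image (the user name and IDs are illustrative):&lt;/p&gt;

```dockerfile
# Sketch: create an unprivileged user and switch to it before the app runs.
FROM alpine:3.12
RUN addgroup -g 1001 app
RUN adduser -D -G app -u 1001 app
USER app
CMD ["id"]
```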

&lt;h4&gt;
  
  
  7. Implement image vulnerability scanning in CI/CD
&lt;/h4&gt;

&lt;p&gt;When designing CI/CD for container build and delivery, include an image scanning solution to identify vulnerabilities (CVEs), and do not deploy exploitable images without remediation. Tools like Clair, Snyk, Anchore, Aqua Security, and Twistlock can be used. Some container registries, such as AWS ECR and Quay.io, come equipped with scanning solutions; do use them.&lt;/p&gt;
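&lt;p&gt;As one possible sketch (GitLab CI syntax, hypothetical image name), a scan step can fail the pipeline when high or critical CVEs are found. Trivy’s &lt;code&gt;--exit-code&lt;/code&gt; and &lt;code&gt;--severity&lt;/code&gt; flags are real; the rest is illustrative and assumes Trivy is available in the job image.&lt;/p&gt;

```yaml
# Sketch: fail the pipeline on high/critical CVEs before deploying.
scan:
  stage: test
  script:
    - trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/app:$CI_COMMIT_SHA
```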

&lt;h4&gt;
  
  
  8. Enable kernel security profiles like AppArmor
&lt;/h4&gt;

&lt;p&gt;AppArmor is a Linux security module that protects the OS and its applications from security threats. Docker provides a default profile that restricts a program to a limited set of resources such as network access, kernel capabilities, and file permissions. It reduces the potential attack surface and provides good defense in depth.&lt;/p&gt;
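&lt;p&gt;A profile is attached at run time with &lt;code&gt;--security-opt&lt;/code&gt;; the sketch below only echoes the command, since running it needs a Docker host. &lt;code&gt;docker-default&lt;/code&gt; is the profile Docker ships; a custom profile must first be loaded on the host.&lt;/p&gt;

```shell
# Sketch: run a container under an explicit AppArmor profile.
# Echoed rather than executed, since it requires a Docker host.
PROFILE="docker-default"
echo "docker run --security-opt apparmor=$PROFILE --rm alpine:3.12 sh"
```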

&lt;h4&gt;
  
  
  9. Secure centralized and remote logging
&lt;/h4&gt;

&lt;p&gt;Containers usually log everything to STDOUT, and these logs are lost once the container is terminated, so it is important to securely stream logs to a centralized system for audit and future forensics. We also need to ensure that this logging system itself is secured and that there is no data leakage from the logs. &lt;/p&gt;

&lt;h4&gt;
  
  
  10. Deploy runtime security monitoring
&lt;/h4&gt;

&lt;p&gt;Even if you deploy vulnerability scanning solutions based on repository data and take all necessary precautions, there is still a chance of being compromised. It is important to continuously monitor and log application behavior to prevent and detect malicious activities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;There is no silver bullet solution with Cyber Security, a layered defence is the only viable defence. – ICIT Research&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By implementing the above best practices, you can make it harder for an attacker to find ways to exploit your system. Below are some tools and references that can be used to audit and secure containers. Security is a vast topic; we haven’t covered Kubernetes-specific controls in this article, but stay tuned for a follow-up article focusing on Kubernetes security best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;p&gt;To simplify the adoption of security controls, here are a few open source and commercial offerings that can be used to discover the current state of your workloads and generate advisories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/docker/docker-bench-security"&gt;docker-bench-security&lt;/a&gt; – Official tool from Docker to audit container workloads against the industry-standard &lt;a href="https://www.cisecurity.org/benchmark/docker/"&gt;CIS Benchmark&lt;/a&gt; for Docker.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/hadolint/hadolint"&gt;Hadolint linter for Dockerfiles&lt;/a&gt; – Use the linter for static analysis of Dockerfiles. It helps enforce best practices and integrates with popular code editors and CI pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/quay/clair"&gt;Clair&lt;/a&gt; – A popular static vulnerability scanner for application containers. It regularly sources metadata from various vulnerability databases. Alternatives are &lt;a href="https://anchore.com/opensource/"&gt;Anchore&lt;/a&gt;, &lt;a href="http://synk.io"&gt;Snyk&lt;/a&gt;, and &lt;a href="https://github.com/aquasecurity/trivy"&gt;Trivy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html"&gt;OWASP Cheat Sheet&lt;/a&gt; – OWASP is an open community popular among security experts, and this cheat sheet is a good starting point.&lt;/li&gt;
&lt;li&gt;OpenSCAP for containers – Security Content Automation Protocol (SCAP) is a multi-purpose framework of specifications that supports automated configuration, vulnerability and patch checking, technical control compliance activities, and security measurement. It implements NIST standards.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sysdig.com/opensource/falco/"&gt;Sysdig Falco&lt;/a&gt; – Falco can be used to implement runtime security. It uses efficient eBPF instrumentation to intercept system calls and traffic for real-time monitoring and forensics. As attackers evolve, new vulnerabilities are discovered that static scanning tools often miss, so a solution with continuous behavioral monitoring and advanced AI/ML-based engines can’t be left off the list of essentials.&lt;/li&gt;
&lt;li&gt;Commercial offerings from Aqua Security, Twistlock, Sysdig, Snyk, and Qualys for enterprise-grade security tools and solutions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.stackrox.com/post/2020/04/container-image-security-beyond-vulnerability-scanning/"&gt;https://www.stackrox.com/post/2020/04/container-image-security-beyond-vulnerability-scanning/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.aquasec.com/docker-security-best-practices"&gt;https://blog.aquasec.com/docker-security-best-practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techbeacon.com/security/10-top-open-source-tools-docker-security"&gt;https://techbeacon.com/security/10-top-open-source-tools-docker-security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-190.pdf"&gt;NIST Application Container Security&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cisecurity.org/benchmark/docker/"&gt;CIS Benchmark for Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/solutions/best-practices-for-building-containers"&gt;https://cloud.google.com/solutions/best-practices-for-building-containers&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do comment if you have an interesting security incident or a preventable hack involving containers that you want to share with the community.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.infracloud.io/blogs/top-10-things-for-container-security/"&gt;The Ten Commandments of Container Security&lt;/a&gt; appeared first on &lt;a href="https://www.infracloud.io"&gt;InfraCloud Technologies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>container</category>
      <category>security</category>
      <category>bestsecuritypractice</category>
      <category>cloudnativesecurity</category>
    </item>
  </channel>
</rss>
