<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CloudRaft</title>
    <description>The latest articles on Forem by CloudRaft (@cloudraft).</description>
    <link>https://forem.com/cloudraft</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7106%2F903d31b8-0b9d-470a-ae32-fc539e2ca218.png</url>
      <title>Forem: CloudRaft</title>
      <link>https://forem.com/cloudraft</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cloudraft"/>
    <language>en</language>
    <item>
      <title>Heroku to Kubernetes Migration: Clock is ticking</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 11 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</link>
      <guid>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</guid>
      <description>&lt;p&gt;For years, Heroku has been a beloved starting point for countless high-growth companies. It was revolutionary, making the deployment of an idea almost trivial. That focus on the developer experience—on simply pushing code and having it run—is why so many successful Minimum Viable Products (MVPs) and early-stage platforms were born there. It allowed engineering leadership to focus on product-market fit (PMF) instead of infrastructure.&lt;/p&gt;

&lt;p&gt;But a platform that simplifies everything also imposes limits, and for any company that has scaled past the initial bootstrap phase, those limits eventually hit two core metrics: &lt;strong&gt;control&lt;/strong&gt; and &lt;strong&gt;cost&lt;/strong&gt;. What starts as the fastest way to market often becomes a budget bottleneck and a strategic constraint.&lt;/p&gt;

&lt;p&gt;Today, with new structural changes at Heroku, the conversation about migration is no longer a matter of "if" or "when," but "now." For any business running a production-critical, profitable service, moving to Kubernetes is no longer just an optimization—it’s a necessary step to secure the next decade of growth and maintain technical sovereignty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Shift at Heroku
&lt;/h2&gt;

&lt;p&gt;On February 6, 2026, Heroku &lt;a href="https://www.heroku.com/blog/an-update-on-heroku/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; a significant strategic realignment. The platform is now transitioning to what they call a sustaining engineering model.&lt;/p&gt;

&lt;p&gt;What does that actually mean for you as a business? It means a shift in investment priority. Heroku remains a stable, production-ready environment, with continued focus on core areas like security, stability, reliability, and support. For existing credit card-paying customers, the day-to-day operations and services remain unchanged.&lt;/p&gt;

&lt;p&gt;The critical piece of news, however, is that Enterprise Account contracts will no longer be offered to new customers. While existing enterprise contracts will be honored, this decision sends a clear strategic signal: Salesforce, the parent of Heroku, is focusing its future engineering efforts elsewhere—specifically on helping organizations build and deploy enterprise-grade AI in a secure way, rather than focusing on the core, undifferentiated platform features that many growth companies rely on.&lt;/p&gt;

&lt;p&gt;In short, the platform you relied on for your MVP is telling you, quite clearly, that its main focus is changing. For a high-growth business, relying on a platform that has decided to stop innovating in your core area of need is an unacceptable risk. The decision to migrate has now moved from a "good idea" to a strategic imperative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Kubernetes a good choice?
&lt;/h2&gt;

&lt;p&gt;The cloud landscape has matured dramatically since Heroku first took center stage. While Heroku pioneered the developer-first experience, Kubernetes has become the industry standard, and the majority of companies already run it in production. For any company that has achieved PMF, Kubernetes offers benefits that directly address the pain points of a scaled Heroku implementation. You may ask: why not use alternative products like Portainer, Render, or Fly.io? You can, but the larger goal remains the same: gaining more control over your platform and your spending, and Kubernetes delivers that control directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reclaiming Sovereignty and Control
&lt;/h3&gt;

&lt;p&gt;With Heroku, you are a tenant in a strictly controlled environment. That simplicity is powerful, but it comes at the cost of ultimate control. Kubernetes flips that dynamic. It gives you the blueprint for your entire infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multicloud and Hybrid Strategy:&lt;/strong&gt; Kubernetes is a universal API for infrastructure. It provides the freedom to easily shift workloads between major cloud providers (AWS, GCP, Azure), deploy on-premise, or adopt a hybrid strategy. This ability to change providers is a powerful negotiating tool and a key piece of business continuity planning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Sales Enablement:&lt;/strong&gt; For B2B SaaS companies, especially those with AI-native features, enterprise customers often require strict data sovereignty. They need to self-host services on their own virtual private clouds or on-premise. Heroku's architecture simply cannot support this. A Kubernetes-based platform enables you to offer a self-deployed version of your SaaS product, unlocking massive new markets in highly regulated or security-conscious industries. The control Kubernetes offers over data residency and compliance is non-negotiable for selling to large enterprise customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Efficiency
&lt;/h3&gt;

&lt;p&gt;The Heroku pricing model is famously straightforward: it’s easy to calculate, but it is expensive as you scale. This is the trade-off for simplicity.&lt;/p&gt;

&lt;p&gt;By moving to Kubernetes, you gain fine-grained control over resource allocation. You can right-size your instances, consolidate workloads, and select the most cost-effective machine types for specific tasks. While the initial setup requires more attention, the long-term cost savings are significant, especially for services with unpredictable or high-volume usage.&lt;/p&gt;

&lt;p&gt;The ecosystem itself has worked to smooth out the initial complexity. Major cloud providers now offer "autopilot" in their managed Kubernetes services that handle much of the underlying operational overhead. This means you can gain the cost and control benefits of Kubernetes without the burden of building a huge platform engineering team.&lt;/p&gt;

&lt;p&gt;At CloudRaft, we recognize the need to simplify this process. We’ve built an accelerator called TurboRaft that is essentially a proven playbook for the modern Kubernetes platform. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps with ArgoCD:&lt;/strong&gt; For zero-touch, automated, and auditable releases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Secured secret management, automated certificate management, SAST, SBOMs and vulnerability management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Open-source monitoring with options to choose from and alerting to keep costs low while maintaining deep insight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Clear policies enforced for compliance and cost control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to deliver the "Heroku-like" ease of use for developers, but on a platform you own and control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maturation of the Kubernetes ecosystem
&lt;/h3&gt;

&lt;p&gt;A few years ago, managing Kubernetes was a job for seasoned experts. Today, the complexity angle has been largely mitigated by a robust and mature ecosystem. Open-source tooling, managed cloud services, and a deep community knowledge base have all contributed to making K8s a practical and reliable choice.&lt;/p&gt;

&lt;p&gt;The old argument that "Kubernetes is too complex" is mostly obsolete for a growing company. The market has solved the hardest parts. What’s left is a highly stable platform that provides the operational rigor required to run business-critical services. The Hacker News discussion &lt;a href="https://news.ycombinator.com/item?id=37379078" rel="noopener noreferrer"&gt;thread&lt;/a&gt; on the Heroku news highlights this exact sentiment, with many leaders realizing that the ecosystem is ready for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  A structured approach to migration
&lt;/h2&gt;

&lt;p&gt;No platform migration is easy; it’s a non-trivial engineering effort that must be planned as a business-critical project. Done correctly, it is an opportunity to not just move your app, but to make it stronger and more resilient for the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Assessment and Re-Architecture
&lt;/h3&gt;

&lt;p&gt;This is the most crucial phase. A migration should also be seen as a refactoring opportunity. If your application isn't strictly following cloud-native principles or the &lt;strong&gt;Twelve-Factor App&lt;/strong&gt; methodology, now is the time to correct it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk Identification:&lt;/strong&gt; We begin with a full risk assessment, examining each service in the application. We categorize them by current stability, coupling, and size to create a phased migration plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sizing and Cost Modeling:&lt;/strong&gt; Understanding the true resource needs of each service allows us to create accurate Kubernetes deployment specifications and a detailed cost projection for the new platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Simplifying the Developer Experience
&lt;/h3&gt;

&lt;p&gt;The biggest win of Heroku was the abstraction of infrastructure. We need to replicate that ease of use on Kubernetes. Developers should not need to become Kubernetes experts overnight.&lt;/p&gt;

&lt;p&gt;We convert services into Kubernetes deployments using Helm charts, then we abstract the low-level Kubernetes constructs. The goal is a simplified interface—whether it’s a basic YAML or JSON configuration—that lets developers manage their application settings without worrying about the underlying cluster management. This retains the core developer efficiency that made Heroku so appealing.&lt;/p&gt;
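
&lt;p&gt;As a minimal sketch (the field names and the &lt;code&gt;render_deployment&lt;/code&gt; helper are illustrative, not an actual TurboRaft interface), such an abstraction layer can be as thin as a function that expands a small app config into a full Kubernetes Deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical example: expand a Heroku-style app config into a
# Kubernetes Deployment manifest. All field names are illustrative.

def render_deployment(app):
    """Translate a simplified developer config into a Deployment spec."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app["name"]},
        "spec": {
            "replicas": app.get("replicas", 1),
            "selector": {"matchLabels": {"app": app["name"]}},
            "template": {
                "metadata": {"labels": {"app": app["name"]}},
                "spec": {
                    "containers": [{
                        "name": app["name"],
                        "image": app["image"],
                        "ports": [{"containerPort": app.get("port", 8080)}],
                    }]
                },
            },
        },
    }

# Developers edit only the small input config, never raw cluster YAML.
web = render_deployment({"name": "web", "image": "registry.example.com/web:1.4", "replicas": 3})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;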

&lt;h3&gt;
  
  
  Step 3: The Data Migration Challenge
&lt;/h3&gt;

&lt;p&gt;Applications are often the easy part; the database is where the real complexity lies. A successful migration requires a strategy for moving data with near-zero downtime.&lt;/p&gt;

&lt;p&gt;We strongly recommend self-hosted database solutions on Kubernetes, particularly CloudNativePG for PostgreSQL. Running your own highly-available, self-managed database on Kubernetes removes the premium cost of proprietary cloud-managed services while providing superior control over failover and disaster recovery. We’ve found CloudNativePG to be highly reliable, and we offer &lt;a href="http://www.cloudraft.io/postgresql-consulting" rel="noopener noreferrer"&gt;full consulting and support&lt;/a&gt; to ensure a smooth, near-zero-downtime data migration. Database upgrades and management were easy on Heroku; with CloudNativePG and our best practices, you can keep your database on autopilot as well.&lt;/p&gt;
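
&lt;p&gt;For reference, a minimal CloudNativePG &lt;code&gt;Cluster&lt;/code&gt; manifest is only a few lines (the name, instance count, and storage size below are placeholders to adjust for your workload):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3      # one primary plus two replicas with automated failover
  storage:
    size: 20Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;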

&lt;h2&gt;
  
  
  The time to act is now
&lt;/h2&gt;

&lt;p&gt;The shift at Heroku is a clear alarm bell. Ignoring it means accepting escalating costs and a growing strategic risk. You now have a proven, mature, and cost-effective alternative in Kubernetes.&lt;/p&gt;

&lt;p&gt;Success in this migration hinges on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Selecting a Proven Playbook:&lt;/strong&gt; You need a tested, end-to-end framework that accounts for application, database, and operational complexities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Right Team:&lt;/strong&gt; You need a partner who has navigated this journey before and can deliver the platform quickly, abstracting away the unnecessary complexity while leaving you with full control.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where &lt;strong&gt;CloudRaft&lt;/strong&gt; comes in. We offer not just the accelerator, but the &lt;a href="https://dev.to/kubernetes-consulting"&gt;consulting&lt;/a&gt; and operational support to execute the migration and hand over a platform that is ready for enterprise-level growth. Don't wait until the cost pressure or strategic uncertainty becomes a crisis—secure your future with a modern, controlled, and cost-efficient Kubernetes platform today.&lt;/p&gt;

</description>
      <category>heroku</category>
    </item>
    <item>
      <title>Context Graphs for AI Agents: The Complete Implementation Guide</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</link>
      <guid>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</guid>
      <description>&lt;h2&gt;
  
  
  Why Context Graphs Matter Now for AI Agents
&lt;/h2&gt;

&lt;p&gt;In the past few months, AI has shifted from chatbots to agents: autonomous systems that don't just answer questions but make decisions, approve exceptions, route escalations, and execute workflows across enterprise systems. &lt;a href="https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/" rel="noopener noreferrer"&gt;Foundation Capital&lt;/a&gt; recently called this shift AI's "trillion-dollar opportunity," arguing that enterprise value is migrating from traditional systems of record to systems that capture decision traces: the "why" behind every action.&lt;/p&gt;

&lt;p&gt;But here's the problem: agents deployed without proper context infrastructure are failing at scale, with customers reporting "1,000+ AI instances with no way to govern them" and "all kinds of agentic tools that none talk to each other," as stated in &lt;a href="https://metadataweekly.substack.com/p/context-graphs-are-a-trillion-dollar" rel="noopener noreferrer"&gt;Metadata Weekly&lt;/a&gt;. The issue isn't the AI models themselves; it's that agents lack the structured knowledge foundation they need to reason reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Missing Infrastructure: Relationship-Based Context
&lt;/h3&gt;

&lt;p&gt;47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024, according to the &lt;a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/" rel="noopener noreferrer"&gt;MIT Sloan Management Review&lt;/a&gt;. Even when agents don't hallucinate outright, they struggle with multi-step reasoning that requires connecting distant facts across systems. An agent might know a customer filed a complaint and know about a recent product defect and know the refund policy, but fail to connect these relationships to understand why an exception should be granted.&lt;/p&gt;

&lt;p&gt;As Prukalpa Sankar, co-founder of Atlan, frames it: "In 2025, in the dawn of the AI era, context is king" in her &lt;a href="https://atlan.com/know/closing-the-context-gap/" rel="noopener noreferrer"&gt;article&lt;/a&gt;. Context Graphs provide this missing infrastructure by organizing information as an interconnected network of entities and relationships, enabling &lt;a href="https://dev.to/ai-solutions"&gt;AI agents&lt;/a&gt; to traverse meaningful connections, reason across multiple facts, and deliver explainable decisions.&lt;/p&gt;

&lt;p&gt;This comprehensive guide explains what Context Graphs are, how they work, and why they're becoming essential infrastructure for enterprise AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Context Graph? Definition, Use Cases &amp;amp; Implementation Guide
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" alt="Context Graph" width="1536" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Context Graphs Work
&lt;/h3&gt;

&lt;p&gt;Context Graphs transform raw data into a semantic network of nodes (entities like people or projects), directed edges (relationships such as "worked_on" or "depends_on"), and properties (key-value details on both). This structure enables AI agents to perform graph traversals, starting from a query node and following relevant edges, for dynamic context assembly and multi-hop reasoning, unlike rigid keyword or vector searches.&lt;/p&gt;

&lt;h4&gt;
  
  
  Core Components:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes:&lt;/strong&gt; Represent real-world entities (e.g. "ProjectX"). Each holds properties like name, type, or timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges:&lt;/strong&gt; Directed connections with types (e.g. → "worked_on" →) and properties (e.g. role: "lead", duration: "6 months"). Directions indicate flow, like cause-effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Properties:&lt;/strong&gt; Metadata attached to nodes/edges (e.g., confidence score on an edge), enabling filtered traversals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Traversal Process:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query Entry:&lt;/strong&gt; Input like "API security projects" matches starting nodes via properties or embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neighbor Expansion:&lt;/strong&gt; Fetch adjacent nodes/edges, prioritizing by relevance (e.g., recency, strength).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Hop Pathfinding:&lt;/strong&gt; Traverse 2-4 hops (e.g. Project → worked_on → Engineer → similar_to → AuthSystem), using algorithms like BFS or HNSW-inspired graphs for efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Assembly:&lt;/strong&gt; Aggregate paths into a subgraph, feeding it to LLMs for grounded reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Log the path for auditing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mirrors vector DB indexing (e.g. HNSW in Pinecone) but emphasizes relational paths over pure similarity.&lt;/p&gt;
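
&lt;p&gt;The traversal process above can be sketched in a few lines. This is a deliberately simplified in-memory version (the graph contents, entity names, and relations are illustrative); a production system would issue the equivalent query against a graph database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

# Tiny illustrative graph: adjacency list of (relation, neighbor) pairs.
GRAPH = {
    "ProjectX":   [("worked_on_by", "Alice")],
    "Alice":      [("also_worked_on", "AuthSystem")],
    "AuthSystem": [("depends_on", "OAuth2")],
}

def traverse(start, max_hops=3):
    """Breadth-first expansion up to max_hops, recording each path for explainability."""
    paths = []
    queue = deque([(start, [start], 0)])
    seen = {start}
    while queue:
        node, path, depth = queue.popleft()
        if depth == max_hops:
            continue  # cap traversal depth
        for relation, neighbor in GRAPH.get(node, []):
            new_path = path + [relation, neighbor]
            paths.append(new_path)  # every path is an auditable reasoning trace
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, new_path, depth + 1))
    return paths

paths = traverse("ProjectX")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned paths double as the audit trail for Step 5: the longest one reads ProjectX → worked_on_by → Alice → also_worked_on → AuthSystem → depends_on → OAuth2.&lt;/p&gt;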

&lt;h4&gt;
  
  
  Example in Action:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Traditional Vector Search (e.g., Pinecone nearest-neighbor):&lt;/strong&gt; "API security projects" → Returns docs with similar embeddings (e.g. 3 keyword matches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Graph Traversal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="n"&gt;cypher&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;RELATED_TO&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Topic&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'API Security'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;related&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start:&lt;/strong&gt; Projects tagged "API Security".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 1:&lt;/strong&gt; → worked_on_by → Engineers (properties: skills="OAuth").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 2:&lt;/strong&gt; Engineers → also_worked_on → AuthSystems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 3:&lt;/strong&gt; AuthSystems → depends_on → OAuthProtocols (properties: version="2.0").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Subgraph with projects, team, deps, contributors—plus path visualization for explainability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Characteristics of Context Graphs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship-Centric Design:&lt;/strong&gt; Context Graphs prioritize connections over isolated records. This makes it natural to understand how concepts relate, not just what they contain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Hop Reasoning:&lt;/strong&gt; The graph structure enables AI to connect distant concepts through intermediate relationships, reasoning across multiple steps just as humans do. Example: Connecting "customer complaint" → "product defect" → "supplier issue" → "quality control process" in three hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Context Assembly:&lt;/strong&gt; Rather than retrieving fixed search results, Context Graphs assemble context on the fly by traversing only the relationships relevant to your specific query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in Explainability:&lt;/strong&gt; Every AI decision can be traced back through its relationship path. You can see exactly how the system reached a conclusion, critical for enterprise and regulated environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Intelligence:&lt;/strong&gt; Context Graphs model sequences, dependencies, and cause-and-effect relationships over time, making them ideal for understanding evolving processes and events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Scalability:&lt;/strong&gt; Modern graph databases handle millions of entities while maintaining fast traversal and query performance at scale.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graph vs Knowledge Graph vs Vector Database
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Context Graph&lt;/th&gt;
&lt;th&gt;Knowledge Graph&lt;/th&gt;
&lt;th&gt;Vector Database&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Focus&lt;/td&gt;
&lt;td&gt;Contextual relationships for AI reasoning&lt;/td&gt;
&lt;td&gt;General knowledge representation&lt;/td&gt;
&lt;td&gt;Semantic similarity matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Type&lt;/td&gt;
&lt;td&gt;Multi-hop traversal&lt;/td&gt;
&lt;td&gt;Structured queries&lt;/td&gt;
&lt;td&gt;Nearest neighbor search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Dynamic AI context assembly&lt;/td&gt;
&lt;td&gt;Structured domain knowledge&lt;/td&gt;
&lt;td&gt;Semantic search, &lt;a href="https://dev.to/what-is/retrieval-augmented-generation"&gt;RAG&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explainability&lt;/td&gt;
&lt;td&gt;High (shows relationship paths)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (similarity scores only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Complexity&lt;/td&gt;
&lt;td&gt;Complex multi-step reasoning&lt;/td&gt;
&lt;td&gt;Medium complexity&lt;/td&gt;
&lt;td&gt;Simple similarity queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These technologies complement each other. Many advanced AI systems use Context Graphs for reasoning combined with &lt;a href="https://dev.to/blog/top-5-vector-databases"&gt;vector databases&lt;/a&gt; for semantic search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Context Graph Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Knowledge Management:&lt;/strong&gt; Connect projects, people, decisions, and outcomes across your organization. Instead of finding where files live, trace how work evolved, what decisions shaped results, and who has relevant expertise. This will reduce your knowledge discovery time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Customer Support:&lt;/strong&gt; Go beyond keyword matching. Connect customer history, product configurations, known issues, and documented resolutions to provide contextually accurate answers. This will reduce your ticket resolution time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scientific Research &amp;amp; Discovery:&lt;/strong&gt; Connect millions of research papers, creating networks of studies, methodologies, findings, and citations. Discover unexpected connections between seemingly unrelated fields. You can identify underexplored research areas by analyzing relationship patterns and citation gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance &amp;amp; Risk Management:&lt;/strong&gt; Map relationships between regulations, internal policies, business processes, and controls. When requirements change, trace exactly where those changes affect systems and workflows. This will reduce your compliance audit preparation time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthcare Diagnostics:&lt;/strong&gt; Connect symptoms, medical history, medications, genetic factors, and research findings. Enable diagnostic systems to reason across these relationships and identify conditions that isolated analysis might miss. This will improve diagnostic accuracy by surfacing relevant but non-obvious connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supply Chain Optimization:&lt;/strong&gt; Model your entire supply network, suppliers, components, products, logistics partners, enabling sophisticated scenario analysis and rapid disruption response. For example, when supply issues arise, it will quickly identify alternative suppliers by traversing compatibility, certification, and performance relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal Research &amp;amp; Analysis:&lt;/strong&gt; Map relationships between cases, statutes, legal principles, and precedents. Trace how legal concepts evolved across jurisdictions and time periods. This would reduce legal research time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Recommendations:&lt;/strong&gt; Go beyond "customers who bought this also bought that." Understand topical relationships, creator connections, and contextual relevance to deliver truly personalized recommendations. This would increase engagement through unexpected but relevant discoveries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial Risk Assessment:&lt;/strong&gt; Model relationships between entities, transactions, accounts, and market factors. Detect complex fraud patterns spanning multiple accounts and understand how risks cascade through connected entities. This would detect more fraud patterns than traditional rule-based systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software Development Intelligence:&lt;/strong&gt; Map relationships between functions, modules, dependencies, documentation, and issues. Understand how code changes ripple through your system before making modifications. This would reduce breaking changes through comprehensive impact analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Context Graphs for AI Agents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce AI Hallucinations:&lt;/strong&gt; Ground AI outputs in explicit, verifiable relationships rather than probabilistic pattern matching alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve Reasoning Accuracy:&lt;/strong&gt; When answers require connecting multiple facts across domains, Context Graphs significantly outperform retrieval-only approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable Explainable AI:&lt;/strong&gt; Expose the exact path the AI took through your knowledge graph, making decisions transparent and auditable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale Without Schema Rigidity:&lt;/strong&gt; Add new entity types and relationships without forcing disruptive schema migrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surface Hidden Insights:&lt;/strong&gt; Discover patterns and connections that are nearly impossible to detect in traditional table or document structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintain Context Across Interactions:&lt;/strong&gt; Preserve relationship context throughout multi-turn conversations, enabling more sophisticated AI interactions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Implement Context Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Select Your Graph Database
&lt;/h3&gt;

&lt;p&gt;Choose based on scale, query patterns, and infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Popular Options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j:&lt;/strong&gt; Most mature, enterprise-ready, excellent query language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Neptune:&lt;/strong&gt; Managed AWS service, good for existing AWS infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TigerGraph:&lt;/strong&gt; Best for massive scale and complex analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArangoDB:&lt;/strong&gt; Multi-model database with graph capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FalkorDB:&lt;/strong&gt; Ultra-fast in-memory graph database built on Redis, best for low-latency real-time applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision Factors:&lt;/strong&gt; Query complexity, data volume, team expertise, budget&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Design Your Relationship Schema
&lt;/h3&gt;

&lt;p&gt;The value of a Context Graph depends on modeling the right entities and relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt; Collaborate closely with domain experts who understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What entities matter in your domain&lt;/li&gt;
&lt;li&gt;Which relationships drive important decisions&lt;/li&gt;
&lt;li&gt;How information flows through your processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Schema (Customer Support):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entities:&lt;/strong&gt; Customer, Ticket, Product, Issue, Resolution, Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships:&lt;/strong&gt; reported_by, relates_to, resolved_with, escalated_to, similar_to&lt;/li&gt;
&lt;/ul&gt;
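
&lt;p&gt;A schema like this can also be enforced in code. The sketch below uses the entity and relationship names from the example above (the validation helper itself is hypothetical) to reject edges that do not match the declared types:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative schema check for the customer-support example above.
ENTITY_TYPES = {"Customer", "Ticket", "Product", "Issue", "Resolution", "Agent"}
RELATIONSHIPS = {
    "reported_by":   ("Ticket", "Customer"),
    "relates_to":    ("Ticket", "Product"),
    "resolved_with": ("Ticket", "Resolution"),
    "escalated_to":  ("Ticket", "Agent"),
    "similar_to":    ("Ticket", "Ticket"),
}

def valid_edge(src_type, relation, dst_type):
    """Reject edges that do not match the declared schema."""
    return RELATIONSHIPS.get(relation) == (src_type, dst_type)

ok = valid_edge("Ticket", "reported_by", "Customer")   # matches the schema
bad = valid_edge("Customer", "reported_by", "Ticket")  # direction is reversed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;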

&lt;h3&gt;
  
  
  Step 3: Build Entity Extraction
&lt;/h3&gt;

&lt;p&gt;Identify entities in your source data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Unstructured Text:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NLP pipelines&lt;/li&gt;
&lt;li&gt;Fine-tune LLMs for domain-specific entity recognition&lt;/li&gt;
&lt;li&gt;Implement human-in-the-loop validation for critical entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Structured Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Map existing database fields directly to graph entities&lt;/li&gt;
&lt;li&gt;Normalize entity references across systems&lt;/li&gt;
&lt;/ul&gt;
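
&lt;p&gt;A minimal sketch of the structured-data path, assuming a hypothetical &lt;code&gt;tickets&lt;/code&gt; table: rows map directly to entities and relationships, with a normalization step so different spellings resolve to the same entity:&lt;/p&gt;

```python
# Sketch: mapping rows from a (hypothetical) relational "tickets" table
# directly to graph entities and relationships.
rows = [
    {"ticket_id": 101, "customer_email": "Ada@Example.com", "product": "billing"},
    {"ticket_id": 102, "customer_email": "ada@example.com", "product": "billing"},
]

def normalize_customer(email):
    # Normalize references so both spellings resolve to the same Customer entity.
    return "customer:" + email.strip().lower()

entities, relationships = set(), []
for row in rows:
    ticket = f"ticket:{row['ticket_id']}"
    customer = normalize_customer(row["customer_email"])
    entities.update([ticket, customer, f"product:{row['product']}"])
    relationships.append((ticket, "reported_by", customer))
    relationships.append((ticket, "relates_to", f"product:{row['product']}"))

print(len(entities))  # 4 -- both email variants collapse into one Customer node
```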

&lt;h3&gt;
  
  
  Step 4: Develop Relationship Extraction
&lt;/h3&gt;

&lt;p&gt;Beyond identifying entities, determine how they relate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based:&lt;/strong&gt; Define explicit patterns (if X mentions Y in context Z, create relationship R)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML-based:&lt;/strong&gt; Train models to identify relationship types from text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-based:&lt;/strong&gt; Use large language models for sophisticated relationship inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human validation:&lt;/strong&gt; Review critical relationship paths&lt;/li&gt;
&lt;/ul&gt;
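
&lt;p&gt;The rule-based approach can be sketched with a few explicit patterns (the patterns and relationship types below are illustrative, not from any library):&lt;/p&gt;

```python
import re

# Rule-based relationship extraction: explicit patterns that map text
# mentions to typed relationships.
RULES = [
    (re.compile(r"(\w+) depends on (\w+)"), "depends_on"),
    (re.compile(r"(\w+) was caused by (\w+)"), "caused_by"),
]

def extract_relationships(text):
    found = []
    for pattern, rel_type in RULES:
        for source, target in pattern.findall(text):
            found.append((source, rel_type, target))
    return found

print(extract_relationships("checkout depends on payments. outage was caused by deploy"))
# [('checkout', 'depends_on', 'payments'), ('outage', 'caused_by', 'deploy')]
```

In practice, rule-based extraction like this handles the predictable patterns cheaply, with ML- or LLM-based extraction layered on top for the long tail.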

&lt;h3&gt;
  
  
  Step 5: Enable Real-Time Updates
&lt;/h3&gt;

&lt;p&gt;Context Graphs are living systems requiring continuous updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement event-driven architecture for data changes&lt;/li&gt;
&lt;li&gt;Design incremental update patterns (don't rebuild everything)&lt;/li&gt;
&lt;li&gt;Maintain data lineage for troubleshooting&lt;/li&gt;
&lt;li&gt;Build conflict resolution for concurrent updates&lt;/li&gt;
&lt;/ul&gt;
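
&lt;p&gt;The incremental-update pattern can be sketched as an event handler that mutates only the affected edges and records lineage for troubleshooting (the event fields and origins below are illustrative):&lt;/p&gt;

```python
# Event-driven incremental updates: each change event mutates only the
# affected edges instead of rebuilding the graph, and records its origin.
graph = {("tick-42", "relates_to", "prod-api")}
lineage = []

def apply_event(event):
    """Apply one change event; record where it came from for data lineage."""
    edge = (event["source"], event["relationship"], event["target"])
    if event["op"] == "add":
        graph.add(edge)
    elif event["op"] == "remove":
        graph.discard(edge)
    lineage.append((event["op"], edge, event["origin"]))

apply_event({"op": "add", "source": "tick-42", "relationship": "resolved_with",
             "target": "res-7", "origin": "crm-webhook"})
apply_event({"op": "remove", "source": "tick-42", "relationship": "relates_to",
             "target": "prod-api", "origin": "manual-correction"})
print(len(graph), len(lineage))  # 1 2
```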

&lt;h3&gt;
  
  
  Step 6: Optimize Query Performance
&lt;/h3&gt;

&lt;p&gt;Keep multi-hop queries responsive at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index critical properties used in traversals&lt;/li&gt;
&lt;li&gt;Cache frequent query patterns&lt;/li&gt;
&lt;li&gt;Limit traversal depth for expensive queries&lt;/li&gt;
&lt;li&gt;Denormalize selectively for performance-critical paths&lt;/li&gt;
&lt;li&gt;Use query profiling to identify bottlenecks&lt;/li&gt;
&lt;/ul&gt;
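
&lt;p&gt;The depth-limit guideline can be sketched as a breadth-first traversal that stops expanding past a hop cap (the toy graph below is hypothetical):&lt;/p&gt;

```python
from collections import deque

# A depth-limited traversal over a toy adjacency-list graph -- the
# "limit traversal depth" guideline made concrete.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": [],
    "F": [],
}

def traverse(start, max_hops):
    """Breadth-first traversal that refuses to expand beyond max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # depth cap reached: do not expand further
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

print(sorted(traverse("A", 2)))  # ['A', 'B', 'C', 'D', 'E'] -- 'F' is 3 hops away
```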

&lt;h3&gt;
  
  
  Step 7: Integrate Graph Analytics
&lt;/h3&gt;

&lt;p&gt;Enhance your Context Graph with advanced algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PageRank:&lt;/strong&gt; Identify influential nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Detection:&lt;/strong&gt; Find clusters of related entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path Finding:&lt;/strong&gt; Discover optimal routes through relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Embeddings:&lt;/strong&gt; Enable similarity calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link Prediction:&lt;/strong&gt; Suggest missing relationships&lt;/li&gt;
&lt;/ul&gt;
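
&lt;p&gt;To make the PageRank idea concrete, here is a from-scratch power-iteration sketch on a toy three-node graph (in practice you would use your graph database's built-in algorithms):&lt;/p&gt;

```python
# PageRank by power iteration on a toy graph: a -> b, b -> c, c -> a and c -> b.
# Each node repeatedly distributes its rank to its link targets.
links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
damping, n = 0.85, len(links)
rank = {node: 1.0 / n for node in links}

for _ in range(50):  # iterate until approximately converged
    incoming = {node: 0.0 for node in links}
    for src, targets in links.items():
        share = rank[src] / len(targets)
        for dst in targets:
            incoming[dst] += share
    rank = {node: (1 - damping) / n + damping * incoming[node]
            for node in links}

best = max(rank, key=rank.get)
print(best)  # 'b' -- it receives links from both 'a' and 'c'
```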

&lt;h2&gt;
  
  
  Implementation Challenges &amp;amp; Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;Practical Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graph Construction Complexity&lt;/td&gt;
&lt;td&gt;Building comprehensive graphs requires sophisticated entity and relationship extraction from unstructured data&lt;/td&gt;
&lt;td&gt;Start with a focused domain where you have high-quality structured data. Expand gradually as you build extraction capabilities.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Design Expertise&lt;/td&gt;
&lt;td&gt;Effective schemas demand deep domain understanding; poor design leads to unusable graphs&lt;/td&gt;
&lt;td&gt;Run workshops with subject matter experts. Build iteratively: start simple, refine based on actual query patterns.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance at Scale&lt;/td&gt;
&lt;td&gt;Graph traversals become expensive for complex multi-hop queries as data grows&lt;/td&gt;
&lt;td&gt;Invest in proper indexing, implement query optimization, use caching strategically, and set traversal depth limits (2-4 hops).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity Resolution&lt;/td&gt;
&lt;td&gt;Identifying that different mentions refer to the same entity is difficult but critical for accuracy&lt;/td&gt;
&lt;td&gt;Implement fuzzy matching, leverage unique identifiers where available, use ML-based entity resolution tools, maintain a golden record system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality Maintenance&lt;/td&gt;
&lt;td&gt;As graphs grow to millions of relationships, maintaining accuracy becomes challenging&lt;/td&gt;
&lt;td&gt;Implement automated validation rules, schedule periodic audits, track data lineage, enable user feedback loops for corrections.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration Complexity&lt;/td&gt;
&lt;td&gt;Incorporating Context Graphs into existing systems requires architectural changes and API design&lt;/td&gt;
&lt;td&gt;Build a graph API layer that existing systems can call. Start with read-only integration, add write capabilities once proven.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill Gap&lt;/td&gt;
&lt;td&gt;Shortage of professionals experienced in graph technologies and query languages like Cypher&lt;/td&gt;
&lt;td&gt;Train existing team members (graph databases are learnable, similar to SQL), hire contractors for initial setup, or partner with CloudRaft for implementation guidance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Management&lt;/td&gt;
&lt;td&gt;Context Graphs add infrastructure costs for databases, extraction pipelines, and real-time analytics&lt;/td&gt;
&lt;td&gt;Start with a high-value use case to demonstrate ROI. Scale infrastructure based on actual usage patterns. Monitor cost per query and optimize expensive operations.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Context Graph Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model relationships that drive decisions:&lt;/strong&gt; Don't create relationships just because you can. Focus on connections that enable valuable reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep entity types focused:&lt;/strong&gt; Avoid creating overly granular entity types. Each entity type should represent a meaningful concept in your domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make relationships meaningful:&lt;/strong&gt; Generic relationships like "related_to" provide little value. Use specific relationship types: "depends_on," "caused_by," "replaces."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance normalization and performance:&lt;/strong&gt; Highly normalized graphs are elegant but can be slow. Denormalize strategically for frequently traversed paths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version your schema:&lt;/strong&gt; Graph schemas evolve. Maintain version history and migration paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limit traversal depth:&lt;/strong&gt; Set maximum hops to prevent runaway queries. Most valuable relationships are within 2-4 hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter early:&lt;/strong&gt; Apply constraints as early as possible in your traversal to reduce the working set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use indexed properties:&lt;/strong&gt; Index properties you filter on frequently. This dramatically improves query performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache common patterns:&lt;/strong&gt; Identify frequently executed query patterns and cache results with appropriate TTLs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement validation rules:&lt;/strong&gt; Define constraints on entity properties and relationship validity to maintain quality automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track provenance:&lt;/strong&gt; Know where each entity and relationship came from. This enables troubleshooting and quality assessment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable feedback loops:&lt;/strong&gt; Allow users to report incorrect relationships. Use this feedback to improve extraction pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schedule audits:&lt;/strong&gt; Periodically review graph quality, especially for critical relationship paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graphs + LLMs: A Powerful Combination
&lt;/h2&gt;

&lt;p&gt;Context Graphs and Large Language Models (LLMs) complement each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-Augmented Generation (GAG):&lt;/strong&gt; Retrieve relevant subgraphs from your Context Graph and provide them as structured context to LLMs. This reduces hallucinations and grounds responses in your actual knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-Assisted Graph Construction:&lt;/strong&gt; Use LLMs to extract entities and relationships from unstructured text, building your Context Graph more quickly than rule-based approaches alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainable LLM Reasoning:&lt;/strong&gt; When LLMs generate responses based on graph context, you can trace exactly which relationships influenced the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Retrieval:&lt;/strong&gt; Combine vector search (for semantic similarity) with graph traversal (for relationship reasoning) to get the best of both approaches.&lt;/p&gt;
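
&lt;p&gt;A minimal sketch of hybrid retrieval, assuming toy two-dimensional embeddings and a hand-built edge list: a vector step ranks entities by semantic similarity, then a graph step expands the hits with directly related entities the embedding alone would miss:&lt;/p&gt;

```python
import math

# Hybrid retrieval sketch: cosine similarity over toy embeddings, followed
# by a one-hop graph expansion. All data here is illustrative.
embeddings = {
    "refund policy": [0.9, 0.1],
    "billing error": [0.8, 0.3],
    "login issue": [0.1, 0.9],
}
edges = {"billing error": ["invoice system"], "login issue": ["sso provider"]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def hybrid_retrieve(query_vec, top_k=1):
    # 1) vector step: rank entities by semantic similarity to the query
    ranked = sorted(embeddings, key=lambda e: cosine(query_vec, embeddings[e]),
                    reverse=True)[:top_k]
    # 2) graph step: expand each hit with its directly related entities
    expanded = set(ranked)
    for entity in ranked:
        expanded.update(edges.get(entity, []))
    return expanded

print(sorted(hybrid_retrieve([0.8, 0.3])))  # ['billing error', 'invoice system']
```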

&lt;h2&gt;
  
  
  Measuring Context Graph Success
&lt;/h2&gt;

&lt;p&gt;Track these metrics to assess your Context Graph implementation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response time:&lt;/strong&gt; Median and 95th percentile query latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Queries per second at peak usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate:&lt;/strong&gt; Percentage of queries served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity accuracy:&lt;/strong&gt; Percentage of correctly identified entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationship precision:&lt;/strong&gt; Percentage of relationships that are actually valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage:&lt;/strong&gt; Percentage of domain knowledge captured in the graph&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time saved:&lt;/strong&gt; Reduction in research/discovery time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy improvement:&lt;/strong&gt; Better decision quality from enhanced reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reduction:&lt;/strong&gt; Decreased manual effort for knowledge work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User satisfaction:&lt;/strong&gt; NPS or satisfaction scores for graph-powered features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate:&lt;/strong&gt; Reduction in factually incorrect AI outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning accuracy:&lt;/strong&gt; Percentage of multi-hop questions answered correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Percentage of AI decisions with traceable reasoning paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future of Context Graphs
&lt;/h2&gt;

&lt;p&gt;Context Graphs are evolving rapidly:&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Trends
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph + Vector Hybrid Systems:&lt;/strong&gt; Combining semantic vector search with graph reasoning for more sophisticated AI systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Schema Evolution:&lt;/strong&gt; ML systems that automatically suggest new entity types and relationships based on usage patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Graph Analytics:&lt;/strong&gt; Stream processing for graph updates and real-time pattern detection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Modal Graphs:&lt;/strong&gt; Incorporating images, audio, and video as first-class entities with rich relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Graphs:&lt;/strong&gt; Connecting knowledge graphs across organizational boundaries while maintaining privacy and security.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started with Context Graphs
&lt;/h2&gt;

&lt;p&gt;Ready to implement Context Graphs in your AI systems?&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Small, Think Big
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Identify a high-value use case where relationship reasoning matters&lt;/li&gt;
&lt;li&gt;Map your initial schema with domain experts (10-20 entity types is plenty to start)&lt;/li&gt;
&lt;li&gt;Build a proof of concept with a subset of your data&lt;/li&gt;
&lt;li&gt;Measure impact against your baseline approach&lt;/li&gt;
&lt;li&gt;Iterate and expand based on what you learn&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Common Starting Points
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support:&lt;/strong&gt; Connect tickets, customers, products, and resolutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal knowledge:&lt;/strong&gt; Link documents, projects, people, and decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Map regulations, policies, processes, and controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product development:&lt;/strong&gt; Connect features, dependencies, bugs, and releases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Context Graphs represent a fundamental shift in how AI systems understand and reason about information. By capturing not just data, but the rich network of relationships that gives data meaning, they unlock AI capabilities that were previously unattainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More accurate reasoning through multi-hop traversal&lt;/li&gt;
&lt;li&gt;Explainable decisions via traceable relationship paths&lt;/li&gt;
&lt;li&gt;Reduced hallucinations by grounding in verifiable connections&lt;/li&gt;
&lt;li&gt;Scalable knowledge management without rigid schema constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As AI becomes increasingly central to enterprise operations, Context Graphs will evolve from competitive advantage to foundational infrastructure. Organizations that build graph-based AI capabilities now will be well-positioned to lead in an AI-driven future.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt Context Graphs; it's when and where to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expert Help with Context Graph Implementation
&lt;/h2&gt;

&lt;p&gt;Building Context Graphs requires specialized expertise in graph databases, knowledge representation, and AI integration. CloudRaft provides complimentary AI consultations to help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assess feasibility for your specific use cases&lt;/li&gt;
&lt;li&gt;Design optimal schemas for your domain&lt;/li&gt;
&lt;li&gt;Architect scalable infrastructure that grows with your needs&lt;/li&gt;
&lt;li&gt;Integrate with existing AI systems seamlessly&lt;/li&gt;
&lt;li&gt;Train your team on graph technologies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between a Context Graph and a Knowledge Graph?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context Graphs are specialized knowledge graphs optimized for dynamic context assembly in AI systems. While knowledge graphs broadly represent domain knowledge, Context Graphs focus specifically on enabling AI reasoning through relationship traversal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use Context Graphs with vector databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolutely. Many advanced AI systems use both: vector databases for semantic similarity search and Context Graphs for relationship reasoning. This hybrid approach provides the best of both worlds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much data do I need to start?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can start small. Even a few thousand entities with well-modeled relationships can demonstrate value. Focus on quality relationships over quantity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the typical implementation timeline?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a focused proof of concept: 4-8 weeks. For production-ready implementation: 3-6 months. Timeline depends on data complexity, schema design, and integration requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need specialized graph database skills?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While helpful, they're not mandatory. Graph query languages like Cypher (Neo4j) are learnable, similar to SQL. Consider training existing team members or partnering with experts for initial setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do Context Graphs reduce AI hallucinations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By grounding AI responses in explicit, verifiable relationships rather than relying solely on probabilistic pattern matching from training data. The AI can only traverse relationships that actually exist in your graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the ROI of implementing Context Graphs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It varies by use case, but organizations typically see reductions in knowledge discovery time, improvements in AI reasoning accuracy, and less manual research effort. ROI is highest for knowledge-intensive workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Context Graphs work with my existing databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Context Graphs complement existing databases. You can keep transactional data in relational databases and build Context Graphs for relationship reasoning, syncing data between systems.&lt;/p&gt;

</description>
      <category>contextgraph</category>
    </item>
    <item>
      <title>Real-Time Postgres to ClickHouse CDC: Supercharge Analytics with PeerDB</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 27 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</link>
      <guid>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</guid>
      <description>&lt;p&gt;If you are running a heavy SaaS platform, you eventually hit a wall with PostgreSQL. It's fantastic for transactional data (OLTP), but when you try to run complex analytical queries on millions of rows, things slow down.&lt;/p&gt;

&lt;p&gt;We recently tackled this exact problem for a client handling high-volume messaging operations. Their analytics dashboards ran complex queries against an AWS Aurora PostgreSQL setup, and they needed a solution that was fast, reliable, and real-time.&lt;/p&gt;

&lt;p&gt;Here is how we solved it by building a high-performance replication pipeline from Postgres to ClickHouse using PeerDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" alt="Analytics" width="1918" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ClickHouse?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is the superior choice for analytics because it is a purpose-built OLAP database designed for high-performance data processing, unlike PostgreSQL, which is a row-based OLTP system better suited for transactional workloads. Its columnar storage architecture lets it query massive datasets with sub-second latency where standard Postgres deployments often hit performance walls. By switching to ClickHouse, you gain the ability to ingest millions of rows and execute complex analytical queries almost instantly, overcoming the performance limitations inherent in using PostgreSQL for analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CDC Landscape: Why We Chose PeerDB
&lt;/h2&gt;

&lt;p&gt;Real-time Change Data Capture (CDC) is the standard for moving data without slowing down your primary database. But how do you implement it? Here are the primary CDC options for replicating data from PostgreSQL to ClickHouse that we considered in our implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PeerDB
&lt;/h3&gt;

&lt;p&gt;PeerDB is a specialised tool designed specifically for PostgreSQL to ClickHouse replication. We chose it for its balance of performance and simplicity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It can run as a Docker container stack (PeerDB Server, UI, etc.) and connects directly to the Postgres logical replication slot.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Performance:&lt;/strong&gt; PeerDB was &lt;a href="https://docs.peerdb.io/why-peerdb" rel="noopener noreferrer"&gt;found&lt;/a&gt; to be very performant compared to other solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialised Features:&lt;/strong&gt; It handles initial snapshots (bulk loads) and real-time streaming (CDC) seamlessly. It also supports specific optimisations, such as dividing tables into multiple "mirrors" to speed up initial loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; It avoids the complexity of managing a full Kafka cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Edition Limits:&lt;/strong&gt; The community edition lacks built-in authentication for the UI, so it must be restricted to a private network or VPN, or fronted with an external authentication layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Altinity Sink Connector for ClickHouse
&lt;/h3&gt;

&lt;p&gt;This is a lightweight, single-executable solution often used to avoid the complexity of Kafka. It is developed by Altinity, a major ClickHouse contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It runs as a standalone binary or within a Kafka Connect environment. It connects to Postgres and replicates data to ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Simplicity:&lt;/strong&gt; It eliminates the need for a Kafka Connect cluster or ZooKeeper, running as a single executable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Replication:&lt;/strong&gt; Offers a direct path from Postgres to ClickHouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Schema:&lt;/strong&gt; Can automatically read the Postgres schema and create equivalent ClickHouse tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; We tested this option but rejected it because it did not meet our performance requirements compared to PeerDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Debezium and Kafka
&lt;/h3&gt;

&lt;p&gt;This is the industry-standard approach for general-purpose CDC, involving a chain of distinct, complex components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; Postgres → Debezium (Kafka Connect) → Kafka Broker → ClickHouse Sink → ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; The message broker (Kafka) decouples the source from the destination, allowing multiple consumers to read the same stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Extremely robust for guaranteed message delivery and exactly-once processing (if configured correctly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Complexity:&lt;/strong&gt; Requires managing ZooKeeper, Kafka brokers, and schema registries. Avoiding "Kafka Connect framework complexity" was an explicit goal for us.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; Significant infrastructure footprint compared to direct replication tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why PeerDB?
&lt;/h3&gt;

&lt;p&gt;We initially tested the Altinity connector but ultimately chose PeerDB, mainly for the following reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; In our testing, PeerDB offered superior performance for our specific workload compared to other connectors we tried.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialisation:&lt;/strong&gt; It is purpose-built for Postgres-to-ClickHouse replication, handling data type mapping and initial snapshots smoothly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We opted for a "Keep It Simple" approach to infrastructure. While Kubernetes (EKS) is great, we deployed this on Amazon EC2 to maintain full control over the infrastructure and cost. If you have a team that can handle EKS for you, then that might be a better option. Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;discuss with our team&lt;/a&gt; to find the right solutions for your workload and team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; AWS Aurora (PostgreSQL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Pipeline:&lt;/strong&gt; PeerDB running via Docker Compose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; A ClickHouse cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability Design
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" alt="High Availability Design" width="1920" height="1080"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Altinity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To ensure we never lost data, we configured a ClickHouse cluster with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 Keeper Nodes:&lt;/strong&gt; Using m6i.large instances. These replace ZooKeeper for coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 ClickHouse Server Nodes:&lt;/strong&gt; Using r6i.2xlarge instances for heavy lifting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication:&lt;/strong&gt; We used ReplicatedMergeTree to ensure data exists on multiple nodes for safety&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ClickHouse Cluster
&lt;/h3&gt;

&lt;p&gt;We automated the deployment using Ansible to configure the hardware-aware settings. A cool feature of our setup is that the configuration automatically calculates memory limits and cache sizes based on the EC2 instance's RAM (e.g., leaving 25% for the OS and giving 75% to ClickHouse). We wrote about this earlier in our &lt;a href="https://www.cloudraft.io/blog/building-enterprise-grade-clickhouse-with-ansible" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing PeerDB
&lt;/h3&gt;

&lt;p&gt;We used Docker Compose to spin up the PeerDB stack. One specific nuance we encountered was configuring the storage abstraction. PeerDB uses MinIO (S3 compatible) for intermediate storage. We had to explicitly set the &lt;code&gt;PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3&lt;/code&gt; environment variable in our &lt;code&gt;docker-compose.yml&lt;/code&gt; to point to our MinIO host IP.&lt;/p&gt;
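
&lt;p&gt;A minimal sketch of that override (the service name and MinIO IP are placeholders and may differ in your version of the stack; only the variable name comes from our setup):&lt;/p&gt;

```yaml
# docker-compose.yml (fragment) -- illustrative override, not the full PeerDB stack
services:
  flow-worker:            # placeholder: use the worker service name from your stack
    environment:
      # Point PeerDB's S3-compatible intermediate storage at the MinIO host.
      # 10.0.0.5 is a placeholder for your MinIO host's private IP.
      PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3: "http://10.0.0.5:9000"
```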

&lt;p&gt;Set up the peers to connect with the source and destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the "Mirror"
&lt;/h3&gt;

&lt;p&gt;PeerDB uses a concept called &lt;strong&gt;Mirrors&lt;/strong&gt; to handle the CDC pipeline. We set up the connection by defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Source):&lt;/strong&gt; Our Aurora Postgres instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Destination):&lt;/strong&gt; Our ClickHouse cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mirror:&lt;/strong&gt; The actual replication job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PeerDB supports different streaming modes: log-based (CDC), cursor-based (timestamp or integer), and XMIN-based. In our implementation, we used log-based (CDC) replication.&lt;/p&gt;

&lt;p&gt;To optimise the initial data load, we didn't just dump everything at once. We divided our tables into multiple "batches" (mirrors) that ran in parallel, staggering their start times to avoid putting high load on the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Gotchas" From the Trenches
&lt;/h2&gt;

&lt;p&gt;No migration is perfect. Here are three issues we faced so you can avoid them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The "Too Many Parts" Error in ClickHouse&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ClickHouse loves big batches of data. If PeerDB syncs records one by one or in tiny groups too quickly, ClickHouse can't merge the data parts fast enough in the background. We saw errors like &lt;code&gt;Too many parts... Merges are processing significantly slower than inserts&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You may need to tune the batch size or frequency to slow down the inserts slightly, allowing ClickHouse's merge process to catch up.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Aurora Failovers Break Things&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If AWS Aurora triggers a failover, the IP/DNS resolution might shift. We found that this can break the peering connection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You have to edit the peer configuration to point to the new primary host and resync the mirror.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Security on Community Edition&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We used the community edition of PeerDB. Be aware that it does not have built-in authentication for the UI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; Do not expose the UI to the public internet. We access it over a private IP/VPN; alternatively, add an authentication layer using a third-party product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;By successfully moving analytical queries off the primary Postgres instance and into ClickHouse, we achieved the sub-second query performance our client required. PeerDB provided us with a robust, real-time CDC solution without the operational headache of managing a Kafka cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways on the Postgres + ClickHouse + PeerDB Combination:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; You get the best of both worlds: PostgreSQL handles fast, reliable transactional (OLTP) workloads, while ClickHouse takes on complex analytical (OLAP) queries with unmatched speed. This separation prevents slow analytical queries from impacting your core application database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Simplicity:&lt;/strong&gt; PeerDB acts as a purpose-built, high-performance bridge. It removes the need to deploy and manage a complex, multi-component CDC stack like Debezium and Kafka, significantly reducing infrastructure complexity and operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; This architecture allows your analytics layer (ClickHouse) to scale independently from your transactional layer (Postgres), ensuring that as your data volumes grow, you maintain both OLTP stability and OLAP speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; By offloading analytical processing, you can often run a smaller, more cost-effective Postgres instance dedicated to its core function, while leveraging ClickHouse's efficiency for massive-scale querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Are you looking to improve your analytics pipeline? Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;book a call&lt;/a&gt; with us to discuss your case.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>analytics</category>
      <category>clickhouse</category>
      <category>peerdb</category>
    </item>
    <item>
      <title>Why high performance storage is important for AI Cloud Build</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</link>
      <guid>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</guid>
<description>&lt;p&gt;The AI cloud market is experiencing exceptionally rapid growth worldwide, with the latest reports projecting annual growth rates between 28% and 40% over the next five years and a market size of up to $647 billion by 2030. The surge in AI Cloud adoption, GPU-as-a-service platforms, and enterprise interest in AI “factories” has created new pressures and opportunities for product engineering and IT leaders. Regardless of which public cloud or private cluster you choose, one key differentiator sets each AI and HPC solution apart: the &lt;strong&gt;performance of storage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While leading clouds often use the same GPUs and servers, the way data flows—between compute, network, storage, and persistent layers—determines everything from training speed to scalability. Understanding storage fundamentals will help you architect or select the right solution. We have previously covered &lt;a href="https://www.cloudraft.io/blog/how-to-build-ai-cloud" rel="noopener noreferrer"&gt;how to build AI cloud&lt;/a&gt; solutions, and in this article we draw on our hands-on experience in this space to share our thoughts on storage.&lt;/p&gt;

&lt;p&gt;Business and technology leaders now recognize that real-world AI breakthroughs require infrastructure with high bandwidth, low latency, and extreme parallelism. As deep learning and data-intensive analytics move from labs to production, GPU clusters run ever-larger models on ever-growing datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Storage Matter in AI Workloads?
&lt;/h2&gt;

&lt;p&gt;Storage plays an important role across the entire AI lifecycle. Let’s look at the three major areas: data preparation, training &amp;amp; tuning, and inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Scalable and performant storage to support transforming data for AI use&lt;/li&gt;
&lt;li&gt;Protecting valuable raw and derived training data sets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing large structured and unstructured datasets in many formats&lt;/li&gt;
&lt;li&gt;Scaling under the pressure of map-reduce like distributed processing often used for transforming data for AI&lt;/li&gt;
&lt;li&gt;Support for file and object access protocols to ease integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training &amp;amp; Tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Providing training data to keep expensive GPUs fully utilized&lt;/li&gt;
&lt;li&gt;Saving and restoring model checkpoints to protect training investments&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep training GPU resources busy&lt;/li&gt;
&lt;li&gt;Minimizing time to save checkpoint data to limit training pauses&lt;/li&gt;
&lt;li&gt;Scaling to meet demands of data parallel training in large clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Safely storing and quickly delivering model artifacts for inference services&lt;/li&gt;
&lt;li&gt;Providing data for batch inferencing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reliably storing expensive-to-produce model artifacts&lt;/li&gt;
&lt;li&gt;Minimizing model artifact read latency for quick inference deployment&lt;/li&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep inference GPU resources busy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High-Performance Storage Is Critical to the Checkpointing Process in AI Training
&lt;/h3&gt;

&lt;p&gt;Checkpointing is a critical process in large-scale AI training, enabling models to periodically save and restore their state as training progresses. As model and dataset sizes expand into the billions of parameters and petabytes of data, this operation becomes increasingly demanding for storage infrastructure. Efficient checkpointing helps safeguard training progress against inevitable hardware failures and disruptions, while also allowing for fine-tuning, experimentation, and rapid recovery. However, frequent checkpointing can introduce performance overhead due to pauses in computation and intensive reads/writes to persistent storage, especially when distributed clusters grow to thousands of accelerators.&lt;/p&gt;

&lt;p&gt;To address these challenges, modern AI storage architecture leverages strategies such as asynchronous checkpointing—where checkpoints are saved in the background, minimizing idle time—and hierarchical distribution, reducing bottlenecks by having leader nodes manage data transfers within clusters. The result is faster training throughput, lower risk of lost work, and more efficient use of compute resources. Optimizing for checkpoint size, frequency, and concurrent access patterns is vital to ensure high throughput and low latency, making high-performance scalable storage systems an indispensable foundation for reliable, cost-effective AI model training at scale. You can read more about it in this &lt;a href="https://aws.amazon.com/blogs/storage/architecting-scalable-checkpoint-storage-for-large-scale-ml-training-on-aws/" rel="noopener noreferrer"&gt;AWS article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kind of Storage Is Needed for AI and HPC Workloads?
&lt;/h2&gt;

&lt;p&gt;For AI and HPC workloads, the demands extend well beyond ordinary enterprise storage. Key requirements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel File Systems:&lt;/strong&gt; Multiple servers and GPUs need to access datasets at the same time. Systems such as Lustre, WEKA, VAST Data, CephFS, and DDN Infinia enable concurrent access, avoiding bottlenecks and improving throughput for distributed workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Throughput and Low Latency:&lt;/strong&gt; Training GPT-like models or running simulations generates millions of read/write operations per second. Storage must deliver bandwidth in the tens to hundreds of GB/s and latency below 1ms, so that GPUs remain fed and productive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSIX Compliance:&lt;/strong&gt; Many AI frameworks and HPC applications expect a traditional POSIX interface for seamless operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability and Elasticity:&lt;/strong&gt; Petabyte-scale capacity is the norm. Modern solutions allow you to scale horizontally, adding performance and capacity as demand grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity and Reliability:&lt;/strong&gt; Enterprise-grade AI and HPC workloads need uninterrupted access to their data. Redundancy, fault tolerance, and robust disaster recovery features matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Typical Storage Specifications and Requirements
&lt;/h2&gt;

&lt;p&gt;For modern AI cloud, AI factory, or GPU cloud infrastructure, expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth:&lt;/strong&gt; 15–512 GB/s (or higher for top-tier solutions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOPS:&lt;/strong&gt; From 20,000 (entry) up to 800,000+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Sub-1ms to 2ms for parallel file systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity:&lt;/strong&gt; 100TB to multi-petabyte scale, often with tiering to object storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocols:&lt;/strong&gt; NFSv3/v4.1, SMB, Lustre, S3 (for hybrid and archival storage), HDFS, and native REST APIs&lt;/li&gt;
&lt;/ul&gt;
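&lt;p&gt;The upper end of that bandwidth range can be sanity-checked with simple cluster math; the GPU count and per-GPU read rate below are assumed figures:&lt;/p&gt;

```shell
# Aggregate read bandwidth needed to keep a training cluster fed.
# 256 GPUs and 2 GB/s per GPU are illustrative assumptions, not vendor specs.
GPUS=256
GBPS_PER_GPU=2
echo "required aggregate read bandwidth: $(( GPUS * GBPS_PER_GPU )) GB/s"
```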

&lt;p&gt;On-premises or hybrid deployments may include NVMe storage, CXL-enabled expansion, and advanced cooling for supporting high-density GPU clusters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;AI Lifecycle Stage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Considerations&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reading Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Accommodate wide range of read BW requirements and IO access patterns across different AI models &lt;br&gt; - Deliver large amounts of read BW to single GPU servers for most demanding models&lt;/td&gt;
&lt;td&gt;- Use high performance, all-flash storage to meet needs &lt;br&gt; - Leverage RDMA capable storage protocols, when possible, for most demanding requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saving Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential write bandwidth for quickly saving checkpoints&lt;br&gt; - Handle multiple large sequential write streams to separate files, especially in same directory&lt;/td&gt;
&lt;td&gt;- Understand checkpoint implementation details and behaviors for expected AI workloads&lt;br&gt; - Determine time limits for completing checkpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Restoring Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential read bandwidth for quickly restoring checkpoints &lt;br&gt; - Handle multiple large sequential read streams to same checkpoint file&lt;/td&gt;
&lt;td&gt;- Understand how often checkpoint restoration will be required &lt;br&gt; - Determine acceptable time limits for restoration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Servicing GPU Clusters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Meet performance requirements for mixed storage workloads from multiple simultaneous AI jobs &lt;br&gt; - Scale capacity and performance as GPU clusters grow with business needs&lt;/td&gt;
&lt;td&gt;- Consider scale-out storage platforms that can increase performance and capacity while providing shared access to data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: snia.org - John Cardente Talk&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Options for AI Cloud and HPC Workloads
&lt;/h2&gt;

&lt;p&gt;To achieve next-generation AI and HPC results, enterprises and product teams should evaluate both commercial vendors and open source platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Source Parallel File Systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (CephFS):&lt;/strong&gt; Highly flexible, POSIX-compliant, scales from small clusters to exabytes. Used in academic and commercial AI labs for robust file and object storage. Many early-stage AI factories use solutions built on top of Ceph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lustre / DDN Lustre:&lt;/strong&gt; Optimized for large-scale HPC and AI workloads. Used in many supercomputing and enterprise environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBM Spectrum Scale (GPFS):&lt;/strong&gt; High-performing parallel file system, widely used in science and industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Commercial AI and HPC Storage Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Delivers extreme performance for AI storage, marrying parallel file system performance with the economics of NAS and archive. VAST has been widely adopted by AI Cloud players like &lt;a href="https://www.vastdata.com/customers/coreweave" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt; and Lambda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Highly optimized metadata and file access for AI and multi-tenant clusters; helps overcome bottlenecks experienced in legacy systems. Similar to VAST, WEKA has customers such as Yotta, Cohere, and &lt;a href="http://Together.ai" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Industry leader for research, hybrid file-object storage, and scalable data intelligence for model training and analytics. DDN’s solutions, like Infinia and xFusionAI, focus on both performance and efficiency for GPU workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure Storage, Cloudian, IBM, Dell:&lt;/strong&gt; Also recognized for delivering enterprise-grade AI/HPC storage platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many solutions integrate natively with popular public clouds (AWS S3, Google Cloud Storage, Azure Blob)—enabling hybrid architectures and seamless data movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Examples and Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (Open Source):&lt;/strong&gt; Used by research labs and private cloud teams to build petabyte-scale, resilient storage for AI and HPC clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Enterprise deployments often leverage WEKA for AI factories—a system with hundreds of GPUs running concurrent training jobs—thanks to its elastic scaling and metadata performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Designed to deliver high throughput for both small and large file operations, increasingly chosen for generative AI workloads and data-intensive analytics in fintech, healthcare, and media.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Supports hybrid deployment strategies; offers both parallel file system and object storage in a unified stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems such as Lustre and Spectrum Scale facilitate near-instant recovery, zero-data loss architectures, and compliance for regulated sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying the Best Storage for Your Needs
&lt;/h2&gt;

&lt;p&gt;Because every cloud environment is unique, the first step in designing the right solution is to establish a baseline through hardware benchmarking. MLCommons' benchmarking tools can be run directly on your hardware to gather reliable performance data.&lt;/p&gt;
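&lt;p&gt;Before a full MLPerf Storage run, a quick sequential-read baseline can be taken with a small fio job file; the block size, file size, job count, and mount point below are starting-point assumptions to tune for your hardware:&lt;/p&gt;

```ini
; seq-read.fio - rough single-client sequential-read baseline.
; /mnt/ai-storage is a placeholder for your storage mount point.
[global]
ioengine=libaio
direct=1
bs=1M
size=10G
numjobs=8
iodepth=32
group_reporting
directory=/mnt/ai-storage

[seqread]
rw=read
```

&lt;p&gt;Run it with fio seq-read.fio and compare the reported bandwidth against vendor claims before scaling out to cluster-level benchmarks.&lt;/p&gt;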

&lt;p&gt;The latest MLPerf Storage v2.0 &lt;a href="https://mlcommons.org/benchmarks/storage/" rel="noopener noreferrer"&gt;benchmark results&lt;/a&gt; from MLCommons highlight the increasingly critical role of storage performance in the scalability of AI training systems. With participation nearly doubling compared to the previous v1.0 round, the industry’s rapid innovation is evident—storage solutions now support around twice the number of accelerators as before. The new iteration includes checkpointing benchmarks, which address real-world scenarios faced by large AI clusters, where frequent hardware failures can disrupt training jobs. By simulating such events and evaluating storage recovery speeds, MLPerf Storage v2.0 offers valuable insights into how checkpointing helps ensure uninterrupted performance in sprawling datacenter environments.&lt;/p&gt;

&lt;p&gt;A broad spectrum of storage technologies took part in the benchmark—ranging from local storage, in-storage accelerators, to object stores—reflecting the diversity of approaches in AI infrastructure. Over 200 results were submitted by 26 organizations worldwide, many participating for the first time, which showcases the growing global momentum behind the MLPerf initiative. The benchmarking framework—open-source and rigorously peer-reviewed—provides unbiased, actionable data for system architects, datacenter managers, and software vendors. MLPerf Storage is a go-to resource for designing resilient, high-performance AI training systems in a rapidly evolving technology landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Building Your AI Cloud and HPC Strategy
&lt;/h2&gt;

&lt;p&gt;As the AI Cloud, GPU-as-a-service, and HPC landscape evolves, storage is no longer a background detail—it is the core differentiator for speed, scale, and future innovation. Vendor neutrality empowers you to architect best-of-breed systems, leveraging open-source foundations and integrating commercial solutions where they fit your needs. Every cloud or on-prem cluster will benefit from storage designed for AI and HPC, not just traditional workloads.&lt;/p&gt;

&lt;p&gt;Ready for the next step? If you want to explore options, benchmark solutions, or design an optimized AI/HPC cloud, &lt;a href="https://cal.com/cloudraft/consulting" rel="noopener noreferrer"&gt;book a meeting&lt;/a&gt; with the CloudRaft team. Our experts bring hands-on experience from enterprise projects, migration strategies, and multi-vendor deployments, helping you maximize both infrastructure and business outcomes. Read more about our &lt;a href="https://www.cloudraft.io/ai-cloud-consulting" rel="noopener noreferrer"&gt;offering&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aicloud</category>
      <category>ai</category>
      <category>infrastructure</category>
      <category>storage</category>
    </item>
    <item>
      <title>GitOps: ArgoCD vs FluxCD</title>
      <dc:creator>Unnati Mishra</dc:creator>
      <pubDate>Fri, 02 Aug 2024 10:12:24 +0000</pubDate>
      <link>https://forem.com/cloudraft/gitops-argocd-vs-fluxcd-20a4</link>
      <guid>https://forem.com/cloudraft/gitops-argocd-vs-fluxcd-20a4</guid>
      <description>&lt;h2&gt;
  
  
  Getting Started with GitOps
&lt;/h2&gt;

&lt;p&gt;In the fast-paced world of software development, organizations are constantly seeking ways to streamline processes and improve efficiency through automation. The shift from waterfall models to hyper-agile methodologies, coupled with the adoption of microservices architecture, has led to much faster software releases. GitOps has emerged as a powerful approach to enable this rapid deployment cycle, implementing a &lt;a href="https://kubernetes.io/docs/concepts/architecture/controller/" rel="noopener noreferrer"&gt;control-loop&lt;/a&gt; pattern often seen in Kubernetes.&lt;/p&gt;

&lt;p&gt;GitOps offers a more consistent and reliable way to handle infrastructure and deployment. In this blog, we'll explore what GitOps is, why it's gaining popularity among DevOps teams, and take a closer look at popular GitOps tools like Argo CD and Flux CD.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GitOps?
&lt;/h2&gt;

&lt;p&gt;GitOps, a combination of 'Git' and 'Operations', is an approach to continuous deployment for cloud-native applications. It uses Git as the single source of truth for declarative infrastructure and applications. This means storing and managing all configuration files that describe how our application should be deployed and run in Git repositories.&lt;/p&gt;

&lt;p&gt;The core principle of GitOps is treating everything - from application code to infrastructure - as code that can be version-controlled and managed using Git. When changes are needed, instead of manually executing commands or scripts, we make changes to our Git repository. A controller then detects these changes and applies them to our infrastructure.&lt;/p&gt;
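&lt;p&gt;The control loop can be caricatured in a few lines of shell; this sketch only prints the commands a naive poll-based reconciler would run (real controllers such as Argo CD and Flux use Kubernetes watches and drift detection rather than polling), and the repository path is a placeholder:&lt;/p&gt;

```shell
# Naive GitOps reconcile step - illustration only, not production code.
# Prints, rather than runs, the commands a poll-based reconciler would issue.
reconcile() {
  local REPO_DIR=$1
  echo "git -C ${REPO_DIR} pull --quiet"          # refresh the source of truth
  echo "kubectl apply -f ${REPO_DIR}/manifests/"  # converge toward desired state
}

# A controller would invoke this in a loop or on a watch/webhook event:
reconcile /repo
```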

&lt;h2&gt;
  
  
  Benefits of GitOps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency and Reliability&lt;/strong&gt;: With GitOps, the entire system configuration is stored in version control, providing a clear, auditable record of what should be deployed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster Recovery and Easier Rollbacks&lt;/strong&gt;: In case of issues, rolling back to a previous state is as simple as reverting to a previous commit in the Git history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;: Git's central point of control allows for strict access controls and enforced code reviews before changes are applied.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved Developer Experience&lt;/strong&gt;: Developers can use familiar Git workflows to manage infrastructure, bridging the gap between development and operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visibility and Traceability&lt;/strong&gt;: All changes are recorded in Git, providing a clear record of who changed what and when.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Automation&lt;/strong&gt;: Pushing changes to Git can automatically trigger deployments, reducing manual work and speeding up processes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment Consistency&lt;/strong&gt;: GitOps makes it easier to maintain consistency between different environments (development, staging, production).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Productivity&lt;/strong&gt;: DORA's research suggests teams can ship 30-100 times more changes per day, increasing overall development output by 2-3 times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability&lt;/strong&gt;: With all configuration data in Git, organizations can easily deploy the same Kubernetes platform across different environments, leading to better availability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Argo CD vs Flux CD
&lt;/h2&gt;

&lt;p&gt;When implementing GitOps for Kubernetes, two popular tools stand out: Argo CD and Flux CD. Both are excellent choices, but they have some differences. Here's a comparison of their features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Argo CD&lt;/th&gt;
&lt;th&gt;Flux CD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes-native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI&lt;/td&gt;
&lt;td&gt;Rich web-based UI&lt;/td&gt;
&lt;td&gt;Capacitor GUI dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-tenancy&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Via Helm Operator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kustomize support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync Mechanism&lt;/td&gt;
&lt;td&gt;Automatic sync&lt;/td&gt;
&lt;td&gt;Controller-based sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback capabilities&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health status&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Relies on Kubernetes status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image Updater&lt;/td&gt;
&lt;td&gt;Add-on&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced Deployment Strategies&lt;/td&gt;
&lt;td&gt;Integrated with Argo rollouts&lt;/td&gt;
&lt;td&gt;Supported via Flagger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Comparison between ArgoCD and FluxCD&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the next section, we will have a look at a quick demo of Argo CD and Flux CD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Argo CD
&lt;/h2&gt;

&lt;p&gt;In this quick demo of Argo CD, we will go through the step-by-step process of installing Argo CD on a Kubernetes cluster. We will use Argo CD to deploy a sample guestbook application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubectl installed and configured&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A configured Git repository&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Argo CD Installation
&lt;/h3&gt;

&lt;p&gt;To install Argo CD, we need to have a Kubernetes cluster and kubectl installed and configured. You can check out the guide to install kubectl &lt;a href="https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Create a namespace for Argo CD
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Install Argo CD
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Access the Argo CD API server
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Port-forward the Argo CD server service
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/argocd-server &lt;span class="nt"&gt;-n&lt;/span&gt; argocd 8080:443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get the initial password of the admin user to authenticate
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get secret argocd-initial-admin-secret &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"{.data.password}"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this password to log into the Argo CD UI with the username admin at the forwarded port on localhost; in this example, it is &lt;a href="https://localhost:8080" rel="noopener noreferrer"&gt;https://localhost:8080&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hvao7j5i4hi500xnnfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hvao7j5i4hi500xnnfj.png" alt="Argo CD UI" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy a sample application - guestbook
&lt;/h3&gt;

&lt;p&gt;To deploy an app, we need to create an Application object. The spec includes the source of the Kubernetes manifests, the destination Kubernetes cluster and namespace, and the sync policy. You can also provide image updater configuration via annotations; in this example, we are not using an image updater.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;  
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;  
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;  
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;  
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;  
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/argoproj/argocd-example-apps.git&lt;/span&gt;  
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;  
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;  
    &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;  
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;  
    &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
    &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="pi"&gt;-&lt;/span&gt;  &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create the application
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; application.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futebr1mfm8wovweajz2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futebr1mfm8wovweajz2o.png" alt="Argo CD UI" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After applying the Argo CD application, the Argo CD controller will automatically monitor and apply the changes in the cluster. You can monitor this from the UI or by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get apps &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Flux CD
&lt;/h2&gt;

&lt;p&gt;In this demo of Flux CD, we will walk through its installation. We will use Flux CD to deploy the ‘fleet-infra’ repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes Cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub Personal Access Token. If you need help generating GitHub token check out this &lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens" rel="noopener noreferrer"&gt;guide&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Objectives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bootstrap Flux CD on a Kubernetes cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy a sample application using Flux.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customize the application configuration through Kustomize patches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Install the Flux CLI
&lt;/h4&gt;

&lt;p&gt;The Flux command-line interface (CLI) is used to bootstrap and interact with Flux CD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://fluxcd.io/install.sh | &lt;span class="nb"&gt;sudo &lt;/span&gt;bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Export Your Credentials
&lt;/h4&gt;

&lt;p&gt;Export your GitHub personal access token and username.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export  &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-token&amp;gt;
&lt;span class="nb"&gt;export  &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check Your Kubernetes Cluster
&lt;/h4&gt;

&lt;p&gt;Ensure your cluster is ready for Flux by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux check &lt;span class="nt"&gt;--pre&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Flux Installation
&lt;/h3&gt;

&lt;p&gt;To bootstrap using a GitHub repository, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    flux bootstrap github &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GITHUB_USER&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--repository&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;fleet-infra &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./clusters/my-cluster &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--personal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Clone the Git Repository
&lt;/h3&gt;

&lt;p&gt;Clone the fleet-infra repository to your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    git clone https://github.com/&lt;span class="nv"&gt;$GITHUB_USER&lt;/span&gt;/fleet-infra  
    &lt;span class="nb"&gt;cd &lt;/span&gt;fleet-infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add podinfo Repository to Flux
&lt;/h3&gt;

&lt;p&gt;Create a GitRepository manifest pointing to the master branch of the podinfo repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    flux create &lt;span class="nb"&gt;source &lt;/span&gt;git podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://github.com/stefanprodan/podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;master &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--export&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./clusters/my-cluster/podinfo-source.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
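
&lt;p&gt;For reference, the exported podinfo-source.yaml should look roughly like this (the exact apiVersion may vary with your Flux version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: master
  url: https://github.com/stefanprodan/podinfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;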



&lt;p&gt;Commit and push the podinfo-source.yaml file to the fleet-infra repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    git add &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add podinfo GitRepository"&lt;/span&gt;  
    git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy podinfo Application
&lt;/h3&gt;

&lt;p&gt;Create a Kustomization manifest to deploy the podinfo application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    flux create kustomization podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--target-namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./kustomize"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--wait&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--retry-interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--health-check-timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--export&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./clusters/my-cluster/podinfo-kustomization.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
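
&lt;p&gt;The exported podinfo-kustomization.yaml should look roughly like this (again, the exact apiVersion may vary with your Flux version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 30m0s
  path: ./kustomize
  prune: true
  retryInterval: 2m0s
  sourceRef:
    kind: GitRepository
    name: podinfo
  targetNamespace: default
  timeout: 3m0s
  wait: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;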



&lt;p&gt;Commit and push the podinfo-kustomization.yaml file to the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add podinfo Kustomization"&lt;/span&gt;  
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Watch Flux Sync the Application
&lt;/h3&gt;

&lt;p&gt;Use the flux get command to watch the podinfo app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux get kustomizations &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify the Deployment
&lt;/h3&gt;

&lt;p&gt;Check if podinfo has been deployed on your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get deployments,services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GitOps best practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Git Workflows: Keep application code repositories separate from GitOps configuration repositories, and avoid long-lived branches for different environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplify your Kubernetes files: Use tools like Kustomize and Helm to make your Kubernetes manifests simpler and easier to manage; used together, they help you avoid repetition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handle secrets carefully: Do not store passwords or secrets directly in Git, even in encrypted form. Instead, use tools that fetch secrets when needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separate Build and Deployment Processes: Let your CI system build and test your application, and let GitOps handle deploying the resulting artifact to your clusters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
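
&lt;p&gt;To illustrate the secrets guidance above, here is a minimal sketch of an ExternalSecret resource (it assumes the External Secrets Operator is installed and a SecretStore named &lt;code&gt;vault-backend&lt;/code&gt; already exists); only this reference lives in Git, and the actual value is fetched from the backing store at sync time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-credentials
  data:
    - secretKey: db-password
      remoteRef:
        key: prod/app
        property: db-password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;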

&lt;h2&gt;
  
  
  Ephemeral Environments using GitOps
&lt;/h2&gt;

&lt;p&gt;Ephemeral environments, also known as preview environments, are short-lived environments that allow developers to test and preview changes in a production-like environment before merging them into the main branch.&lt;/p&gt;

&lt;p&gt;These environments are typically created automatically when a pull request is opened and destroyed when the pull request is closed.&lt;/p&gt;

&lt;p&gt;In the context of Kubernetes, tools like Argo CD and Flux CD can automate the creation and management of ephemeral environments, making it easier to implement this practice in a GitOps workflow. For more information on how to implement preview environments on Kubernetes with Argo CD, check out this guide by &lt;a href="https://piotrminkowski.com/2023/06/19/preview-environments-on-kubernetes-with-argocd/" rel="noopener noreferrer"&gt;Piotr Minkowski&lt;/a&gt;.&lt;/p&gt;
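
&lt;p&gt;With Argo CD, one common implementation is an ApplicationSet using the pull request generator; the sketch below (owner, repo, and path are placeholders) creates an Application per open pull request and removes it when the PR closes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-envs
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: my-org
          repo: my-app
        requeueAfterSeconds: 300
  template:
    metadata:
      name: 'preview-{{number}}'
    spec:
      project: default
      source:
        repoURL: 'https://github.com/my-org/my-app.git'
        targetRevision: '{{head_sha}}'
        path: manifests
      destination:
        server: 'https://kubernetes.default.svc'
        namespace: 'preview-{{number}}'
      syncPolicy:
        automated: {}
        syncOptions:
          - CreateNamespace=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;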

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GitOps is a game-changer for managing infrastructure and applications. It boosts consistency, reliability, collaboration, and workflow. Tools like Argo CD and Flux CD exemplify how GitOps streamlines deployment and enhances efficiency. Our comparison shows the strengths and specific use cases of both tools, highlighting how they make GitOps implementation seamless and effective.&lt;/p&gt;

</description>
      <category>gitops</category>
      <category>devops</category>
    </item>
    <item>
      <title>K3s vs Talos Linux</title>
      <dc:creator>Unnati Mishra</dc:creator>
      <pubDate>Mon, 22 Jul 2024 10:40:46 +0000</pubDate>
      <link>https://forem.com/cloudraft/k3s-vs-talos-linux-2dg1</link>
      <guid>https://forem.com/cloudraft/k3s-vs-talos-linux-2dg1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the world of Kubernetes, choosing the right technology can make a big difference in how smoothly and efficiently our applications run. This is where focused Kubernetes distributions like K3s and Talos Linux stand out.&lt;/p&gt;

&lt;p&gt;From large data centers to smaller devices on the edge, Kubernetes plays an important role in managing applications across various environments. As many businesses now run AI workloads on Kubernetes at the edge, specialized distributions like K3s and Talos have emerged to tackle the resulting operational challenges.&lt;/p&gt;

&lt;p&gt;K3s is known for being lightweight and easy to install, which makes it great for places with limited resources like edge computing and IoT. Meanwhile, Talos provides a more secure environment and is used for large-scale setups.&lt;/p&gt;

&lt;p&gt;In this blog, we will discuss how K3s and Talos fit into Kubernetes deployments and the differences between the two. This will help you make the perfect choice based on your needs and goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is K3s?
&lt;/h2&gt;

&lt;p&gt;K3s was developed by Rancher Labs and donated to the CNCF. It is packaged as a single binary of less than 40 MB, which reduces the dependencies and steps needed to install, run, and auto-update a production Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;It is a lightweight yet powerful Kubernetes distribution designed for production workloads across IoT devices or resource-restrained remote locations. The main aim of K3s is to streamline the installation and management of Kubernetes clusters. It is easy to install and highly available.&lt;/p&gt;
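
&lt;p&gt;For instance, a single-node K3s server can be installed with the official convenience script (run on the node; requires root):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -sfL https://get.k3s.io | sh -
# verify that the node has registered
sudo k3s kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;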

&lt;h3&gt;
  
  
  How is K3s different from Kubernetes?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;K3s is lightweight compared to the full distribution of Kubernetes.&lt;/li&gt;
&lt;li&gt;It has fewer dependencies.&lt;/li&gt;
&lt;li&gt;It is easier to deploy and manage.&lt;/li&gt;
&lt;li&gt;It uses fewer resources (i.e. CPU, RAM, etc).&lt;/li&gt;
&lt;li&gt;It has fewer built-in features and extensions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;K3s is ideal for smaller resource-constrained deployments, edge computing, and IoT while Kubernetes is more suited for large, complex deployments that have high resource requirements such as big data, machine learning, and high-performance computing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Talos Linux?
&lt;/h2&gt;

&lt;p&gt;Talos Linux is a modern Linux operating system distribution, written in Golang, that has specifically been built for the purpose of Kubernetes infrastructure. It has been designed to serve as the foundation for Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;In Talos, the cluster is accessed through APIs, which removes the need for SSH access and thereby shrinks the attack surface. It also helps avoid unexpected issues by creating an immutable layer on top of physical servers, ensuring that all servers are identical and share the same setup. Since it is API-managed, operations are automated, straightforward, and scalable.&lt;/p&gt;

&lt;p&gt;You can read more about Talos in &lt;a href="https://dev.to/blog/making-kubernetes-simple-with-talos"&gt;this post&lt;/a&gt;.&lt;/p&gt;
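
&lt;p&gt;To make the API-driven model concrete, a typical Talos bootstrap looks roughly like this (node IPs are placeholders; the machines must already be booted from a Talos image):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# generate machine configuration for a cluster named 'demo'
talosctl gen config demo https://10.0.0.10:6443
# apply the config to the first control-plane node
talosctl apply-config --insecure --nodes 10.0.0.10 --file controlplane.yaml
# bootstrap etcd, then retrieve a kubeconfig for kubectl access
talosctl bootstrap --nodes 10.0.0.10 --endpoints 10.0.0.10
talosctl kubeconfig --nodes 10.0.0.10 --endpoints 10.0.0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;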

&lt;h2&gt;
  
  
  The Differences between K3s and Talos Linux
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Talos Linux&lt;/th&gt;
&lt;th&gt;K3s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Small in size&lt;/td&gt;
&lt;td&gt;Medium in size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OS For running the Kubernetes cluster&lt;/td&gt;
&lt;td&gt;Lightweight Kubernetes distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Installation and Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More complex setup, though it can be simplified&lt;/td&gt;
&lt;td&gt;Simple setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal, immutable OS; no SSH access or shell; API-driven configuration and management&lt;/td&gt;
&lt;td&gt;Lightweight, single-binary; integrates container runtime, networking, and storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Has a strong focus on security with an immutable file system, no interactive login (SSH), and API-driven interactions&lt;/td&gt;
&lt;td&gt;Follows essential security practices like RBAC, TLS encryption, automatic updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Requirements&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires sufficient resources for effective Kubernetes operation; not for resource-constrained environments&lt;/td&gt;
&lt;td&gt;Low resource requirements; suitable for low-power devices like IoT and edge devices.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports scalable Kubernetes clusters in production environments; handles large-scale deployments&lt;/td&gt;
&lt;td&gt;Supports clustering and high availability; generally used for smaller-scale deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Management and Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed through APIs; automated management with minimal manual intervention; less frequent maintenance and patching due to immutable infrastructure&lt;/td&gt;
&lt;td&gt;Simplified management with standard Kubernetes tools and interfaces; easy to update and maintain; suitable for environments requiring ease of management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Community and Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growing community focused on security and production-grade deployments; strong documentation, community forums, and resources.&lt;/td&gt;
&lt;td&gt;Active community backed by Rancher Labs (part of SUSE); extensive documentation, community support, and commercial support options available through Rancher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Usage of K3s and Talos Linux
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used for lightweight and resource-constrained environments.&lt;/li&gt;
&lt;li&gt;Perfect for edge computing, IoT, and development and testing scenarios.&lt;/li&gt;
&lt;li&gt;Helps in easy management and faster deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Talos Linux&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good fit for edge devices due to its security, reliability, and immutable design.&lt;/li&gt;
&lt;li&gt;An excellent option for deploying Kubernetes on bare-metal servers.&lt;/li&gt;
&lt;li&gt;Highly suitable for enterprise-level Kubernetes clusters.&lt;/li&gt;
&lt;li&gt;Supports cloud and virtualization platforms as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between K3s and Talos Linux hinges on their specific use cases and future needs. It can be observed that the demand for lightweight Kubernetes is rising significantly. Industries have started to embrace edge computing, IoT, and other resource-constrained environments, making the ability to efficiently manage applications with minimal infrastructure of extreme importance.&lt;/p&gt;

&lt;p&gt;As the demand for lightweight and efficient Kubernetes solutions grows, K3s is set to play a crucial role in enabling seamless, scalable application management in resource-limited environments. Meanwhile, Talos Linux will continue to be a robust choice for enterprises prioritizing security and reliability.&lt;/p&gt;

&lt;p&gt;To conclude, the choice between K3s and Talos Linux should be guided by specific deployment needs, resource availability, and security considerations. Organizations can effectively meet their Kubernetes deployment goals by understanding the strengths of each and choosing accordingly.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Expert Guide on Selecting Observability Products</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 13 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</link>
      <guid>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</guid>
      <description>&lt;h2&gt;
  
  
  Guide to select Observability tools and products
&lt;/h2&gt;

&lt;p&gt;In today's digital landscape, businesses are constantly striving to stay ahead of the curve. The ability to deliver exceptional customer experiences, maintain system reliability, and optimize performance has become a crucial differentiator. Enter observability – the linchpin of modern IT operations that empowers organizations to achieve operational excellence, drive cost-efficiency, and continuously enhance their services.&lt;/p&gt;

&lt;p&gt;The rise of cloud-native architectures has revolutionized the way applications are built and deployed. These modern systems leverage dynamic, virtualized infrastructure to provide unparalleled flexibility and automation. By enabling on-demand scaling and global accessibility, cloud-native approaches have become a catalyst for innovation and agility in the business world.&lt;/p&gt;

&lt;p&gt;However, this shift brings new challenges. Unlike traditional monolithic systems, cloud-native applications are composed of numerous microservices distributed across various teams, platforms, and geographic locations. This decentralized nature makes it increasingly complex to monitor and maintain system health effectively.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the essential characteristics of a robust observability solution and provide guidance on selecting the right tools to meet your organization's unique needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evolution in Observability Space
&lt;/h2&gt;

&lt;p&gt;The evolution of observability over the last two decades has been characterized by significant technological advancements and changing industry needs. Let's explore this journey in more detail:&lt;/p&gt;

&lt;p&gt;In the early 2000s, observability faced its first major challenge with the explosion of log data. Organizations struggled with a lack of comprehensive solutions for instrumenting, generating, collecting, and visualizing this information. This gap in the market led to the rise of Splunk, which quickly became a dominant player by offering robust log management capabilities. As the decade progressed, the rapid growth of internet-based services and distributed systems introduced new complexities. This shift necessitated more sophisticated Application Performance Management (APM) solutions, paving the way for industry leaders like DynaTrace, New Relic, and AppDynamics to emerge and address these evolving needs.&lt;/p&gt;

&lt;p&gt;The dawn of the 2010s brought about a paradigm shift with the advent of microservices architecture and cloud computing. These technologies dramatically increased the complexity of IT environments, creating a demand for observability solutions that prioritized developer experience. This wave saw the birth of innovative platforms such as DataDog, Grafana, Sentry, and &lt;a href="https://www.cloudraft.io/prometheus-consulting" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, each offering unique approaches to monitoring and visualizing system performance. As we moved into the latter half of the decade, the industry faced a new challenge: skyrocketing observability costs due to the massive ingestion of Metrics, Events, Logs, and Traces (MELT). While monitoring capabilities had greatly improved, debugging remained a largely manual and time-consuming process, especially in the face of increasingly complex Kubernetes and serverless architectures. Some products like Datadog, Grafana, SigNoz, &lt;a href="https://www.cloudraft.io/blog/cloudraft-kloudmate-partnership" rel="noopener noreferrer"&gt;KloudMate&lt;/a&gt;, Honeycomb, Kloudfuse, &lt;a href="https://www.cloudraft.io/thanos-support" rel="noopener noreferrer"&gt;Thanos&lt;/a&gt;, Coroot, and VictoriaMetrics tackled these new challenges head-on.&lt;/p&gt;

&lt;p&gt;The early to mid-2020s have ushered in a new era of observability, characterized by innovative approaches to data storage and analysis. Industry standards like OpenTelemetry have gained widespread adoption, and products are now aligning with this standard. To optimize costs, observability pipelines are being used to filter and route data to various backends, automatically handling high cardinality data that was often a pain point at scale. We've also seen the adoption of high-performance databases like &lt;a href="https://www.cloudraft.io/clickhouse-consulting" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt; for &lt;a href="https://clickhouse.com/use-cases/logging-and-metrics" rel="noopener noreferrer"&gt;monitoring purposes&lt;/a&gt;, often becoming the backend of choice for observability products. The emergence of eBPF technology has provided deep insights into system performance and inter-entity relationships. Due to the increased adoption of the Rust programming language for its high performance, some observability tools such as Vector and various agents have become lightweight and more efficient, allowing for further scalability. Products like Quickwit (&lt;a href="https://quickwit.io/blog/quickwit-binance-story" rel="noopener noreferrer"&gt;see how Binance is storing 100PB logs&lt;/a&gt;) have introduced cost-effective and scalable solutions for storing logs and metrics directly on object storage. Perhaps most significantly, we're witnessing the integration of artificial intelligence into observability tools, enabling causal analysis and faster problem resolution. This AI-driven approach is helping organizations quickly narrow down issues in their increasingly complex environments, marking a new frontier in the observability landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systems are getting Complex
&lt;/h2&gt;

&lt;p&gt;In the realm of modern, distributed systems, traditional monitoring approaches fall short. These conventional methods rely on predetermined failure scenarios, which prove inadequate when dealing with the intricate, interconnected nature of today's cloud-based architectures. The unpredictability of these complex systems demands a more sophisticated approach to observability.&lt;/p&gt;

&lt;p&gt;Enter the new generation of cloud monitoring tools. These advanced solutions are designed to navigate the labyrinth of distributed systems, drawing connections between seemingly disparate data points without the need for explicit configuration. Their power lies in their ability to uncover hidden issues and correlate information across various contexts, providing a holistic view of system health.&lt;/p&gt;

&lt;p&gt;Consider this scenario: a user reports an error in a mobile application. In a world of microservices, pinpointing the root cause can be like finding a needle in a haystack. However, with these cutting-edge monitoring tools, engineers can swiftly trace the issue back to its origin, even if it's buried deep within one of countless backend services. This capability not only accelerates root cause analysis but also significantly reduces mean time to resolution (MTTR).&lt;/p&gt;

&lt;p&gt;But the benefits don't stop at troubleshooting. These tools can play a crucial role in refining deployment strategies. By providing real-time feedback on new rollouts, they enable more sophisticated deployment techniques such as canary releases or blue-green deployments. This proactive approach allows for automatic rollbacks of problematic changes, mitigating potential issues before they impact end-users.&lt;/p&gt;

&lt;p&gt;As the cloud-native landscape continues to evolve, selecting the right monitoring stack becomes paramount. To maximize the benefits of modern observability, it's crucial to choose a solution that not only meets your current needs but also aligns with your future goals and the ever-changing demands of cloud-based architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Essential Features of Robust Observability Solutions
&lt;/h2&gt;

&lt;p&gt;In today's complex digital landscapes, selecting the right observability tools is crucial. Let's explore the key attributes that make an observability solution truly effective and aligned with observability best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Holistic Monitoring Capabilities
&lt;/h3&gt;

&lt;p&gt;A comprehensive observability platform should adeptly handle the four pillars of telemetry data, collectively known as MELT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: Quantitative indicators of system health, such as CPU utilization&lt;/li&gt;
&lt;li&gt;Events: Significant system occurrences or state changes&lt;/li&gt;
&lt;li&gt;Logs: Detailed records of system activities and operations&lt;/li&gt;
&lt;li&gt;Traces: Request pathways through the system, illuminating performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ideal solution seamlessly integrates these data types, providing a cohesive view of your system's health.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Data Analysis and Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Modern systems often exhibit unpredictable behavior patterns, rendering static alert thresholds ineffective. Advanced observability tools employ machine learning to detect anomalies without explicit configuration, while still allowing for customization. By correlating anomalies across various telemetry types, these systems can perform automated root cause analysis, significantly reducing troubleshooting time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sophisticated Alerting and Incident Management
&lt;/h3&gt;

&lt;p&gt;Real-time alerting is the backbone of effective observability. A top-tier solution should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert on both customizable thresholds and AI-detected anomalies&lt;/li&gt;
&lt;li&gt;Consolidate related alerts into actionable incidents&lt;/li&gt;
&lt;li&gt;Enrich incidents with contextual data, runbooks, and team information&lt;/li&gt;
&lt;li&gt;Intelligently route incidents to appropriate personnel&lt;/li&gt;
&lt;li&gt;Trigger automated remediation workflows when applicable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To combat alert fatigue, the system should employ intelligent alert suppression, prioritization, and escalation mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Driven Insights
&lt;/h3&gt;

&lt;p&gt;Analytics derived from telemetry data drive continuous improvement. Key metrics to track include Mean Time to Repair (MTTR), Mean Time to Acknowledge (MTTA), and various Service Level Objectives (SLOs). These insights facilitate post-incident analysis, helping teams prevent future issues and optimize system performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extensive Integration Ecosystem
&lt;/h3&gt;

&lt;p&gt;A versatile observability solution should seamlessly integrate with your entire tech stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular programming languages and frameworks&lt;/li&gt;
&lt;li&gt;Open-source standards (OpenTelemetry, OpenMetrics, StatsD)&lt;/li&gt;
&lt;li&gt;Container orchestration platforms (Docker, Kubernetes)&lt;/li&gt;
&lt;li&gt;Security tools for vulnerability scanning&lt;/li&gt;
&lt;li&gt;Incident management systems&lt;/li&gt;
&lt;li&gt;CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Major cloud platforms&lt;/li&gt;
&lt;li&gt;Team collaboration tools&lt;/li&gt;
&lt;li&gt;Business intelligence platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Optimization
&lt;/h3&gt;

&lt;p&gt;As applications grow in scale and complexity, managing observability costs becomes challenging. Look for tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify underutilized resources and forecast future needs&lt;/li&gt;
&lt;li&gt;Employ intelligent data sampling and retention policies&lt;/li&gt;
&lt;li&gt;Efficiently handle high-cardinality data&lt;/li&gt;
&lt;li&gt;Utilize cutting-edge technologies like eBPF for improved performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Intuitive User Experience
&lt;/h3&gt;

&lt;p&gt;An observability platform's UI/UX is critical for efficient debugging and insight gathering. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear visualizations of system components and their relationships&lt;/li&gt;
&lt;li&gt;Pre-configured dashboards for common scenarios&lt;/li&gt;
&lt;li&gt;Easy integration with your existing stack&lt;/li&gt;
&lt;li&gt;Comprehensive, user-friendly documentation&lt;/li&gt;
&lt;li&gt;Ability to slice and dice visualizations and fast response time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Simplicity
&lt;/h3&gt;

&lt;p&gt;Scaling observability across an organization can be daunting. Look for platforms that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support "everything-as-code" for standardization and version control&lt;/li&gt;
&lt;li&gt;Integrate smoothly with modern application platforms&lt;/li&gt;
&lt;li&gt;Offer automation-friendly interfaces&lt;/li&gt;
&lt;li&gt;Provide tools for managing observability at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost-Effective Data Management
&lt;/h3&gt;

&lt;p&gt;As data volumes grow, intelligent data lifecycle management becomes crucial. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tiered storage for different data types&lt;/li&gt;
&lt;li&gt;Advanced compression and deduplication techniques&lt;/li&gt;
&lt;li&gt;Intelligent data sampling strategies&lt;/li&gt;
&lt;li&gt;Efficient handling of high-cardinality data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Alignment with Industry Standards
&lt;/h3&gt;

&lt;p&gt;Choosing tools that support industry-standard protocols and frameworks (like OpenTelemetry, PromQL, and Grafana) ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier integration with existing systems&lt;/li&gt;
&lt;li&gt;Vendor-independent implementations&lt;/li&gt;
&lt;li&gt;Flexibility to change backends without code modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Organizational Fit
&lt;/h3&gt;

&lt;p&gt;When selecting an observability solution, consider your organization's unique needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System complexity and scale&lt;/li&gt;
&lt;li&gt;User base characteristics&lt;/li&gt;
&lt;li&gt;Budget constraints&lt;/li&gt;
&lt;li&gt;Team skills and expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prioritize platforms that cover your full stack, tying surface-level symptoms to root causes. Ensure the chosen solution integrates seamlessly with your current tech stack, DevSecOps processes, and team workflows. The ideal observability solution balances comprehensive insights with practical considerations, providing a powerful yet feasible tool for your organization's needs. Ideally, you want one tool, or a small set of tools, effective enough to justify its cost while minimizing context switching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Selecting the ideal observability solution is a nuanced process that demands a deep understanding of your organization's unique ecosystem. It's not just about collecting data; it's about gaining actionable insights that drive meaningful improvements in your systems and processes.&lt;/p&gt;

&lt;p&gt;The journey to effective observability requires a careful balance between comprehensive coverage and practical implementation. Your chosen solution should seamlessly integrate with your existing tech stack, enhancing rather than disrupting your current workflows. It's crucial to find a tool that not only provides rich, full-stack visibility but also aligns with your team's skills, your budget constraints, and your overall operational goals.&lt;/p&gt;

&lt;p&gt;Remember, observability is a double-edged sword. When implemented effectively, it can provide unprecedented insights into your systems, enabling proactive problem-solving and continuous improvement. However, if not approached thoughtfully, it can lead to unnecessary complexity, spiraling costs, and a false sense of security. The risk of "running half blind" with suboptimal observability practices is real and can have significant implications for your operations and bottom line.&lt;/p&gt;

&lt;p&gt;In this complex landscape, partnering with experts can make all the difference. CloudRaft, with &lt;a href="https://www.cloudraft.io/observability-consulting" rel="noopener noreferrer"&gt;its deep expertise in observability&lt;/a&gt; and extensive partnerships in the field, stands ready to guide you through this journey. Our experience can help you rapidly adopt and optimize modern observability practices, ensuring you reap the full benefits of these powerful tools without falling into common pitfalls.&lt;/p&gt;

&lt;p&gt;By choosing the right observability solution and implementation approach, you're not just collecting data – you're empowering your team with the insights they need to drive innovation, enhance performance, and deliver exceptional user experiences. In today's fast-paced digital environment, that's not just an advantage – it's a necessity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anjul Sahu&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/anjul" rel="noopener noreferrer"&gt;Anjul&lt;/a&gt; is a leading expert and thought leader in observability. Over the last decade and a half, he has seen every wave of how observability and monitoring have evolved in large-scale organizations such as telcos, banks, and Internet startups. He also advises investors and product companies on current trends in observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Madhukar Mishra&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/madhukar-mishra-b55593b8/" rel="noopener noreferrer"&gt;Madhukar&lt;/a&gt; has over a decade of experience building platforms, from a leading e-commerce company in India to companies building Internet-scale products. He is interested in large-scale distributed systems and is a thought leader in developer productivity and SRE.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>cloudraft</category>
      <category>opentelemetry</category>
      <category>thanos</category>
    </item>
    <item>
      <title>Linux Troubleshooting For SREs</title>
      <dc:creator>Madhuri Malviya</dc:creator>
      <pubDate>Fri, 10 Nov 2023 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/linux-troubleshooting-for-sres-44fn</link>
      <guid>https://forem.com/cloudraft/linux-troubleshooting-for-sres-44fn</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;As a Linux user or administrator, mastering the art of troubleshooting is crucial. Regardless of how well-designed and optimized your systems may be, issues are bound to arise from time to time. These can range from minor hiccups to critical problems that hinder the performance and availability of your Linux machines or containers. In this article, we will explore real-life examples of performance issues and provide a collection of useful Linux commands to troubleshoot everything from CPU and I/O to network and errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common causes of performance issues
&lt;/h3&gt;

&lt;p&gt;Performance issues can be caused by a variety of factors. Common causes include insufficient memory or CPU resources, disk I/O bottlenecks, network congestion, inefficient code, and bugs. Misconfigurations, outdated software, and runaway or zombie processes can also impact performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  The importance of Linux troubleshooting
&lt;/h3&gt;

&lt;p&gt;By diligently troubleshooting and resolving the issues, you not only ensure the smooth operation of your systems but also minimize the MTTR (Mean Time to Repair) – the average time it takes to fix a problem.&lt;/p&gt;

&lt;p&gt;Mastering Linux troubleshooting allows you to swiftly diagnose and resolve performance bottlenecks, errors, and other issues that can potentially disrupt your operations.&lt;/p&gt;

&lt;p&gt;To diagnose and resolve problems on your Linux machines efficiently, there are two widely used approaches: the RED and USE methodologies.&lt;/p&gt;

&lt;h4&gt;
  
  
  RED Methodology
&lt;/h4&gt;

&lt;p&gt;The RED methodology focuses on three indicators: Rate, Errors, and Duration. It is aimed at request-driven systems such as modern web applications, with the goal of resolving performance issues and keeping the application running smoothly. Suppose your service is becoming unresponsive; to resolve the issue, you first look into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate&lt;/strong&gt;: It measures the number of requests that the service receives per unit of time. An unexpectedly high request rate could be indicative of an increased load on the server, causing performance issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error&lt;/strong&gt;: This metric tracks the number of errors that occur during the processing of requests. When dealing with a slow server, monitoring for errors is crucial in identifying any issues or bugs within the server's processing logic. It would pinpoint the root cause of inefficiency and you will be able to resolve the issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt;: It measures the time taken by the server to process each request. For a slow service, analyzing the duration helps identify the specific requests or processes that are taking longer than usual to complete. By identifying the slow-performing components, you can focus on optimizing those areas.&lt;/p&gt;

&lt;h4&gt;
  
  
  USE Methodology
&lt;/h4&gt;

&lt;p&gt;The USE methodology focuses on identifying problems with system resources using three criteria: Utilization, Saturation, and Errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utilization:&lt;/strong&gt; Here we monitor how resources are used and whether they are being pushed to their limits. High utilization leads to slower performance because no more work can be accepted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturation&lt;/strong&gt;: When processes wait a long time for a resource, that resource is saturated. When dealing with a slow server, monitoring saturation helps identify backlogs or queues that are delaying request processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors&lt;/strong&gt;: This looks for error events, such as failed operations or system warnings, that can cause your system to hang, slow down, or crash. By examining error rates and types, you can pinpoint specific areas where errors are prevalent, helping you identify the root causes of the slowdown.&lt;/p&gt;

&lt;p&gt;In the next section, we will explore real-life examples of common problems in Linux systems and walk through step-by-step solutions using powerful commands and techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential troubleshooting commands and techniques
&lt;/h3&gt;

&lt;p&gt;Let's dive into the different scenarios in which we can use these troubleshooting mechanisms. In this article, we are using Ubuntu and an Intel processor; if you are on a different system or architecture, the output will be slightly different.&lt;br&gt;
Suppose you are on call and have an incident that requires troubleshooting a performance issue on a Linux machine or container. Don't worry, we've got you covered! Here are some commands to come to your rescue.&lt;/p&gt;
&lt;h4&gt;
  
  
  top
&lt;/h4&gt;

&lt;p&gt;This command provides real-time information about system resource usage, including CPU, memory, and running processes. It is easy to get overwhelmed by this output, so here is what to look for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;top - 20:31:34 up 1 day,  6:05,  1 user,  load average: 0.50, 0.65, 0.57
      Tasks:  88 total,   1 running,  87 sleeping,   0 stopped,   0 zombie
%Cpu&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:  0.0 us,  0.0 sy,  0.0 ni, 99.6 &lt;span class="nb"&gt;id&lt;/span&gt;,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :    941.6 total,    214.6 free,    197.8 used,    529.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    577.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2180 ubuntu    20   0   17208   5580   3144 S   0.3   0.6   0:01.75 sshd
   6387 ubuntu    20   0   10776   3860   3288 R   0.3   0.4   0:00.01 top
      1 root      20   0  102004  10812   6096 S   0.0   1.1   0:05.79 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.03 kthreadd
      3 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      5 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq
      6 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 netns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Load Average&lt;/code&gt;- indicates the average system load over the past 1, 5, and 15 minutes, respectively. A load average of 1 represents a fully utilized single-core CPU. Higher values indicate an increasingly overloaded system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Zombie Processes&lt;/code&gt;- processes that have finished execution but still occupy an entry in the process table. If any are present, a parent process is failing to reap its children, which points to a process management issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;%CPU&lt;/code&gt;- shows the share of CPU time each process is using. Sustained values near 100% indicate a CPU-bound process that may be starving others.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;S&lt;/code&gt;- shows the process state. If your web server is not responding at all, you may see the "D" (uninterruptible sleep) state, indicating the process is stuck, typically waiting on I/O.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RES&lt;/code&gt; (resident memory usage)- indicates the amount of physical memory being used by the application. If the system is actively paging, this value can be exceptionally high, indicating that the process is demanding more memory than is physically available. That can lead to frequent swapping between RAM and swap space, causing a performance bottleneck.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SHR&lt;/code&gt; (shared memory)- if this value is high, it could suggest that the application relies heavily on shared libraries or engages in unnecessary data sharing, leading to resource contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a process consuming large amounts of virtual memory is slowing your system down, you can find it through the &lt;code&gt;VIRT&lt;/code&gt; column.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
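&lt;p&gt;If &lt;code&gt;top&lt;/code&gt; reports zombies, you can enumerate them even on a minimal system with no extra tools installed. A small sketch that walks &lt;code&gt;/proc&lt;/code&gt; directly; the &lt;code&gt;State:&lt;/code&gt; line in &lt;code&gt;/proc/PID/status&lt;/code&gt; holds the same process state shown in top's &lt;code&gt;S&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Walk every numeric entry in /proc and print PIDs whose state is Z (zombie)
for pid in /proc/[0-9]*; do
    state=$(awk '/^State:/ {print $2}' "$pid/status" 2&amp;gt;/dev/null)
    if [ "$state" = "Z" ]; then
        echo "zombie pid: ${pid##*/}"
    fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the loop prints anything, check the parent (the &lt;code&gt;PPid:&lt;/code&gt; line in the same file) to see which process is failing to reap its children.&lt;/p&gt;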

&lt;h4&gt;
  
  
  sar
&lt;/h4&gt;

&lt;p&gt;This command collects, reports, and saves system activity information over a period of time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;sar &lt;span class="nt"&gt;-n&lt;/span&gt; TCP,ETCP 1
Linux 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;top-gerbil&lt;span class="o"&gt;)&lt;/span&gt;       11/07/23        _x86_64_        &lt;span class="o"&gt;(&lt;/span&gt;1 CPU&lt;span class="o"&gt;)&lt;/span&gt;

21:29:02     active/s  passive/s  iseg/s    oseg/s
21:29:03         0.00      0.00       0.00         0.00

21:29:02     atmptf/s  estres/s  retrans/s   isegerr/s       orsts/s
21:29:03         0.00      0.00      0.00         0.00              0.00

21:29:03     active/s  passive/s    iseg/s    oseg/s
21:29:04         0.00      0.00       11.00        11.00

21:29:03     atmptf/s  estres/s   retrans/s isegerr/s       orsts/s
21:29:04         0.00      0.00      0.00         0.00             0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can check the network interface statistics, which provide metrics such as bytes transmitted and received, packet counts, errors, and drops, all useful for monitoring network performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also check information about process queues and scheduling activity (for example with &lt;code&gt;sar -q&lt;/code&gt;), which helps resolve issues related to process management and scheduling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;active/s&lt;/code&gt; and &lt;code&gt;passive/s&lt;/code&gt;- help identify why your web server is unresponsive. &lt;code&gt;active/s&lt;/code&gt; counts TCP connections the machine initiates, while &lt;code&gt;passive/s&lt;/code&gt; counts incoming connections. A higher-than-usual &lt;code&gt;passive/s&lt;/code&gt; can indicate a sudden spike in traffic, or even a DoS attack, with incoming requests not being processed efficiently due to a lack of resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;retrans/s&lt;/code&gt;&lt;/em&gt;- helps identify network congestion or unreliable links. For example, if file transfers are slow because the network suffers high packet loss, fixing the loss reduces retransmissions and increases transfer speed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;estres/s&lt;/code&gt;- the number of established TCP connections reset per second; a consistently high value suggests connections are being dropped unexpectedly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;orsts/s&lt;/code&gt;- the number of TCP RST segments sent per second. A high value means connections are being aborted, which can point to application errors or unreliable links and suggests low quality of service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;iseg/s&lt;/code&gt;- the number of TCP segments received per second. A high value indicates a surge in incoming requests, which can stress your network infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
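&lt;p&gt;&lt;code&gt;sar&lt;/code&gt; comes from the sysstat package. When it is not installed, the cumulative kernel counters it derives its TCP rates from are still readable in &lt;code&gt;/proc/net/snmp&lt;/code&gt;. A rough sketch that pairs the header row with the value row and pulls the totals behind &lt;code&gt;retrans/s&lt;/code&gt; and &lt;code&gt;oseg/s&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The two Tcp: lines hold counter names and cumulative values since boot
$ grep '^Tcp:' /proc/net/snmp | awk '
    NR==1 { for (i=2; i&amp;lt;=NF; i++) name[i] = $i }
    NR==2 { for (i=2; i&amp;lt;=NF; i++)
              if (name[i] == "OutSegs" || name[i] == "RetransSegs")
                  print name[i], $i }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sampling the same counters twice and subtracting gives the per-second rates sar reports.&lt;/p&gt;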

&lt;h4&gt;
  
  
  free
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;free&lt;/code&gt; command is used to display the amount of free and used memory in the system, including both physical and swap memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~&lt;span class="nv"&gt;$ &lt;/span&gt;free &lt;span class="nt"&gt;-m&lt;/span&gt;
               total        used        free      shared  buff/cache   available
Mem:            7828        1896        2996        1010        2935        4382
Swap:           16023           0       16023
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;shared&lt;/code&gt;&lt;/em&gt;- memory used by tmpfs filesystems and shared memory segments (the &lt;code&gt;Shmem&lt;/code&gt; value from &lt;code&gt;/proc/meminfo&lt;/code&gt;). If it is high, in-memory filesystems or applications making heavy use of shared memory may be responsible for high memory usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;buff/cache&lt;/code&gt;-&lt;/em&gt; memory used by kernel buffers and the page cache for frequently accessed files. A large value here is normal and generally improves I/O performance, since this memory is reclaimed when applications need it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
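&lt;p&gt;&lt;code&gt;free&lt;/code&gt; itself reads &lt;code&gt;/proc/meminfo&lt;/code&gt;, so the same numbers can be checked directly. A quick sketch; &lt;code&gt;MemAvailable&lt;/code&gt; is the kernel's estimate of how much memory new workloads can use without swapping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# /proc/meminfo reports kB; divide by 1024 to match free -m's MiB
$ awk '/^MemTotal:|^MemAvailable:/ {printf "%s %d MiB\n", $1, $2/1024}' /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;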

&lt;h4&gt;
  
  
  vmstat
&lt;/h4&gt;

&lt;p&gt;This command is used to report virtual memory statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;vmstat 1
procs &lt;span class="nt"&gt;-----------memory----------&lt;/span&gt; &lt;span class="nt"&gt;---swap--&lt;/span&gt; &lt;span class="nt"&gt;-----io----&lt;/span&gt; &lt;span class="nt"&gt;-system--&lt;/span&gt; &lt;span class="nt"&gt;------cpu-----&lt;/span&gt;
 r  b   swpd   free   buff  cache   si   so    bi    bo   &lt;span class="k"&gt;in   &lt;/span&gt;cs us sy &lt;span class="nb"&gt;id &lt;/span&gt;wa st
 0  0      0 219264  20164 524172    0    0    17    28  352   55  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  346   94  0  1 99  0  0
 0  0      0 219264  20164 524176    0    0     0     0  296   82  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  321   71  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  348   83  0  1 99  0  0
 0  0      0 219264  20164 524176    0    0     0     0  324   75  0  0 100  0  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;r&lt;/code&gt;-&lt;/em&gt; indicates the number of runnable processes (running or waiting for CPU). If your web server is slow and &lt;code&gt;r&lt;/code&gt; is consistently higher than the number of CPU cores, many processes are competing for CPU time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;swpd&lt;/code&gt;, &lt;code&gt;free&lt;/code&gt;, &lt;code&gt;buff&lt;/code&gt;, &lt;code&gt;cache&lt;/code&gt;, &lt;code&gt;si&lt;/code&gt;, &lt;code&gt;so&lt;/code&gt;&lt;/em&gt;- describe memory: how much is free or cached, and how much is being swapped in and out from disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;in&lt;/code&gt; and &lt;code&gt;cs&lt;/code&gt;-&lt;/em&gt; indicate the number of interrupts and context switches per second. A high &lt;code&gt;cs&lt;/code&gt; value points to frequent context switching, which can reduce CPU performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;id&lt;/code&gt; and &lt;code&gt;wa&lt;/code&gt;&lt;/em&gt;- indicate the percentage of time the CPU is idle and the time spent waiting for I/O operations. A high &lt;code&gt;wa&lt;/code&gt; value signals an I/O bottleneck slowing the CPU down, while a high &lt;code&gt;id&lt;/code&gt; value means the CPU has headroom for more work.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
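&lt;p&gt;The &lt;code&gt;cs&lt;/code&gt; column is a rate computed from a cumulative kernel counter, so you can reproduce it by sampling &lt;code&gt;/proc/stat&lt;/code&gt; yourself. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The ctxt line in /proc/stat is total context switches since boot;
# sample it twice, one second apart, to get a per-second rate
a=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
b=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches per second: $((b - a))"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;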

&lt;h4&gt;
  
  
  mpstat
&lt;/h4&gt;

&lt;p&gt;You are working on a server running multiple applications that heavily rely on CPU resources. You noticed that some services are not responding as quickly as they should, and that there are occasional service disruptions.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;mpstat&lt;/code&gt; command which will display CPU usage statistics for all available processors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;mpstat &lt;span class="nt"&gt;-P&lt;/span&gt; ALL
Linux 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;top-gerbil&lt;span class="o"&gt;)&lt;/span&gt;    11/07/23        _x86_64_        &lt;span class="o"&gt;(&lt;/span&gt;1 CPU&lt;span class="o"&gt;)&lt;/span&gt;

23:44:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
23:44:28     all    0.04    0.01    0.09    0.04    0.00    0.19    0.00    0.00    0.00   99.63
23:44:28       0    0.04    0.01    0.09    0.04    0.00    0.19    0.00    0.00    0.00   99.63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;%usr&lt;/code&gt;&lt;/em&gt;- percentage of CPU time spent on user-level processes. If this is unusually high, it would indicate that certain user applications or processes are consuming excessive CPU resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;%sys&lt;/code&gt; -&lt;/em&gt; percentage of CPU time spent on system processes. If this parameter is high, it suggests that the kernel or system services are utilizing a substantial amount of CPU time, which might point to a system-level issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;%iowait&lt;/code&gt;- percentage of time CPU spends waiting for I/O operations. An increased value in this might imply that the system is experiencing I/O bottlenecks or storage-related problems, resulting in the CPU waiting for I/O operations to complete.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
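&lt;p&gt;&lt;code&gt;mpstat&lt;/code&gt;'s per-CPU percentages are computed from the &lt;code&gt;cpu&lt;/code&gt; lines in &lt;code&gt;/proc/stat&lt;/code&gt;, so even without sysstat installed you can eyeball the raw counters. A sketch; the fields are cumulative jiffies per category:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Fields after the CPU name: user, nice, system, idle, iowait, irq, softirq
$ awk '/^cpu/ {print $1, "user="$2, "system="$4, "idle="$5, "iowait="$6}' /proc/stat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A steadily growing &lt;code&gt;iowait&lt;/code&gt; counter on one CPU is the same signal as a high &lt;code&gt;%iowait&lt;/code&gt; in mpstat.&lt;/p&gt;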

&lt;h4&gt;
  
  
  iostat
&lt;/h4&gt;

&lt;p&gt;You are experiencing slow disk performance, resulting in delayed read/write operations and increased latency for applications reliant on disk access.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;iostat&lt;/code&gt; to monitor the I/O performance of the system's storage devices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;iostat &lt;span class="nt"&gt;-dx&lt;/span&gt; 5
Linux 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;top-gerbil&lt;span class="o"&gt;)&lt;/span&gt;    11/08/23        _x86_64_        &lt;span class="o"&gt;(&lt;/span&gt;1 CPU&lt;span class="o"&gt;)&lt;/span&gt;

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
loop0            0.00      0.02     0.00   0.00    0.65     8.97    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop1            0.00      0.01     0.00   0.00    1.95    16.89    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop2            0.01      0.47     0.00   0.00    0.83    43.30    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     1.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sda              0.25     15.21     0.04  14.47    1.67    60.24    0.31     25.99     0.47  60.09    5.01    83.61    0.00      0.00     0.00   0.00    0.00     0.00    0.07    3.38    0.00   0.12
sr0              0.00      0.00     0.00   0.00    0.70     2.92    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Columns such as &lt;code&gt;r/s&lt;/code&gt;, &lt;code&gt;w/s&lt;/code&gt;, &lt;code&gt;rkB/s&lt;/code&gt;, and &lt;code&gt;wkB/s&lt;/code&gt; record reads and writes per second and the corresponding throughput for each device.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;aqu-sz&lt;/code&gt;&lt;/em&gt;- indicates the average number of requests queued for the device. A value consistently greater than 1 suggests the device is saturated.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
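&lt;p&gt;Like the other sysstat tools, &lt;code&gt;iostat&lt;/code&gt; computes its rates from kernel counters, in this case &lt;code&gt;/proc/diskstats&lt;/code&gt;. A rough sketch of reading them directly; the device-name pattern is just an example and should match your disks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Field 3 is the device name, field 4 reads completed, field 8 writes completed
# (cumulative since boot; sample twice to turn them into rates)
$ awk '$3 ~ /^(sd|vd|nvme|xvd)/ {print $3, "reads="$4, "writes="$8}' /proc/diskstats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;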

&lt;h4&gt;
  
  
  df
&lt;/h4&gt;

&lt;p&gt;You received an alert that your disk partition is full and the system is becoming unresponsive.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;df&lt;/code&gt; command to display information about the disk space usage of file systems. It provides an overview of available, used, and total disk space, as well as the mounted file systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-kh&lt;/span&gt;
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            97M  1.2M   96M   2% /run
/dev/sda1       4.7G  2.3G  2.4G  50% /
tmpfs           482M     0  482M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15       98M  5.1M   93M   6% /boot/efi
tmpfs            97M  4.0K   97M   1% /run/user/1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Filesystem&lt;/code&gt;- identifies the device or remote share behind each mount point. If a user is unable to save a file on a network share, this column shows which filesystem that share is mounted from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;Used&lt;/code&gt;&lt;/em&gt;- indicate how much space is currently in use, if it’s close to the storage capacity you need to free up space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
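&lt;p&gt;Once &lt;code&gt;df&lt;/code&gt; tells you which filesystem is full, &lt;code&gt;du&lt;/code&gt; shows where the space went. A typical sketch; the path is an example, and &lt;code&gt;-x&lt;/code&gt; keeps du from crossing into other mounted filesystems:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Ten largest directories directly under /, human-readable, largest first
$ du -xh --max-depth=1 / 2&amp;gt;/dev/null | sort -rh | head -n 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Repeat the same command inside the biggest directory to drill down to the offending files.&lt;/p&gt;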

&lt;h4&gt;
  
  
  ifconfig
&lt;/h4&gt;

&lt;p&gt;You need to troubleshoot network connectivity issues on a Linux server.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ifconfig&lt;/code&gt; command (from the legacy net-tools package; &lt;code&gt;ip addr&lt;/code&gt; is its modern replacement) shows all configured network interfaces and their IP addresses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ifconfig
docker0: &lt;span class="nv"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4099&amp;lt;UP,BROADCAST,MULTICAST&amp;gt;  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:96ff:fed3:d4d2  prefixlen 64  scopeid 0x20&amp;lt;&lt;span class="nb"&gt;link&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        ether 02:42:96:d3:d4:d2  txqueuelen 0  &lt;span class="o"&gt;(&lt;/span&gt;Ethernet&lt;span class="o"&gt;)&lt;/span&gt;
        RX packets 4128  bytes 183296 &lt;span class="o"&gt;(&lt;/span&gt;183.2 KB&lt;span class="o"&gt;)&lt;/span&gt;
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6243  bytes 85480503 &lt;span class="o"&gt;(&lt;/span&gt;85.4 MB&lt;span class="o"&gt;)&lt;/span&gt;
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;inet&lt;/code&gt;&lt;/em&gt; and &lt;em&gt;&lt;code&gt;inet6&lt;/code&gt;&lt;/em&gt;- show the IPv4 and IPv6 addresses attached to the interface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;broadcast&lt;/code&gt;&lt;/em&gt;- helps identify broadcast-related issues; a misconfigured broadcast address can break features such as network discovery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;netmask&lt;/code&gt;- helps diagnose subnet-related problems that can cause communication issues between devices on different subnets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;mtu&lt;/code&gt;&lt;/em&gt;- the maximum transmission unit, i.e. the largest packet size that can be sent on the interface. Mismatched MTUs along a path can lead to fragmentation, ultimately hurting performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
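&lt;p&gt;The RX/TX statistics that ifconfig prints are exposed per interface under &lt;code&gt;/sys/class/net&lt;/code&gt;, which is handy on minimal systems where net-tools is not installed. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print byte and error counters for every network interface
for dev in /sys/class/net/*; do
    printf '%s rx_bytes=%s tx_bytes=%s rx_errors=%s\n' \
        "${dev##*/}" \
        "$(cat "$dev/statistics/rx_bytes")" \
        "$(cat "$dev/statistics/tx_bytes")" \
        "$(cat "$dev/statistics/rx_errors")"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A steadily climbing &lt;code&gt;rx_errors&lt;/code&gt; counter on one interface is the same signal as a non-zero "RX errors" line in ifconfig.&lt;/p&gt;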

&lt;h4&gt;
  
  
  dmesg
&lt;/h4&gt;

&lt;p&gt;A Linux server is experiencing hardware issues, such as disk errors or network interface failures. You must analyze the system logs to identify any potential hardware-related errors or warnings.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dmesg&lt;/code&gt; command provides information about hardware devices, system events, and potential issues encountered during system operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dmesg
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] Linux version 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;buildd@lcy02-amd64-058&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;gcc &lt;span class="o"&gt;(&lt;/span&gt;Ubuntu 11.4.0-1ubuntu1~22.04&lt;span class="o"&gt;)&lt;/span&gt; 11.4.0, GNU ld &lt;span class="o"&gt;(&lt;/span&gt;GNU Binutils &lt;span class="k"&gt;for &lt;/span&gt;Ubuntu&lt;span class="o"&gt;)&lt;/span&gt; 2.38&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c"&gt;#98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 (Ubuntu 5.15.0-88.98-generic 5.15.126)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] Command line: &lt;span class="nv"&gt;BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/boot/vmlinuz-5.15.0-88-generic &lt;span class="nv"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;UUID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5a569d86-b935-46dd-ae79-7a72a25b6a4c ro &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tty1 &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ttyS0
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] KERNEL supported cpus:
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000]   Intel GenuineIntel
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000]   AMD AuthenticAMD
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-provided physical RAM map:
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x0000000000000000-0x000000000009ffff] usable
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x0000000000100000-0x000000003e1b6fff] usable
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x000000003e1b7000-0x000000003e1fffff] reserved
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x000000003e200000-0x000000003eceefff] &lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x000000003f36b000-0x000000003ffeffff] reserved
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x00000000ffc00000-0x00000000ffffffff] reserved
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] NX &lt;span class="o"&gt;(&lt;/span&gt;Execute Disable&lt;span class="o"&gt;)&lt;/span&gt; protection: active
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By analyzing the logs, you can identify any hardware issues or error messages that could be affecting the server's performance and stability.&lt;/p&gt;

&lt;h4&gt;
  
  
  journalctl
&lt;/h4&gt;

&lt;p&gt;This command displays the logs collected by the systemd journal, including kernel messages and the logs of system services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;journalctl
Nov 01 17:15:42 ubuntu kernel: Linux version 5.15.0-87-generic &lt;span class="o"&gt;(&lt;/span&gt;buildd@lcy02-amd64-011&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;gcc &lt;span class="o"&gt;(&lt;/span&gt;Ubuntu 11.4.0-1ubuntu1~22.04&lt;span class="o"&gt;)&lt;/span&gt; 11.4.0, GNU ld &lt;span class="o"&gt;(&lt;/span&gt;GNU Binutils&amp;gt;
Nov 01 17:15:42 ubuntu kernel: Command line: &lt;span class="nv"&gt;BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/boot/vmlinuz-5.15.0-87-generic &lt;span class="nv"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;LABEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cloudimg-rootfs ro &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tty1 &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ttyS0
Nov 01 17:15:42 ubuntu kernel: KERNEL supported cpus:
Nov 01 17:15:42 ubuntu kernel:   Intel GenuineIntel
Nov 01 17:15:42 ubuntu kernel:   AMD AuthenticAMD
Nov 01 17:15:42 ubuntu kernel:   Hygon HygonGenuine
Nov 01 17:15:42 ubuntu kernel: secureboot: Secure boot disabled
Nov 01 17:15:42 ubuntu kernel: SMBIOS 2.5 present.
Nov 01 17:15:42 ubuntu kernel: DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Nov 01 17:15:42 ubuntu kernel: Hypervisor detected: KVM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose a service is failing to start on your Linux system. Using the &lt;code&gt;journalctl&lt;/code&gt; command, you can get detailed log information about the service's attempts to start and any associated error messages.&lt;/p&gt;
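&lt;p&gt;For example, you can filter the journal by unit and severity. The flags below are standard &lt;code&gt;journalctl&lt;/code&gt; options; &lt;code&gt;nginx.service&lt;/code&gt; is a placeholder for whichever unit is failing:&lt;/p&gt;

```shell
# Show logs for one unit since the current boot (nginx.service is a placeholder)
journalctl -u nginx.service -b

# Follow new entries live, showing only messages of priority "err" or worse
journalctl -u nginx.service -f -p err
```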

&lt;h4&gt;
  
  
  nicstat
&lt;/h4&gt;

&lt;p&gt;It offers comprehensive statistics on network interfaces, including data on failures, packets, and bandwidth usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;nicstat
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
01:52:51       lo    0.00    0.00    0.01    0.01   93.01   93.01  0.00   0.00
01:52:51   enp0s3    4.72    0.05    3.33    0.29  1451.0   175.4  0.00   0.00
01:52:51  docker0    0.00    0.68    0.03    0.05   44.40 13692.2  0.00   0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;Time&lt;/code&gt;&lt;/em&gt; - the timestamp of each sample. If a network issue appears suddenly, the timestamps help you correlate it with patterns or potential triggers that led to the problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;Int&lt;/code&gt;&lt;/em&gt; - the name of each network interface. In a multi-interface environment, this lets you pinpoint which interface is causing the problem and resolve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;rKB/s, wKB/s, rPk/s, wPk/s&lt;/code&gt; - the kilobytes and packets read (received) and written (transmitted) per second, showing how much data each interface is actually moving.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;rAvs, wAvs&lt;/code&gt; - the average sizes of the packets read and written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;%Util, Sat&lt;/code&gt;&lt;/em&gt; - interface utilization and saturation (errors such as drops per second). Values higher than usual indicate an overloaded interface, which leads to queuing, latency, and lag in the network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
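&lt;p&gt;For continuous monitoring, &lt;code&gt;nicstat&lt;/code&gt; can sample at an interval. The flags below are standard nicstat options; &lt;code&gt;enp0s3&lt;/code&gt; is the interface from the sample output above, so substitute your own:&lt;/p&gt;

```shell
# Print statistics for one interface every second, five times
nicstat -i enp0s3 1 5

# Include TCP and UDP protocol statistics as well
nicstat -t -u 1 5
```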

&lt;h4&gt;
  
  
  lsof
&lt;/h4&gt;

&lt;p&gt;A file is continuously growing in size which was not expected. You need to identify which process is writing into the file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;lsof&lt;/code&gt; command gives a list of files that are opened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lsof

ubuntu@top-gerbil:/&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lsof &lt;span class="nt"&gt;-R&lt;/span&gt;
COMMAND    PID  TID TASKCMD   PPID       USER   FD      TYPE     DEVICE SIZE/OFF   NODE NAME
systemd      1                  0       root  cwd       DIR      8,1     4096       2    /
systemd      1                  0       root  rtd       DIR      8,1     4096       2    /
systemd      1                  0      root  txt       REG      8,1    1849992    3335 /usr/lib/systemd/system
container 3243                  1       root  txt       REG      8,1    52632728   39545 /usr/bin/containerd
container 3243                  1       root  mem-W     REG      8,1    32768      73792 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PID&lt;/code&gt;- the process ID associated with each open file. Once you know which process holds the file open, you can monitor, inspect, or stop it to resolve the issue.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USER&lt;/code&gt;- indicates which user owns the process that has the file open.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FD&lt;/code&gt;- the file descriptor; a trailing &lt;code&gt;w&lt;/code&gt; (write) or &lt;code&gt;u&lt;/code&gt; (read and write) marks processes that can be writing to the file.&lt;/li&gt;
&lt;/ul&gt;
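&lt;p&gt;To answer the original question of which process is writing to the growing file, you can point &lt;code&gt;lsof&lt;/code&gt; directly at it (the path below is a placeholder):&lt;/p&gt;

```shell
# List processes holding the file open; an FD ending in 'w' or 'u' can write to it
sudo lsof /var/log/app/growing.log

# Repeat the listing every 2 seconds to watch the writer over time
sudo lsof -r 2 /var/log/app/growing.log
```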

&lt;h4&gt;
  
  
  pstack
&lt;/h4&gt;

&lt;p&gt;You have an application running on your Linux system that suddenly becomes unresponsive or experiences a segmentation fault. The issue might be related to the application’s call stack.&lt;/p&gt;

&lt;p&gt;Use the &lt;code&gt;pstack&lt;/code&gt; command along with the process ID of the running process or the core dump file generated during the crash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pstack 432

Thread 1 &lt;span class="o"&gt;(&lt;/span&gt;Thread 0x7f7f03600700 &lt;span class="o"&gt;(&lt;/span&gt;LWP 6516&lt;span class="o"&gt;))&lt;/span&gt;:
&lt;span class="c"&gt;#0  0x00007f7f0576b9d5 in poll () from /lib64/libc.so.6&lt;/span&gt;
&lt;span class="c"&gt;#1  0x00007f7f06f47b36 in ?? () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#2  0x00007f7f06f47c1a in g_main_context_iteration () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#3  0x00007f7f073d587d in ?? () from /usr/lib64/libgio-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#4  0x00007f7f06f6f16d in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#5  0x00007f7f07471d7a in ?? () from /usr/lib64/libgio-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#6  0x00007f7f06f1e82f in ?? () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#7  0x00007f7f0692fdd5 in start_thread () from /lib64/libpthread.so.0&lt;/span&gt;
&lt;span class="c"&gt;#8  0x00007f7f0577703d in clone () from /lib64/libc.so.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It provides you with the stack trace of the application, displaying the function calls and corresponding memory addresses at the time of the crash. By analyzing the stack trace, you can identify the specific function or module causing the issue and gain insights into the application's behavior leading up to the crash.&lt;/p&gt;
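&lt;p&gt;Note that &lt;code&gt;pstack&lt;/code&gt; is not installed by default on many distributions. Where it is unavailable, an equivalent stack dump can be obtained with &lt;code&gt;gdb&lt;/code&gt; (PID 432 is the example process from above):&lt;/p&gt;

```shell
# Attach non-interactively, print backtraces for all threads, then detach
gdb -p 432 -batch -ex 'thread apply all bt'
```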

&lt;h4&gt;
  
  
  strace
&lt;/h4&gt;

&lt;p&gt;It intercepts and records all system calls made by a process and the signals the process receives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;strace &lt;span class="nt"&gt;-p&lt;/span&gt; 5647

openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/proc/self/mountinfo"&lt;/span&gt;, O_RDONLY&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;""&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFREG|0444, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, ...&lt;span class="o"&gt;}&lt;/span&gt;, AT_EMPTY_PATH&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;"23 29 0:21 / /sys rw,nosuid,node"&lt;/span&gt;..., 1024&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1024
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;"rmware/efi/efivars rw,nosuid,nod"&lt;/span&gt;..., 1024&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1024
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;"re20/2015 ro,nodev,relatime shar"&lt;/span&gt;..., 1024&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 973
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;""&lt;/span&gt;, 1024&lt;span class="o"&gt;)&lt;/span&gt;                       &lt;span class="o"&gt;=&lt;/span&gt; 0
lseek&lt;span class="o"&gt;(&lt;/span&gt;3, 0, SEEK_CUR&lt;span class="o"&gt;)&lt;/span&gt;                   &lt;span class="o"&gt;=&lt;/span&gt; 3021
close&lt;span class="o"&gt;(&lt;/span&gt;3&lt;span class="o"&gt;)&lt;/span&gt;                                &lt;span class="o"&gt;=&lt;/span&gt; 0
ioctl&lt;span class="o"&gt;(&lt;/span&gt;1, TCGETS, &lt;span class="o"&gt;{&lt;/span&gt;B38400 opost isig icanon &lt;span class="nb"&gt;echo&lt;/span&gt; ...&lt;span class="o"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/run"&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFDIR|0755, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;980, ...&lt;span class="o"&gt;}&lt;/span&gt;, 0&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/"&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFDIR|0755, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4096, ...&lt;span class="o"&gt;}&lt;/span&gt;, 0&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/sys/kernel/security"&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFDIR|0755, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, ...&lt;span class="o"&gt;}&lt;/span&gt;, 0&lt;span class="o"&gt;)=&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;open&lt;/code&gt;/&lt;code&gt;openat&lt;/code&gt; system calls - identify problems related to the accessibility of a file. By examining the file paths referenced in these calls, you can determine whether any file-related issues are contributing to the application's failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt; system calls - show the data read from or written to specific files or resources. If there are issues with reading or writing data, these calls can pinpoint where the problem lies, such as incorrect data handling or file manipulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;errno&lt;/code&gt;- the error code a failed system call returns, which strace prints alongside the return value. Analyze the type of error encountered, such as &lt;code&gt;ENOENT&lt;/code&gt; (file not found), &lt;code&gt;EACCES&lt;/code&gt; (permission denied), or &lt;code&gt;EINVAL&lt;/code&gt; (invalid argument).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
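&lt;p&gt;In practice, you usually narrow &lt;code&gt;strace&lt;/code&gt; to the calls of interest instead of capturing everything. For instance (PID 5647 is the example process from above):&lt;/p&gt;

```shell
# Trace only file-related system calls of a running process
strace -p 5647 -e trace=openat,read,write,close

# Summarize call counts, errors, and time spent instead of printing a raw stream
strace -c -p 5647
```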

&lt;h4&gt;
  
  
  eBPF Performance Tools
&lt;/h4&gt;

&lt;p&gt;There's so much buzz in the market about BPF. Big companies like Meta and Amazon are using it. But what exactly is BPF? It stands for Berkeley Packet Filter, a mechanism originally used for filtering and monitoring network traffic. It has since evolved into extended BPF (eBPF), which is like a magic wand for Linux: it lets you safely run small programs inside the kernel to trace complex issues.&lt;/p&gt;

&lt;p&gt;Brendan Gregg, formerly a performance engineer at Netflix, shares his troubleshooting expertise in his book &lt;strong&gt;&lt;a href="https://www.brendangregg.com/bpf-performance-tools-book.html" rel="noopener noreferrer"&gt;BPF Performance Tools&lt;/a&gt;&lt;/strong&gt;, which is a must-read. He breaks performance down into different domains, offering practical examples of BPF in action using BCC and bpftrace.&lt;/p&gt;
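&lt;p&gt;As a small taste, a classic &lt;code&gt;bpftrace&lt;/code&gt; one-liner counts system calls per process until you press Ctrl-C. This assumes bpftrace is installed and requires root on a reasonably recent kernel:&lt;/p&gt;

```shell
# Count syscalls by process name; the map is printed when the program exits
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```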

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, we saw that it is imperative to troubleshoot using a methodical approach, as it offers an organized and systematic means of locating issues and resolving them quickly and effectively. Without a systematic method, troubleshooting becomes disorganized and time-consuming, frequently resulting in frustration, trial-and-error fixes, and, in some situations, worse problems.&lt;/p&gt;

&lt;p&gt;Tired of sifting through convoluted outputs? I highly recommend exploring the alternative tools suggested by Julia Evans in her article &lt;a href="https://jvns.ca/blog/2022/04/12/a-list-of-new-ish--command-line-tools/" rel="noopener noreferrer"&gt;A list of new(ish) command line tools&lt;/a&gt; to optimize your workflow. For instance, &lt;code&gt;angle-grinder&lt;/code&gt; can outperform traditional log-analysis pipelines built on &lt;code&gt;grep&lt;/code&gt;, with precise and efficient results.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=Wb_vD3XZYOA" rel="noopener noreferrer"&gt;official documentary&lt;/a&gt; on the groundbreaking &lt;strong&gt;eBPF technology&lt;/strong&gt;, highlighting its impact on the Linux Kernel and its journey of development with key industry players, including Meta, Intel, Isovalent, Google, Red Hat, and Netflix.&lt;/p&gt;

&lt;p&gt;If you are stuck with the Linux issue and looking for SREs to troubleshoot, &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; for quick support.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Multi-tenancy in Kubernetes using Vcluster</title>
      <dc:creator>Pavan Shiraguppi</dc:creator>
      <pubDate>Thu, 24 Aug 2023 09:40:18 +0000</pubDate>
      <link>https://forem.com/cloudraft/multi-tenancy-in-kubernetes-using-vcluster-2ib9</link>
      <guid>https://forem.com/cloudraft/multi-tenancy-in-kubernetes-using-vcluster-2ib9</guid>
      <description>&lt;p&gt;Kubernetes has revolutionized how organizations deploy and manage containerized applications, making it easier to orchestrate and scale applications across clusters. However, running multiple heterogeneous workloads on a shared Kubernetes cluster comes with challenges like resource contention, security risks, lack of customization, and complex management.&lt;/p&gt;

&lt;p&gt;There are several approaches to implementing isolation and multi-tenancy within Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes namespaces&lt;/strong&gt;: Namespaces allow some isolation by dividing cluster resources between different users. However, namespaces share the same physical infrastructure and kernel resources. So there are limits to isolation and customization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes distributions&lt;/strong&gt;: Popular Kubernetes distributions like &lt;a href="https://www.redhat.com/en/technologies/cloud-computing/openshift" rel="noopener noreferrer"&gt;Red Hat OpenShift&lt;/a&gt; and &lt;a href="https://www.rancher.com/" rel="noopener noreferrer"&gt;Rancher&lt;/a&gt; support virtual clusters. These leverage Kubernetes-native capabilities like namespaces, RBAC, and network policies more efficiently. Other benefits include centralized control planes, pre-configured cluster templates, and easy-to-use management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical namespaces&lt;/strong&gt;: In a traditional Kubernetes cluster, each namespace is independent of the others. This means that users and applications in one namespace cannot access resources in another namespace unless they have explicit permissions. Hierarchical namespaces solve this problem by allowing you to define a parent-child relationship between namespaces. This means that a user or application with permissions in the parent namespace will automatically have permissions in all of the child namespaces. This makes it much easier to manage permissions across multiple namespaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vcluster project&lt;/strong&gt;: The virtual cluster (vcluster) project addresses these pain points by dividing a physical Kubernetes cluster into multiple isolated software-defined clusters. vcluster allows organizations to provide development teams, applications, and customers with dedicated Kubernetes environments with guaranteed resources, security policies, and custom configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post will dive deep into vcluster: its capabilities, different implementation options, use cases, and challenges. We will also look into best practices for maximizing utilization and simplifying the management of vcluster.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Vcluster?
&lt;/h1&gt;

&lt;p&gt;vcluster is an open-source tool that allows you to create and manage virtual Kubernetes clusters. A virtual Kubernetes cluster is a fully functional Kubernetes cluster that runs on top of another Kubernetes cluster. vcluster works by creating a virtual cluster inside a namespace of the underlying Kubernetes cluster. The virtual cluster has its own control plane, but it shares the worker nodes and networking of the underlying cluster. This makes vcluster a lightweight solution that can be deployed on any Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;When you create a vcluster, the vcluster CLI deploys the virtual cluster's control plane as pods inside a namespace of the host cluster. You can then deploy workloads to the virtual cluster using the kubectl CLI; vcluster's syncer copies the resulting pods down to the underlying cluster, where they run on the shared worker nodes.&lt;/p&gt;

&lt;p&gt;You can learn more about vcluster on the vcluster &lt;a href="https://vcluster.com" rel="noopener noreferrer"&gt;website&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benefits of Using Vcluster
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Resource Isolation
&lt;/h2&gt;

&lt;p&gt;vcluster allows you to allocate a portion of the central cluster's resources like CPU, memory, and storage to individual virtual clusters. This prevents noisy neighbor issues when multiple teams share the same physical cluster. Critical workloads can be assured of the resources they need without interference.&lt;/p&gt;
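&lt;p&gt;One way to sketch this at the host level is a standard Kubernetes ResourceQuota on the namespace that hosts the vcluster. The example below is illustrative: the quota name and limits are made up, and the namespace assumes vcluster's default naming for a vcluster called my-first-vcluster:&lt;/p&gt;

```shell
# Cap the CPU and memory available to everything synced into the vcluster's
# host namespace (quota name, namespace, and numbers are illustrative)
kubectl create quota team-a-quota \
  --hard=requests.cpu=4,requests.memory=8Gi,limits.cpu=8,limits.memory=16Gi \
  -n vcluster-my-first-vcluster
```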

&lt;h2&gt;
  
  
  Access Control
&lt;/h2&gt;

&lt;p&gt;With vcluster, access policies can be implemented at the virtual cluster level, ensuring only authorized users have access. For example, sensitive workloads like financial applications can run in an isolated vcluster. Restricting access is much simpler compared to namespace-level policies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F839d230e5e7af9a310459ea7ae559f9bf81dcef4%2Fded1f%2Fdocs%2Fmedia%2Fdiagrams%2Fvcluster-architecture.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F839d230e5e7af9a310459ea7ae559f9bf81dcef4%2Fded1f%2Fdocs%2Fmedia%2Fdiagrams%2Fvcluster-architecture.svg" alt="vcluster architecture"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.vcluster.com/docs/architecture/basics" rel="noopener noreferrer"&gt;Basics | vcluster docs | Virtual Clusters for&lt;br&gt;
Kubernetes&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Customization
&lt;/h2&gt;

&lt;p&gt;vcluster allows extensive customization for individual teams' needs - different Kubernetes versions, network policies, ingress rules, and resource quotas can be defined. Developers can have permission to modify their vcluster without impacting others.&lt;/p&gt;
&lt;h2&gt;
  
  
  Multitenancy
&lt;/h2&gt;

&lt;p&gt;Organizations often need to provide Kubernetes access to multiple internal teams or external customers. vcluster makes multi-tenancy easy to implement by creating separate isolated environments in the same physical cluster. Refer to this article for more information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frafay.co%2Fwp-content%2Fuploads%2F2023%2F03%2F1674585426944-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frafay.co%2Fwp-content%2Fuploads%2F2023%2F03%2F1674585426944-1.png" alt="vcluster architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://rafay.co/the-kubernetes-current/key-considerations-when-implementing-virtual-kubernetes-clusters/" rel="noopener noreferrer"&gt;Implementing Virtual Kubernetes Clusters | Rafay&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Easy Scaling
&lt;/h2&gt;

&lt;p&gt;Additional vcluster can be quickly spun up or down to handle dynamic workloads and scale requirements. New development and testing environments can be provisioned instantly without having to scale the entire physical cluster.&lt;/p&gt;
&lt;h1&gt;
  
  
  Workload Isolation Approaches Before vcluster
&lt;/h1&gt;

&lt;p&gt;Organizations have leveraged various Kubernetes native features to enable some workload isolation before virtual clusters emerged as a solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespaces&lt;/strong&gt; - Namespaces segregate cluster resources between different teams or applications. They provide basic isolation via resource quotas and network policies. However, there is no hypervisor-level isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Policies&lt;/strong&gt; - Granular network policies restrict communication between pods and namespaces. This creates network segmentation between workloads. However, resource contention can still occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Taints and Tolerations&lt;/strong&gt; - Applying taints to nodes prevents specified pods from scheduling onto them. Pods must have matching tolerations to be scheduled on tainted nodes. This enables restricting pods to certain nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Virtual Networks&lt;/strong&gt; - On public clouds, using multiple virtual networks helps isolate Kubernetes cluster traffic. But pods within a cluster can still communicate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-Party Network Plugins&lt;/strong&gt; - CNI plugins like Calico, Weave, and Cilium enable building overlay networks and fine-grained network policies to segregate traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Controllers&lt;/strong&gt; - Developing custom Kubernetes controllers allows programmatically isolating resources. But this requires significant programming expertise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Demo of vcluster
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Install vcluster CLI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl (check via kubectl version)&lt;/li&gt;
&lt;li&gt;helm v3 (check with helm version)&lt;/li&gt;
&lt;li&gt;a working kube-context with access to a Kubernetes cluster (check with kubectl get namespaces)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the following command to download the vcluster CLI binary for arm64-based Ubuntu machines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; vcluster &lt;span class="s2"&gt;"https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-arm64"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 0755 vcluster /usr/local/bin &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To confirm that vcluster CLI is successfully installed, test via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For installations on other machines, please refer to the following link.&lt;br&gt;
&lt;a href="https://www.vcluster.com/docs/getting-started/setup" rel="noopener noreferrer"&gt;Install vcluster CLI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy vcluster
&lt;/h2&gt;

&lt;p&gt;Let's create a virtual cluster my-first-vcluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster create my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connect to the vcluster
&lt;/h2&gt;

&lt;p&gt;To connect to the vcluster, enter the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster connect my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the kubectl command to get the namespaces in the connected vcluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy an application to the vcluster
&lt;/h2&gt;

&lt;p&gt;Now let's deploy a sample nginx deployment inside the vcluster. To create a deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace demo-nginx
kubectl create deployment nginx-deployment &lt;span class="nt"&gt;-n&lt;/span&gt; demo-nginx &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will isolate the application in a namespace demo-nginx inside the vcluster.&lt;/p&gt;

&lt;p&gt;You can check that this demo deployment will create pods inside the vcluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; demo-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check deployments from the host cluster
&lt;/h2&gt;

&lt;p&gt;Now that we have confirmed the deployments in the vcluster, let us now try to check the deployments from the host cluster.&lt;/p&gt;

&lt;p&gt;To disconnect from the vcluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster disconnect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will move the kube context back to the host cluster. Now let us check if there are any deployments available in the host cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get deployments &lt;span class="nt"&gt;-n&lt;/span&gt; vcluster-my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There will be no deployments found in the &lt;em&gt;vcluster-my-first-vcluster&lt;/em&gt; namespace. This is because the deployment object exists only inside the vcluster's own API server and is not visible from the host cluster.&lt;/p&gt;

&lt;p&gt;Now let us check whether any pods are running in the vcluster's host namespace using the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; vcluster-my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We can now see that the nginx pod is running in the &lt;em&gt;vcluster-my-first-vcluster&lt;/em&gt; namespace. While higher-level objects such as deployments stay inside the virtual cluster, the pods themselves are synced to the host cluster so its scheduler can place them on the shared worker nodes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Vcluster Use Cases
&lt;/h1&gt;

&lt;p&gt;Virtual clusters enable several important use cases by providing isolated and customizable Kubernetes environments within a single physical cluster. Let's explore some of these in more detail:&lt;/p&gt;

&lt;h2&gt;
  
  
  Development and Testing Environments
&lt;/h2&gt;

&lt;p&gt;Allocating dedicated virtual clusters for developer teams allows them to fully control the configuration without affecting production workloads or other developers.&lt;br&gt;
Teams can customize their vclusters with required Kubernetes versions, network policies, resource quotas, and access controls. Development teams can rapidly spin up and tear down vclusters to test different configurations.&lt;br&gt;
Since vclusters provide guaranteed compute and storage resources, developers don't have to compete. They also won't impact the performance of applications running in other vclusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Application Isolation
&lt;/h2&gt;

&lt;p&gt;Enterprise applications like ERP, CRM, and financial systems require predictable performance, high availability, and strict security. Dedicated vclusters allow these production workloads to operate unaffected by other applications.&lt;br&gt;
Mission-critical applications can be allocated reserved capacity to avoid resource contention. Custom network policies guarantee isolation. Vclusters also allow granular role-based access control to meet regulatory compliance needs.&lt;br&gt;
Rather than overprovisioning large clusters to avoid interference, vclusters provide guaranteed resources at a lower cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multitenancy
&lt;/h2&gt;

&lt;p&gt;Service providers and enterprises with multiple business units often need to securely provide Kubernetes access to different internal teams or external customers.&lt;br&gt;
vclusters simplify multi-tenancy by creating separate self-service environments for each tenant with appropriate resource limits and access policies applied. Providers can easily onboard new customers by spinning up additional vclusters.&lt;br&gt;
This removes noisy neighbor issues and allows a high density of workloads by packing vclusters according to actual usage rather than peak needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regulatory Compliance
&lt;/h2&gt;

&lt;p&gt;Heavily regulated industries like finance and healthcare have strict security and compliance requirements around data privacy, geography, and access controls.&lt;br&gt;
Dedicated vclusters with internal network segmentation, role-based access control, and resource isolation make it easier to host compliant workloads safely alongside other applications in the same cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporary Resources
&lt;/h2&gt;

&lt;p&gt;vclusters allow instantly spinning up temporary Kubernetes environments to handle use cases like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testing cluster upgrades&lt;/strong&gt; - New Kubernetes versions can be deployed to lower environments with no downtime or impact on production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluating new applications&lt;/strong&gt; - Applications can be deployed into disposable vclusters instead of shared dev clusters to prevent conflicts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity spikes&lt;/strong&gt; - New vclusters provide burst capacity for traffic spikes versus overprovisioning the entire cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special events&lt;/strong&gt; - vClusters can be created temporarily for workshops, conferences, and other events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the need is over, these vclusters can simply be deleted with no lasting footprint on the cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workload Consolidation
&lt;/h2&gt;

&lt;p&gt;As organizations scale their Kubernetes footprint, there is a need to consolidate multiple clusters onto shared infrastructure without interfering with existing applications.&lt;br&gt;
Migrating applications into vclusters provides logical isolation and customization allowing them to run seamlessly alongside other workloads. This improves utilization and reduces operational overhead.&lt;br&gt;
vclusters allow enterprise IT to provide a consistent Kubernetes platform across the organization while preserving isolation.&lt;br&gt;
In summary, vclusters are an essential tool for optimizing Kubernetes environments via workload isolation, customization, security, and density. The use cases highlight how they benefit diverse needs from developers to Ops to business units within an organization.&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenges with vclusters
&lt;/h1&gt;

&lt;p&gt;While vclusters deliver significant benefits, there are some downsides to weigh:&lt;/p&gt;

&lt;h2&gt;
  
  
  Complexity
&lt;/h2&gt;

&lt;p&gt;Managing multiple virtual clusters, albeit smaller ones, introduces more operational overhead compared to a single large Kubernetes cluster.&lt;br&gt;
Additional tasks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provisioning and configuring multiple control planes&lt;/li&gt;
&lt;li&gt;Applying security policies and access controls consistently across vclusters&lt;/li&gt;
&lt;li&gt;Monitoring and logging across vclusters&lt;/li&gt;
&lt;li&gt;Maintaining designated resources and capacity for each vcluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a cluster administrator has to configure and update RBAC policies across 20 vclusters rather than a single cluster. This takes more effort than the centralized management of a single cluster. Statically assigned IP addresses and ports can also conflict across vclusters if they are not managed carefully.&lt;/p&gt;
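&lt;p&gt;As a concrete illustration, a read-only role like the one below (names hypothetical) would have to be created and kept in sync inside every vcluster:&lt;/p&gt;

```yaml
# Hypothetical read-only role that must be replicated per vcluster
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-viewer
  namespace: default
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-viewer-binding
  namespace: default
subjects:
  - kind: Group
    name: dev-team              # assumed group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-viewer
  apiGroup: rbac.authorization.k8s.io
```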

&lt;h2&gt;
  
  
  Resource allocation and management
&lt;/h2&gt;

&lt;p&gt;Balancing the resource consumption and performance of vclusters can be tricky, as they may have different demands or expectations.&lt;/p&gt;

&lt;p&gt;For example, vclusters may need to scale up or down depending on the workload or share resources with other vclusters or namespaces. A vcluster sized for an application's peak demand may have excess unused capacity during non-peak periods that sits idle and cannot be leveraged by other vclusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limited Customization
&lt;/h2&gt;

&lt;p&gt;The ability to customize vclusters varies across implementations. Namespaces offer the least flexibility, while Cluster API provides the most. Tools like OpenShift balance customization with simplicity.&lt;br&gt;
For example, namespaces cannot run different Kubernetes versions or network plugins. The Cluster API allows full customization but with more complexity.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Vcluster empowers Kubernetes users to customize, isolate and scale workloads within a shared physical cluster. By allocating dedicated control plane resources and access policies, vclusters provide strong technical isolation. For use cases like multitenancy, vclusters deliver simplified and more secure Kubernetes management.&lt;/p&gt;

&lt;p&gt;Vclusters can also reduce Kubernetes cost overhead and power ephemeral environments.&lt;br&gt;
Tools like OpenShift, Rancher, and Kubernetes Cluster API make deploying and managing vclusters much easier. As adoption increases, we can expect more innovations in the vcluster space to further simplify operations and maximize utilization. While vclusters have some drawbacks, for many organizations the benefits outweigh the added complexity.&lt;/p&gt;

&lt;p&gt;We are working on some exciting projects using vcluster to build a large-scale system. Feel free to &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; to discuss how to use vcluster for your use case.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>vcluster</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Deploy LLM on Kubernetes using OpenLLM</title>
      <dc:creator>Pavan Shiraguppi</dc:creator>
      <pubDate>Wed, 16 Aug 2023 06:32:17 +0000</pubDate>
      <link>https://forem.com/cloudraft/deploy-llms-on-kubernetes-using-openllm-3g9c</link>
      <guid>https://forem.com/cloudraft/deploy-llms-on-kubernetes-using-openllm-3g9c</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Natural Language Processing (NLP) has evolved significantly, with Large Language Models (LLMs) at the forefront of cutting-edge applications. Their ability to understand and generate human-like text has revolutionized various industries. Deploying and testing these LLMs effectively is crucial for harnessing their capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bentoml/OpenLLM" rel="noopener noreferrer"&gt;OpenLLM&lt;/a&gt; is an open-source platform for operating large language models (LLMs) in production. It allows you to run inference on any open-source LLMs, fine-tune them, deploy, and build powerful AI apps with ease.&lt;/p&gt;

&lt;p&gt;This blog post explores the deployment of LLM models using the OpenLLM framework on a Kubernetes infrastructure. For the demo, I am using a hardware setup consisting of an RTX 3060 GPU and an Intel i7 12700K processor, and we delve into the technical aspects of achieving optimal performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Environment Setup and Kubernetes Configuration
&lt;/h1&gt;

&lt;p&gt;Before diving into LLM deployment on Kubernetes, we need to ensure the environment is set up correctly and the Kubernetes cluster is ready for action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the Kubernetes Cluster
&lt;/h2&gt;

&lt;p&gt;Setting up a Kubernetes cluster requires defining worker nodes, networking, and orchestrators. Ensure you have Kubernetes installed and a cluster configured. This can be achieved through tools like &lt;code&gt;kubeadm&lt;/code&gt;, &lt;code&gt;minikube&lt;/code&gt;, kind or managed services such as Google Kubernetes Engine (GKE) and Amazon EKS.&lt;/p&gt;

&lt;p&gt;If you are using kind, you can create a cluster as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind create cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing Dependencies and Resources
&lt;/h2&gt;

&lt;p&gt;Within the cluster, install essential dependencies such as NVIDIA GPU drivers, CUDA libraries, and Kubernetes GPU support. These components are crucial for enabling GPU acceleration and maximizing LLM performance.&lt;/p&gt;

&lt;p&gt;To use CUDA on your system, you will need the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A CUDA-capable GPU&lt;/li&gt;
&lt;li&gt;A supported version of Linux with a gcc compiler and toolchain&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/cuda-downloads" rel="noopener noreferrer"&gt;CUDA Toolkit 12.2 at NVIDIA Developer portal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Using OpenLLM to Containerize and Load Models
&lt;/h1&gt;

&lt;h2&gt;
  
  
  OpenLLM
&lt;/h2&gt;

&lt;p&gt;OpenLLM supports a wide range of state-of-the-art LLMs, including Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. It also provides flexible APIs that allow you to serve LLMs over a RESTful API or gRPC with one command, or query them via the web UI, the CLI, the built-in Python/JavaScript clients, or any HTTP client.&lt;/p&gt;

&lt;p&gt;Some of the key features of OpenLLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for a wide range of state-of-the-art LLMs&lt;/li&gt;
&lt;li&gt;Flexible APIs for serving LLMs&lt;/li&gt;
&lt;li&gt;Integration with other powerful tools&lt;/li&gt;
&lt;li&gt;Easy to use&lt;/li&gt;
&lt;li&gt;Open-source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To use OpenLLM, you need to have Python 3.8 (or newer) and &lt;code&gt;pip&lt;/code&gt; installed on your system. We highly recommend using a Virtual Environment (like conda) to prevent package conflicts.&lt;/p&gt;

&lt;p&gt;You can install OpenLLM using pip as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify if it's installed correctly, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start an LLM server, for example, to start an Open Pre-trained transformer model aka &lt;a href="https://huggingface.co/docs/transformers/model_doc/opt" rel="noopener noreferrer"&gt;OPT&lt;/a&gt; server, do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm start opt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Selecting the LLM Model
&lt;/h2&gt;

&lt;p&gt;The OpenLLM framework supports various pre-trained LLM models like GPT-3, GPT-2, and BERT. When selecting a large language model (LLM) for your application, the main factors to consider are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt; - Larger models like GPT-3 have more parameters and can handle more complex tasks, while smaller ones like GPT-2 are better for simpler use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; - Models optimized for generative AI like GPT-3 or understanding (e.g. BERT) align with different use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data&lt;/strong&gt; - More high-quality, diverse data leads to better generalization capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning&lt;/strong&gt; - Pre-trained models can be further trained on domain-specific data to improve performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment with use case&lt;/strong&gt; - Validate potential models on your specific application and data to ensure the right balance of complexity and capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal LLM matches your needs in terms of complexity, data requirements, compute resources, and overall capability. Thoroughly evaluate options to select the best fit. For this demo, we will be using the Dolly-2 model with 3B parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading the Chosen Model within a Container
&lt;/h2&gt;

&lt;p&gt;Containerization enhances reproducibility and portability. Package your LLM model, OpenLLM dependencies, and other relevant libraries within a Docker container. This ensures a consistent runtime environment across different deployments.&lt;/p&gt;

&lt;p&gt;With OpenLLM, you can easily build a Bento for a specific model, like &lt;code&gt;dolly-v2-3b&lt;/code&gt;, using the &lt;code&gt;build&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm build dolly-v2 &lt;span class="nt"&gt;--model-id&lt;/span&gt; databricks/dolly-v2-3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this demo, we are using BentoML, an MLOps platform from the organization behind the OpenLLM project. A &lt;a href="https://docs.bentoml.com/en/latest/concepts/bento.html#what-is-a-bento" rel="noopener noreferrer"&gt;Bento&lt;/a&gt;, in BentoML, is the unit of distribution. It packages your program's source code, models, files, artifacts, and dependencies.&lt;/p&gt;

&lt;p&gt;To containerize your Bento, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bentoml containerize &amp;lt;name:version&amp;gt; &lt;span class="nt"&gt;-t&lt;/span&gt; dolly-v2-3b:latest &lt;span class="nt"&gt;--opt&lt;/span&gt; &lt;span class="nv"&gt;progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates an OCI-compliant Docker image that can be deployed anywhere Docker runs.&lt;/p&gt;

&lt;p&gt;You will be able to locate the Docker image in &lt;code&gt;$BENTO_HOME/bentos/stabilityai-stablelm-tuned-alpha-3b-service/$id/env/docker&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Model Inference and High Scalability using Kubernetes
&lt;/h1&gt;

&lt;p&gt;Executing model inference efficiently and scaling up when needed are key factors in a Kubernetes-based LLM deployment. The reliability and scalability features of Kubernetes help scale the model efficiently for the production use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running LLM Model Inference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pod Communication&lt;/strong&gt;: Set up communication protocols within pods to manage model input and output. This can involve RESTful APIs or gRPC-based communication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenLLM runs an HTTP server on port 3000 by default. We can have a deployment file as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-3b:latest&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;: We assume the image is available locally as &lt;code&gt;dolly-v2-3b:latest&lt;/code&gt;. If the image has been pushed to a registry, remove the &lt;code&gt;imagePullPolicy&lt;/code&gt; line and, for a private registry, provide the registry credentials as an image pull secret.&lt;/p&gt;
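&lt;p&gt;Also note that the deployment above does not request a GPU. If the NVIDIA device plugin is installed in the cluster, the container spec can be extended with a GPU limit; the fragment below is a sketch under that assumption:&lt;/p&gt;

```yaml
# Fragment of the container spec: request one GPU
# (assumes the NVIDIA device plugin is installed in the cluster)
resources:
  limits:
    nvidia.com/gpu: 1   # schedules the pod onto a GPU node
```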

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Service&lt;/strong&gt;: Expose the deployment using services to distribute incoming inference requests evenly among multiple pods.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We set up a &lt;code&gt;LoadBalancer&lt;/code&gt;-type Service in our Kubernetes cluster, exposed on port 80. If you are using an Ingress, use &lt;code&gt;ClusterIP&lt;/code&gt; instead of &lt;code&gt;LoadBalancer&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Horizontal Scaling and Autoscaling&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling (HPA)&lt;/strong&gt;: Configure HPAs to automatically adjust the number of pods based on CPU or custom metrics. This ensures optimal resource utilization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can declare an HPA manifest for CPU-based scaling as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;targetCPUUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For GPU-based scaling, first gather GPU metrics in Kubernetes by following this blog to install the DCGM exporter: &lt;a href="https://iamajayr.medium.com/kubernetes-hpa-using-gpu-metrics-e366ddbfedb7" rel="noopener noreferrer"&gt;Kubernetes HPA using GPU metrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After installing the DCGM exporter, we can create an HPA for GPU memory utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Object&lt;/span&gt;
      &lt;span class="na"&gt;object&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt; &lt;span class="c1"&gt;# kubectl get svc | grep dcgm&lt;/span&gt;
        &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DCGM_FI_DEV_MEM_COPY_UTIL&lt;/span&gt;
        &lt;span class="na"&gt;targetValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
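&lt;p&gt;Note that the &lt;code&gt;autoscaling/v2beta1&lt;/code&gt; API used above was removed in Kubernetes 1.25. On newer clusters, roughly the same HPA can be expressed with &lt;code&gt;autoscaling/v2&lt;/code&gt;; this is a sketch using the same service and metric names:&lt;/p&gt;

```yaml
# Equivalent GPU-memory HPA using the stable autoscaling/v2 API (Kubernetes 1.23+)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dolly-v2-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dolly-v2-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: v1
          kind: Service
          name: dolly-v2-deployment   # kubectl get svc | grep dcgm
        metric:
          name: DCGM_FI_DEV_MEM_COPY_UTIL
        target:
          type: Value
          value: "80"
```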



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaling&lt;/strong&gt;: Enable cluster-level autoscaling to manage resource availability across multiple nodes, accommodating varying workloads. Here are the key steps to configure cluster autoscaling in Kubernetes:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Install the Cluster Autoscaler plugin:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes/autoscaler/releases/download/v1.20.0/cluster-autoscaler-component.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Configure auto scaling by setting min/max nodes in your cluster config.&lt;/li&gt;
&lt;li&gt;Annotate node groups you want to scale automatically:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl annotate node POOL_NAME cluster-autoscaler.kubernetes.io/safe-to-evict&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deploy an autoscaling-enabled application, such as an HPA-backed deployment. The autoscaler scales the node pool when pods are unschedulable.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure auto scaling parameters as needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjust scale-up/down delays with &lt;code&gt;--scale-down-delay&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set scale-down unneeded time with &lt;code&gt;--scale-down-unneeded-time&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Limit scale speed with &lt;code&gt;--max-node-provision-time&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Monitor your cluster autoscaling events:&lt;br&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events | &lt;span class="nb"&gt;grep &lt;/span&gt;ClusterAutoscaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Performance Analysis of LLMs in a Kubernetes Environment
&lt;/h1&gt;

&lt;p&gt;Evaluating the performance of LLM deployment within a Kubernetes environment involves latency measurement and resource utilization assessment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency Evaluation
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measuring Latency&lt;/strong&gt;: Use tools like &lt;code&gt;kubectl exec&lt;/code&gt; or custom scripts to measure the time it takes for a pod to process an input prompt and generate a response. Refer to the Python script below to measure latency on the GPU.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A Python program to measure latency and tokens per second:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;

&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;databricks/dolly-v2-3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sample text for benchmarking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enable_timing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enable_timing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Start timer
&lt;/span&gt;    &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Model inference
&lt;/span&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;
    &lt;span class="c1"&gt;# End timer
&lt;/span&gt;    &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Sync and get time
&lt;/span&gt;    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elapsed_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate TPS
&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;tps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Calculate latency
&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="c1"&gt;# in ms
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg TPS: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tps&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Comparing Latency using Aviary&lt;/strong&gt;: &lt;a href="https://aviary.anyscale.com/" rel="noopener noreferrer"&gt;Aviary&lt;/a&gt; lets you compare latency and throughput across different LLMs and serving backends. It is easy to use, which makes it a practical choice both for developers getting started with LLMs and for those tuning the performance and scalability of LLM-based applications.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resource Utilization and Scalability Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Resource Consumption&lt;/strong&gt;: Utilize Kubernetes dashboard or monitoring tools like Prometheus and Grafana to observe resource usage patterns across pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Analysis&lt;/strong&gt;: Analyze how Kubernetes dynamically adjusts resources based on demand, ensuring resource efficiency and application responsiveness.&lt;/li&gt;
&lt;/ol&gt;
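&lt;p&gt;As a rough sketch of the first point, resource usage can also be pulled programmatically from Prometheus' HTTP API rather than only viewed in Grafana. The server URL and PromQL expression below are illustrative assumptions; adjust them for your cluster, namespace, and metric names.&lt;/p&gt;

```python
# Hypothetical sketch: building an instant-query URL for Prometheus' HTTP API.
# The base URL and PromQL below are assumptions, not values from this article.
from urllib.parse import urlencode

def build_prometheus_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urlencode({"query": promql})

# Example PromQL: CPU usage per pod in an assumed 'llm' namespace over 5 minutes.
cpu_by_pod = 'sum(rate(container_cpu_usage_seconds_total{namespace="llm"}[5m])) by (pod)'
url = build_prometheus_query_url("http://prometheus.monitoring:9090", cpu_by_pod)
```

&lt;p&gt;The returned URL can then be fetched with &lt;code&gt;urllib.request.urlopen&lt;/code&gt; and the JSON result inspected, graphed, or alerted on.&lt;/p&gt;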

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This in-depth technical analysis demonstrates the value of running LLM deployments on Kubernetes. By combining GPU acceleration, specialized libraries, and Kubernetes orchestration capabilities, LLMs can be deployed at scale with significantly improved performance. In particular, GPU-enabled pods achieved over 2x lower latency and nearly double the inference throughput compared to CPU-only variants. Kubernetes autoscaling also allowed pods to scale horizontally on demand, so query volumes could grow without compromising responsiveness.&lt;/p&gt;

&lt;p&gt;Overall, these results make a strong case for Kubernetes as a platform for deploying LLMs at scale. The synergy between software and hardware optimization on Kubernetes unlocks the potential of LLMs for real-world NLP use cases.&lt;/p&gt;

&lt;p&gt;If you are looking for help running LLMs on Kubernetes, we would love to hear how you are scaling them. Please &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; to discuss your specific problem statement.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>Running Containers in Azure</title>
      <dc:creator>Asmi-KR</dc:creator>
      <pubDate>Tue, 11 Jul 2023 07:44:04 +0000</pubDate>
      <link>https://forem.com/cloudraft/running-containers-in-azure-30ag</link>
      <guid>https://forem.com/cloudraft/running-containers-in-azure-30ag</guid>
      <description>&lt;p&gt;Microservices are an architectural and organizational approach to software development in which software is composed of small, independent services that communicate over well-defined APIs. It is difficult to talk about microservices without talking about containers: these services are typically containerized and run on a container platform such as Docker.&lt;/p&gt;

&lt;p&gt;Before exploring the various services provided by Microsoft Azure, let’s quickly review containers. A container image packages software and its dependencies into an immutable artifact, and each change to the image forms a new layer. Containerization helps developers build and deploy applications faster and more securely.&lt;/p&gt;

&lt;p&gt;Microsoft Azure provides various services to run the containers in their cloud computing platform. In this article, you will learn about some of the services available in Azure for container deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Container Instances (ACI)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpzdf1dbf6e5j6nftah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpzdf1dbf6e5j6nftah.png" alt="Image description" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Container Instances (ACI) is a service for running containers in an isolated environment without worrying about orchestration. Typical use cases include data processing, event-driven applications, short-lived batch jobs, development or test environments, and running containers for immediate use with minimal effort.&lt;/p&gt;

&lt;p&gt;ACI provides excellent flexibility, allowing you to deploy individual containers or multi-container groups. It can be considered a low-level "building block" compared to Container Apps; advanced features like autoscaling, load balancing, and automatic certificates are not provided by ACI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Kubernetes Service (AKS)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0rcbqfgpvj3h17ulliv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0rcbqfgpvj3h17ulliv.png" alt="Image description" width="640" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Kubernetes Service (AKS) is a fully managed container orchestration service built on the popular Kubernetes platform. AKS provides a managed control plane, reducing the operational overhead of running a Kubernetes cluster, and it simplifies the deployment, management, and scaling of containerized applications. It offers automated scaling, load balancing, and self-healing, along with all the features of upstream Kubernetes. AKS is suitable for complex, production-grade applications that require high availability, scalability, and control. Additionally, AKS integrates natively with other Azure services, such as Azure DevOps, Azure Active Directory authentication, and Azure Monitor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Service Fabric
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bWmtYpWN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/service-fabric/media/service-fabric-cloud-services-migration-differences/topology-service-fabric.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bWmtYpWN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/service-fabric/media/service-fabric-cloud-services-migration-differences/topology-service-fabric.png" alt="Service fabric topology" width="762" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Service Fabric is Microsoft's distributed Platform-as-a-Service (PaaS) used to build and deploy microservices-based cloud applications. It supports both containerized and non-containerized workloads. With Service Fabric, you can deploy containers to a managed cluster and take advantage of its robust scalability, high availability, and automatic scaling features.&lt;/p&gt;

&lt;p&gt;Service Fabric addresses significant challenges in developing and managing cloud applications. It can deploy Docker and Windows Server containers, and it also supports arbitrary executables and direct, code-level integrations as stateful services that run alongside containerized services. Within a Service Fabric application, users can integrate with Azure Pipelines, Azure DevOps Services, Azure Monitor, and Azure Key Vault. Service Fabric is a good choice for applications with complex inter-service communication and stateful requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Functions with Containers
&lt;/h2&gt;

&lt;p&gt;Azure Functions allows you to run serverless functions deployed as containers. It combines the flexibility of containers with the event-driven, pay-per-execution model of Azure Functions. With this option, you can build and deploy serverless applications packaged as containers.&lt;/p&gt;

&lt;p&gt;Azure Functions with containers offers seamless integration with other Azure services, event sources, and triggers. It is suitable for scenarios where you want to leverage serverless capabilities while maintaining the control and portability of containerized applications. When you create a Functions project using Azure Functions Core Tools with the &lt;code&gt;--docker&lt;/code&gt; option, Core Tools also generates a Dockerfile that builds your container from the correct base image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure App Service
&lt;/h2&gt;

&lt;p&gt;App Service is a Platform as a Service (PaaS) offering from Microsoft. Typically it is used to host HTTP-based web applications, REST APIs, and backend services for mobile applications. You can write these applications in your favorite language, be it .NET, .NET Core, Java, Ruby, Node.js, PHP, or Python. It includes automatic scaling, continuous deployment, and built-in support for popular programming languages and frameworks.&lt;/p&gt;

&lt;p&gt;It enables developers to focus on creating outstanding applications rather than on infrastructure administration. App Service also offers a comprehensive range of tools for developing, deploying, and monitoring apps, as well as integration with Azure DevOps and other popular DevOps tools. It provides a simple, intuitive deployment experience and is well suited to lightweight, web-focused containerized applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to decide which one is a good fit?
&lt;/h2&gt;

&lt;p&gt;The Azure team has shared this decision tree that can be helpful to identify which service is right for your use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0_mnh0l8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/images/compute-choices.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0_mnh0l8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/images/compute-choices.png" title="Compute choices in Azure" alt="Decision tree" width="800" height="871"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Credits: Microsoft documentation&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Comparison
&lt;/h2&gt;

&lt;p&gt;The table below sums up the use cases and benefits of each service.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Azure Services&lt;/th&gt;
&lt;th&gt;Use cases&lt;/th&gt;
&lt;th&gt;Benefits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Container Instances (ACI)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Running containers instantly, batch jobs&lt;/td&gt;
&lt;td&gt;No infrastructure management, quick and easy deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Kubernetes Service (AKS)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High availability and scalability, critical services&lt;/td&gt;
&lt;td&gt;Fully managed orchestration, higher reliability, &lt;br&gt; more control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Service Fabric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Microservices-based applications with complex inter-service communication&lt;/td&gt;
&lt;td&gt;Robust scalability, high availability, and built-in support for state management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Functions with Containers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless and event-driven applications&lt;/td&gt;
&lt;td&gt;Serverless execution, seamless integration with other services, low cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure App Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web applications, lightweight services&lt;/td&gt;
&lt;td&gt;Fully managed platform, automatic scaling and load balancing, seamless integration with other Azure services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Azure offers a comprehensive set of options for deploying containerized applications, catering to a wide range of scenarios and requirements. Whether you prefer serverless execution, container orchestration, a microservices architecture, or a combination of these approaches, Azure has a solution. By leveraging these services, you can take advantage of Azure's scalability, reliability, and integration capabilities to deploy and manage your containerized applications with ease. Choose the option that best aligns with your application's needs and start realizing the benefits of containerization in Azure.&lt;br&gt;
If you need additional help in making the right choice, please don't hesitate to &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  About the Guest Author
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/smita-aglave-b6b8b454/" rel="noopener noreferrer"&gt;Smita&lt;/a&gt;, previously an IT Trainer, dedicated numerous years to assisting individuals and organizations in gaining knowledge about diverse technologies and software development methodologies. Currently, her growing fascination lies in the realm of DevOps, prompting her to delve deeper into research within this field. Smita possesses a profound passion for writing and takes pleasure in disseminating the knowledge she acquires along her journey.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Secure Coding Best Practices</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 17 Jun 2023 13:19:12 +0000</pubDate>
      <link>https://forem.com/cloudraft/secure-coding-best-practices-2c62</link>
      <guid>https://forem.com/cloudraft/secure-coding-best-practices-2c62</guid>
      <description>&lt;p&gt;Every single day, an extensive array of fresh software vulnerabilities is unearthed by diligent security researchers and analysts. A considerable portion of these vulnerabilities emerges due to the absence of secure coding practices. Exploiting such vulnerabilities can have severe consequences, as they possess the potential to severely impair the financial or physical assets of a business, erode trust, or disrupt critical services.&lt;/p&gt;

&lt;p&gt;For organisations reliant on their software for their operations, it becomes imperative for software developers to embrace secure coding practices. Secure coding entails a collection of practices that software developers adopt to fortify their code against cyberattacks and vulnerabilities. By adhering to coding standards that embody best practices, developers can incorporate safeguards that minimise the risks posed by vulnerabilities in their code.&lt;/p&gt;

&lt;p&gt;In a world brimming with cyber threats, secure coding cannot be viewed as optional if a business intends to maintain its shield of protection.&lt;/p&gt;

&lt;p&gt;In this article, we will explore some anti-patterns and best practices that we can include in our workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns
&lt;/h2&gt;

&lt;p&gt;Now, let's briefly look at some common mistakes, or anti-patterns, that lead to insecure code. The following are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insufficient validation of input data or processing inputs without proper encoding or sanitisation.&lt;/li&gt;
&lt;li&gt;Constructing SQL queries by concatenating strings, making the code vulnerable to data leaks or injection attacks.&lt;/li&gt;
&lt;li&gt;Failure to implement robust authentication, such as storing credentials in plain text without proper hashing and encryption.&lt;/li&gt;
&lt;li&gt;Poor design of password recovery mechanisms and infrequent rotation of security keys.&lt;/li&gt;
&lt;li&gt;Software planning and design lacking strong authorisation schemes.&lt;/li&gt;
&lt;li&gt;Granting excessive privileges during development or troubleshooting.&lt;/li&gt;
&lt;li&gt;Exposing sensitive information in debug logging without appropriate redaction.&lt;/li&gt;
&lt;li&gt;Utilising third-party libraries from untrusted sources or neglecting security checks.&lt;/li&gt;
&lt;li&gt;Unsafe handling of memory pointers or allowing pointer access beyond system boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these common mistakes in mind, let's explore practices and tools that can guide developers towards secure coding practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure Coding Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Shift left in software development lifecycle
&lt;/h3&gt;

&lt;p&gt;Historically, the conventional practice involved assigning the software security team to conduct security testing towards the conclusion of a software development project. The team would assess the application and compile a list of issues that require resolution. At this stage, the identified fixes would be prioritised, resulting in some vulnerabilities being addressed while others remained unattended. The reasons for leaving certain vulnerabilities unresolved could range from cost constraints and limited resources to pressing business priorities.&lt;/p&gt;

&lt;p&gt;However, this traditional approach is no longer sustainable. Security considerations must now be incorporated right from the outset—the initial stages—of the software development lifecycle. Security should be taken into account during the design phase itself. Both manual and automated testing should be conducted throughout the application's implementation as part of the Continuous Integration (CI) pipeline, ensuring that developers receive prompt feedback.&lt;/p&gt;

&lt;p&gt;To aid in this endeavour, static code analysis becomes invaluable. This technique scans code for security flaws and risks even while developers are actively writing it within an integrated development environment (IDE). For instance, SAST (static application security testing) tools can analyse code for security vulnerabilities during development, facilitating early identification and mitigation of potential risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input validation
&lt;/h3&gt;

&lt;p&gt;Ensuring the integrity of input data as it enters a system holds great significance. It is essential to validate the syntactic and semantic accuracy of all incoming data, considering it as untrusted. Employing checks and regular expressions aids in verifying the correctness, size, and syntax of the input.&lt;/p&gt;

&lt;p&gt;Performing these validations on the server side is highly recommended. In the case of web applications, it involves scrutinising various components, including HTTP headers, cookies, GET and POST parameters, as well as file uploads.&lt;/p&gt;

&lt;p&gt;Client-side validation also proves beneficial, contributing to an enhanced user experience by reducing the need for multiple network requests resulting from invalid inputs. This approach minimises back-and-forth communication and enhances efficiency.&lt;/p&gt;
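&lt;p&gt;A minimal sketch of such server-side validation, using an illustrative username field; the actual rules depend on your data model.&lt;/p&gt;

```python
# Server-side input validation sketch; the field rules are illustrative assumptions.
import re

# Syntax check: allowed characters and size bounds in a single pattern.
USERNAME_RE = re.compile(r"^[A-Za-z0-9_]{3,30}$")

def validate_username(raw: str) -> str:
    """Treat input as untrusted: check type, size, and syntax before use."""
    if not isinstance(raw, str):
        raise ValueError("username must be a string")
    if not USERNAME_RE.fullmatch(raw):
        raise ValueError("username must be 3 to 30 letters, digits, or underscores")
    return raw

print(validate_username("alice_42"))  # prints alice_42
```

&lt;p&gt;The same check would run again on the client purely for user experience; the server-side check remains the one that counts.&lt;/p&gt;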

&lt;h3&gt;
  
  
  Parameterised queries
&lt;/h3&gt;

&lt;p&gt;During the process of storing and retrieving data, developers frequently engage with datastores. However, if they overlook the use of parameterised queries, attackers can exploit widely accessible tools and manipulate inputs to extract sensitive information. SQL injection, one of the most dangerous application risks, is a common form of such attacks.&lt;/p&gt;

&lt;p&gt;By incorporating placeholders for parameters within the query, the specified parameters are treated as data rather than being considered as part of the SQL command itself. To mitigate these vulnerabilities, it is recommended to employ prepared statements or object-relational mapping (ORM) techniques. These approaches offer effective measures to safeguard against SQL injection and related threats.&lt;/p&gt;
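&lt;p&gt;A small sketch with the standard library's &lt;code&gt;sqlite3&lt;/code&gt; driver (schema and data are illustrative) showing how a placeholder defuses a classic injection payload.&lt;/p&gt;

```python
# Parameterised-query sketch with the stdlib sqlite3 driver; schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "alice@example.com"))

# The placeholder treats user input as data, never as part of the SQL command,
# so a classic injection payload matches nothing instead of altering the query.
user_input = "alice' OR '1'='1"
rows = conn.execute("SELECT email FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)  # [] - the payload is just an unmatched string literal

safe = conn.execute("SELECT email FROM users WHERE name = ?", ("alice",)).fetchone()
print(safe[0])  # alice@example.com
```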

&lt;h3&gt;
  
  
  Encoding data
&lt;/h3&gt;

&lt;p&gt;Encoding data plays a vital role in mitigating threats by transforming potentially hazardous special characters into a sanitised form. Output encoding for the target context, such as HTML entity encoding, protects against cross-site scripting (XSS) and other client-side injection attacks, while encodings like Base64 provide a safe way to transport binary data.&lt;/p&gt;

&lt;p&gt;To enhance security, it is crucial to specify appropriate character sets, such as UTF-8, and encode data into a standardised character set before further processing. Additionally, employing canonicalisation techniques proves beneficial. For instance, simplifying characters to their basic form helps address issues such as double encoding and obfuscation attacks, thereby bolstering overall security measures.&lt;/p&gt;
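&lt;p&gt;A brief standard-library sketch of two of these ideas: canonicalising input to a single normal form before validation, and using Base64 as a safe transport encoding. For HTML output, &lt;code&gt;html.escape&lt;/code&gt; performs this kind of character encoding.&lt;/p&gt;

```python
# Canonicalisation and transport-encoding sketch using only the standard library.
import base64
import unicodedata

# Canonicalise to NFKC before validation, so visually equivalent Unicode
# sequences reduce to one basic form and cannot slip past character checks.
fancy = "\uFB01le"  # the 'fi' ligature followed by 'le'
print(unicodedata.normalize("NFKC", fancy))  # file

# Base64 turns arbitrary bytes into a safe ASCII alphabet for transport.
payload = "naïve input".encode("utf-8")
encoded = base64.b64encode(payload).decode("ascii")
print(base64.b64decode(encoded).decode("utf-8"))  # naïve input
```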

&lt;h3&gt;
  
  
  Implement identity and authentication controls
&lt;/h3&gt;

&lt;p&gt;To further enhance security and minimise the risk of breaches, secure coding practices emphasise the importance of verifying a user's identity at the outset and integrating robust authentication controls into the application's code.&lt;/p&gt;

&lt;p&gt;Here are some recommended measures to achieve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employ strong authentication methods, such as multi-factor authentication, to add an additional layer of security.&lt;/li&gt;
&lt;li&gt;Consider incorporating biometric authentication methods, such as fingerprint or facial recognition, especially in mobile applications.&lt;/li&gt;
&lt;li&gt;Ensure secure storage of passwords. Typically, this involves hashing each password with a strong, salted hashing function and storing only the resulting hash in the database.&lt;/li&gt;
&lt;li&gt;Implement a secure password recovery mechanism to facilitate password resets while maintaining security.&lt;/li&gt;
&lt;li&gt;Enable session timeouts and inactivity periods to automatically terminate idle sessions.&lt;/li&gt;
&lt;li&gt;For sensitive operations like modifying account information, enforce re-authentication to validate the user's identity.&lt;/li&gt;
&lt;li&gt;Conduct regular audits of authentication transactions to detect any suspicious activities and maintain a vigilant stance against potential threats.&lt;/li&gt;
&lt;/ul&gt;
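&lt;p&gt;The password-storage points above can be sketched with the standard library alone; the iteration count is an illustrative figure to tune against current guidance and your hardware.&lt;/p&gt;

```python
# Salted password-hashing sketch with the stdlib; parameters are illustrative.
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumption: tune to hardware and current guidance

def hash_password(password: str, salt=None):
    """Return (salt, hash) for storage; a fresh random salt per password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))  # False
```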

&lt;h3&gt;
  
  
  Implement access controls
&lt;/h3&gt;

&lt;p&gt;Incorporating a well-thought-out authorisation strategy during the initial stages of application development can greatly enhance the overall security posture. Authorisation entails determining the specific resources that an authenticated user can or cannot access.&lt;/p&gt;

&lt;p&gt;Consider the following guidelines to strengthen the authorisation framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish a sequential flow of authentication followed by authorisation. Implement a mechanism where all requests undergo access control checks.&lt;/li&gt;
&lt;li&gt;Adhere to the principle of least privilege, initially denying access to any resource that has not been explicitly configured for access control.&lt;/li&gt;
&lt;li&gt;Enforce time-based limitations on user or system component actions by implementing expiration times, thereby ensuring that actions have defined timeframes for execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these practices, developers can create a robust and effective authorisation system that bolsters the overall security of the application.&lt;/p&gt;
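&lt;p&gt;The deny-by-default principle above can be illustrated with a tiny rule table; the roles and resources here are hypothetical.&lt;/p&gt;

```python
# Deny-by-default authorisation sketch; roles and resources are illustrative.
# Only combinations explicitly listed in the rules are ever granted.
ACCESS_RULES = {
    ("admin", "billing"): True,
    ("analyst", "reports"): True,
}

def is_allowed(role: str, resource: str) -> bool:
    """Least privilege: anything not explicitly allowed is denied."""
    return ACCESS_RULES.get((role, resource), False)

print(is_allowed("admin", "billing"))    # True
print(is_allowed("analyst", "billing"))  # False - no rule, so denied
```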

&lt;h3&gt;
  
  
  Protect sensitive data
&lt;/h3&gt;

&lt;p&gt;In order to comply with legal and regulatory obligations, it is the responsibility of businesses to safeguard customer data. This sensitive data encompasses various categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personally identifiable information (PII)&lt;/li&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Health records&lt;/li&gt;
&lt;li&gt;Web browser data&lt;/li&gt;
&lt;li&gt;Mobile data, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent data leakage, it is crucial to employ robust encryption methods for both data at rest and data in transit. Consider the following practices to enhance data protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilise a well-established, peer-reviewed cryptographic library and functions that have been vetted and approved by your security team.&lt;/li&gt;
&lt;li&gt;Avoid storing encryption keys alongside the encrypted data to prevent unauthorised access.&lt;/li&gt;
&lt;li&gt;Refrain from storing confidential or sensitive data in memory, temporary locations, or log files during processing.&lt;/li&gt;
&lt;li&gt;Implement redaction techniques in log forwarders to remove sensitive information.&lt;/li&gt;
&lt;li&gt;Implement mandatory re-authentication when accessing sensitive data within the application.&lt;/li&gt;
&lt;/ul&gt;
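&lt;p&gt;A sketch of the log-redaction point above; the two patterns are illustrative only, and a production forwarder would need patterns tuned to the sensitive data you actually handle.&lt;/p&gt;

```python
# Log-redaction sketch; the patterns below are illustrative, not exhaustive.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),  # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),    # card-like digit runs
]

def redact(line: str) -> str:
    """Replace sensitive fragments before the line leaves the application."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line

print(redact("payment by alice@example.com with 4111 1111 1111 1111"))
# payment by [EMAIL] with [CARD]
```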

&lt;h3&gt;
  
  
  Implement logging and intrusion detection
&lt;/h3&gt;

&lt;p&gt;Even the most meticulously designed system can be susceptible to exploitation by attackers. Therefore, it is advisable to incorporate a monitoring system that can detect and identify unusual events. It is crucial to ensure that sufficient information is logged concerning authentication, authorisation, and resource access events. This logging should include details such as timestamps, the origin of access requests, IP addresses, and information pertaining to the requested resource. It is important to store this information in a secure and protected log. Typically, these logs are transmitted in real time to a centralised system where they are analysed for any anomalies. Prior to logging, apply encoding techniques to the untrusted data to safeguard against log injection attacks.&lt;/p&gt;
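&lt;p&gt;The encoding step can be as simple as neutralising line breaks in untrusted fields, so a single request can never forge additional log entries. A minimal sketch, with an assumed log format:&lt;/p&gt;

```python
# Sketch of neutralising untrusted data before it reaches a log line, so an
# attacker cannot forge entries by embedding newlines (log injection).
import datetime

def log_auth_event(user_supplied: str, outcome: str) -> str:
    # Encode CR/LF so one request only ever produces one log line.
    safe = user_supplied.replace("\r", "\\r").replace("\n", "\\n")
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return f"{stamp} auth outcome={outcome} user={safe!r}"

# A payload that tries to forge a successful-login entry stays on one line.
print(log_auth_event("bob\n2026-01-01T00:00:00 auth outcome=success user=admin", "failure"))
```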

&lt;p&gt;In the event of a security breach, it is essential to have a well-documented playbook in place to promptly terminate system access, mitigating the risk of further data leakage. By following these practices, organisations can enhance their ability to detect and respond to potential intrusions, minimising the impact of security incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage security frameworks and libraries
&lt;/h3&gt;

&lt;p&gt;Avoid unnecessary duplication of effort. Instead, leverage established security frameworks and libraries that have been proven effective. When incorporating such components into your project, ensure they are sourced from reliable and trusted third-party repositories. It is important to regularly assess these libraries for any vulnerabilities or weaknesses and proactively keep them up to date.&lt;/p&gt;

&lt;p&gt;By adopting this approach, you can benefit from the expertise and experience embedded in these established security solutions, saving valuable time and effort while maintaining a strong security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor error and exception handling
&lt;/h3&gt;

&lt;p&gt;In line with the logging best practices above, it is advisable to adopt a centralised approach for handling and monitoring errors and exceptions, with tools like Sentry. Effective management of errors and exceptions is crucial, as mishandling them can inadvertently expose valuable information to potential attackers, enabling them to gain insights into your application and platform design.&lt;/p&gt;

&lt;p&gt;Consider the following measures to strengthen error and exception handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid logging sensitive information within error messages to prevent inadvertent disclosure.&lt;/li&gt;
&lt;li&gt;Regularly conduct code reviews to identify and address any weaknesses or vulnerabilities in the error handling implementation.&lt;/li&gt;
&lt;li&gt;Utilise negative testing techniques, such as exploratory and penetration testing, fuzzing, and fault injection, to actively identify and rectify potential issues related to error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these practices, you can ensure that error and exception handling is performed securely and with minimal risk of exposing sensitive information to potential attackers.&lt;/p&gt;
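&lt;p&gt;A minimal sketch of these points: the full detail goes to internal logs with a correlation id, while the caller sees only a generic message. The handler and response shape are illustrative.&lt;/p&gt;

```python
# Exception-handling sketch: log detail internally, return a generic message.
import logging
import uuid

logger = logging.getLogger("app")

def handle_request(payload: dict) -> dict:
    try:
        amount = int(payload["amount"])
        return {"status": "ok", "amount": amount}
    except Exception:
        ref = uuid.uuid4().hex[:8]
        # The stack trace stays in internal logs, never in the response body;
        # the short reference id lets operators correlate report and log entry.
        logger.exception("request failed, ref=%s", ref)
        return {"status": "error", "message": f"Request failed (ref {ref})"}

print(handle_request({"amount": "19"}))
print(handle_request({}))  # generic message, no KeyError details leaked
```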

&lt;h2&gt;
  
  
  Benefits of implementing secure coding practices
&lt;/h2&gt;

&lt;p&gt;At this point, the advantages of embracing secure coding practices should be evident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorporating automated checks and code analysis during the development process enhances developer productivity by promptly providing feedback to improve code security. This leads to quicker time-to-market and higher-quality code.&lt;/li&gt;
&lt;li&gt;Cost optimisation within the software development lifecycle is achieved by minimising bugs at the early stages.&lt;/li&gt;
&lt;li&gt;Static application security testing (SAST) tools offer developers of all skill levels guardrails, AppSec governance, and valuable insights through IDE plugins. These tools equip developers with the necessary knowledge and resources to bolster application security.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Throughout our examination of coding flaws that can result in vulnerabilities, we have also explored best practices to enhance the security stance of software. However, in the context of large-scale projects, it can be daunting to implement these practices while ensuring proper governance.&lt;/p&gt;

&lt;p&gt;In the realm of extensive projects, the following considerations can help navigate these challenges effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish clear governance frameworks that outline security requirements, procedures, and responsibilities.&lt;/li&gt;
&lt;li&gt;Develop comprehensive guidelines and standards that align with secure coding practices and provide actionable steps for implementation.&lt;/li&gt;
&lt;li&gt;Foster collaboration and communication among development teams, security experts, and stakeholders to ensure a shared understanding of security goals and the necessary measures to achieve them.&lt;/li&gt;
&lt;li&gt;Prioritise the implementation of security measures by identifying high-risk areas and focusing resources accordingly.&lt;/li&gt;
&lt;li&gt;Regularly assess and review the security posture of the software throughout the development lifecycle, enabling continuous improvement and adjustments as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By adopting these approaches, the process of implementing secure coding practices within large projects becomes more manageable and ensures that proper governance is in place to safeguard against vulnerabilities effectively.&lt;/p&gt;

&lt;p&gt;It is advisable to create and automate workflows using SAST tools and integrate them into CI to enforce these best practices. Feel free to schedule a no-obligation &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;call with us&lt;/a&gt; to discuss your DevSecOps strategy; we can help you improve your current practice.&lt;/p&gt;

</description>
      <category>devsecops</category>
      <category>security</category>
      <category>consulting</category>
    </item>
  </channel>
</rss>
