<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CloudRaft</title>
    <description>The latest articles on Forem by CloudRaft (@cloudraft).</description>
    <link>https://forem.com/cloudraft</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7106%2F903d31b8-0b9d-470a-ae32-fc539e2ca218.png</url>
      <title>Forem: CloudRaft</title>
      <link>https://forem.com/cloudraft</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cloudraft"/>
    <language>en</language>
    <item>
      <title>Heroku to Kubernetes Migration: Clock is ticking</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 11 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</link>
      <guid>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</guid>
      <description>&lt;p&gt;For years, Heroku has been a beloved starting point for countless high-growth companies. It was revolutionary, making the deployment of an idea almost trivial. That focus on the developer experience—on simply pushing code and having it run—is why so many successful Minimum Viable Products (MVPs) and early-stage platforms were born there. It allowed engineering leadership to focus on product-market fit (PMF) instead of infrastructure.&lt;/p&gt;

&lt;p&gt;But a platform that simplifies everything also imposes limits, and for any company that has scaled past the initial bootstrap phase, those limits eventually hit two core metrics: &lt;strong&gt;control&lt;/strong&gt; and &lt;strong&gt;cost&lt;/strong&gt;. What starts as the fastest way to market often becomes a budget bottleneck and a strategic constraint.&lt;/p&gt;

&lt;p&gt;Today, with new structural changes at Heroku, the conversation about migration is no longer a matter of "if" or "when," but "now." For any business running a production-critical, profitable service, moving to Kubernetes is no longer just an optimization—it’s a necessary step to secure the next decade of growth and maintain technical sovereignty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Shift at Heroku
&lt;/h2&gt;

&lt;p&gt;On February 6, 2026, Heroku &lt;a href="https://www.heroku.com/blog/an-update-on-heroku/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; a significant strategic realignment. The platform is now transitioning to what they call a sustaining engineering model.&lt;/p&gt;

&lt;p&gt;What does that actually mean for you as a business? It means a shift in investment priority. Heroku remains a stable, production-ready environment, with continued focus on core areas like security, stability, reliability, and support. For existing credit card-paying customers, the day-to-day operations and services remain unchanged.&lt;/p&gt;

&lt;p&gt;The critical piece of news, however, is that Enterprise Account contracts will no longer be offered to new customers. While existing enterprise contracts will be honored, this decision sends a clear strategic signal: Salesforce, the parent of Heroku, is focusing its future engineering efforts elsewhere—specifically on helping organizations build and deploy enterprise-grade AI in a secure way, rather than focusing on the core, undifferentiated platform features that many growth companies rely on.&lt;/p&gt;

&lt;p&gt;In short, the platform you relied on for your MVP is telling you, quite clearly, that its main focus is changing. For a high-growth business, relying on a platform that has decided to stop innovating in your core area of need is an unacceptable risk. The decision to migrate has now moved from a "good idea" to a strategic imperative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Kubernetes a good choice?
&lt;/h2&gt;

&lt;p&gt;The cloud landscape has matured dramatically since Heroku first took center stage. While Heroku pioneered the developer-first experience, Kubernetes has become the industry standard, and the majority of companies already run it in production. For any company that has achieved PMF, Kubernetes offers benefits that directly address the pain points of a scaled Heroku implementation. You may ask: why not use alternative products like Portainer, Render, or Fly.io? You can, but the larger goal remains the same: gaining more control over your platform and your spending, and Kubernetes delivers that control directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reclaiming Sovereignty and Control
&lt;/h3&gt;

&lt;p&gt;With Heroku, you are a tenant in a strictly controlled environment. That simplicity is powerful, but it comes at the cost of ultimate control. Kubernetes flips that dynamic. It gives you the blueprint for your entire infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multicloud and Hybrid Strategy:&lt;/strong&gt; Kubernetes is a universal API for infrastructure. It provides the freedom to easily shift workloads between major cloud providers (AWS, GCP, Azure), deploy on-premise, or adopt a hybrid strategy. This ability to change providers is a powerful negotiating tool and a key piece of business continuity planning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Sales Enablement:&lt;/strong&gt; For B2B SaaS companies, especially those with AI-native features, enterprise customers often require strict data sovereignty. They need to self-host services on their own virtual private clouds or on-premise. Heroku's architecture simply cannot support this. A Kubernetes-based platform enables you to offer a self-deployed version of your SaaS product, unlocking massive new markets in highly regulated or security-conscious industries. The control Kubernetes offers over data residency and compliance is non-negotiable for selling to large enterprise customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Efficiency
&lt;/h3&gt;

&lt;p&gt;The Heroku pricing model is famously straightforward: it’s easy to calculate, but it is expensive as you scale. This is the trade-off for simplicity.&lt;/p&gt;

&lt;p&gt;By moving to Kubernetes, you gain fine-grained control over resource allocation. You can right-size your instances, consolidate workloads, and select the most cost-effective machine types for specific tasks. While the initial setup requires more attention, the long-term cost savings are significant, especially for services with unpredictable or high-volume usage.&lt;/p&gt;

&lt;p&gt;The ecosystem itself has worked to smooth out the initial complexity. Major cloud providers now offer "autopilot" in their managed Kubernetes services that handle much of the underlying operational overhead. This means you can gain the cost and control benefits of Kubernetes without the burden of building a huge platform engineering team.&lt;/p&gt;

&lt;p&gt;At CloudRaft, we recognize the need to simplify this process. We’ve built an accelerator called TurboRaft that is essentially a proven playbook for the modern Kubernetes platform. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps with ArgoCD:&lt;/strong&gt; For zero-touch, automated, and auditable releases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Secured secret management, automated certificate management, SAST, SBOMs and vulnerability management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Open-source monitoring with options to choose from and alerting to keep costs low while maintaining deep insight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Clear policies enforced for compliance and cost control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to deliver the "Heroku-like" ease of use for developers, but on a platform you own and control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maturation of the Kubernetes ecosystem
&lt;/h3&gt;

&lt;p&gt;A few years ago, managing Kubernetes was a job for seasoned experts. Today, the complexity angle has been largely mitigated by a robust and mature ecosystem. Open-source tooling, managed cloud services, and a deep community knowledge base have all contributed to making K8s a practical and reliable choice.&lt;/p&gt;

&lt;p&gt;The old argument that "Kubernetes is too complex" is mostly obsolete for a growing company. The market has solved the hardest parts. What’s left is a highly stable platform that provides the operational rigor required to run business-critical services. The Hacker News discussion &lt;a href="https://news.ycombinator.com/item?id=37379078" rel="noopener noreferrer"&gt;thread&lt;/a&gt; on the Heroku news highlights this exact sentiment, with many leaders realizing that the ecosystem is ready for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  A structured approach to migration
&lt;/h2&gt;

&lt;p&gt;No platform migration is easy; it’s a non-trivial engineering effort that must be planned as a business-critical project. Done correctly, it is an opportunity to not just move your app, but to make it stronger and more resilient for the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Assessment and Re-Architecture
&lt;/h3&gt;

&lt;p&gt;This is the most crucial phase. A migration should also be seen as a refactoring opportunity. If your application isn't strictly following cloud-native principles or the &lt;strong&gt;Twelve-Factor App&lt;/strong&gt; methodology, now is the time to correct it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk Identification:&lt;/strong&gt; We begin with a full risk assessment, examining each service in the application. We categorize them by current stability, coupling, and size to create a phased migration plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sizing and Cost Modeling:&lt;/strong&gt; Understanding the true resource needs of each service allows us to create accurate Kubernetes deployment specifications and a detailed cost projection for the new platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Simplifying the Developer Experience
&lt;/h3&gt;

&lt;p&gt;The biggest win of Heroku was the abstraction of infrastructure. We need to replicate that ease of use on Kubernetes. Developers should not need to become Kubernetes experts overnight.&lt;/p&gt;

&lt;p&gt;We convert services into Kubernetes deployments using Helm charts, then we abstract the low-level Kubernetes constructs. The goal is a simplified interface—whether it’s a basic YAML or JSON configuration—that lets developers manage their application settings without worrying about the underlying cluster management. This retains the core developer efficiency that made Heroku so appealing.&lt;/p&gt;
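
&lt;p&gt;As a minimal sketch (the field names and the &lt;code&gt;render_deployment&lt;/code&gt; helper are illustrative, not an actual TurboRaft interface), such an abstraction layer can be as thin as a function that expands a small app config into a full Kubernetes Deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical example: expand a Heroku-style app config into a
# Kubernetes Deployment manifest. All field names are illustrative.

def render_deployment(app):
    """Translate a simplified developer config into a Deployment spec."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app["name"]},
        "spec": {
            "replicas": app.get("replicas", 1),
            "selector": {"matchLabels": {"app": app["name"]}},
            "template": {
                "metadata": {"labels": {"app": app["name"]}},
                "spec": {
                    "containers": [{
                        "name": app["name"],
                        "image": app["image"],
                        "ports": [{"containerPort": app.get("port", 8080)}],
                    }]
                },
            },
        },
    }

# Developers edit only the small input config, never raw cluster YAML.
web = render_deployment({"name": "web", "image": "registry.example.com/web:1.4", "replicas": 3})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;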

&lt;h3&gt;
  
  
  Step 3: The Data Migration Challenge
&lt;/h3&gt;

&lt;p&gt;Applications are often the easy part; the database is where the real complexity lies. A successful migration requires a strategy for moving data with near-zero downtime.&lt;/p&gt;

&lt;p&gt;We strongly recommend self-hosted database solutions on Kubernetes, particularly CloudNativePG for PostgreSQL. Running your own highly-available, self-managed database on Kubernetes removes the premium cost of proprietary cloud-managed services while providing superior control over failover and disaster recovery. We’ve found CloudNativePG to be highly reliable, and we offer &lt;a href="http://www.cloudraft.io/postgresql-consulting" rel="noopener noreferrer"&gt;full consulting and support&lt;/a&gt; to ensure a smooth, near-zero-downtime data migration. Database upgrades and management were easy on Heroku; with CloudNativePG and our best practices, you can keep your database on autopilot as well.&lt;/p&gt;
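
&lt;p&gt;For reference, a minimal CloudNativePG &lt;code&gt;Cluster&lt;/code&gt; manifest is only a few lines (the name, instance count, and storage size below are placeholders to adjust for your workload):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3      # one primary plus two replicas with automated failover
  storage:
    size: 20Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;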

&lt;h2&gt;
  
  
  The time to act is now
&lt;/h2&gt;

&lt;p&gt;The shift at Heroku is a clear alarm bell. Ignoring it means accepting escalating costs and a growing strategic risk. You now have a proven, mature, and cost-effective alternative in Kubernetes.&lt;/p&gt;

&lt;p&gt;Success in this migration hinges on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Selecting a Proven Playbook:&lt;/strong&gt; You need a tested, end-to-end framework that accounts for application, database, and operational complexities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Right Team:&lt;/strong&gt; You need a partner who has navigated this journey before and can deliver the platform quickly, abstracting away the unnecessary complexity while leaving you with full control.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where &lt;strong&gt;CloudRaft&lt;/strong&gt; comes in. We offer not just the accelerator, but the &lt;a href="https://dev.to/kubernetes-consulting"&gt;consulting&lt;/a&gt; and operational support to execute the migration and hand over a platform that is ready for enterprise-level growth. Don't wait until the cost pressure or strategic uncertainty becomes a crisis—secure your future with a modern, controlled, and cost-efficient Kubernetes platform today.&lt;/p&gt;

</description>
      <category>heroku</category>
    </item>
    <item>
      <title>Context Graphs for AI Agents: The Complete Implementation Guide</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</link>
      <guid>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</guid>
      <description>&lt;h2&gt;
  
  
  Why Context Graphs Matter Now for AI Agents
&lt;/h2&gt;

&lt;p&gt;In the past few months, AI has shifted from chatbots to agents: autonomous systems that don't just answer questions but make decisions, approve exceptions, route escalations, and execute workflows across enterprise systems. &lt;a href="https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/" rel="noopener noreferrer"&gt;Foundation Capital&lt;/a&gt; recently called this shift AI's "trillion-dollar opportunity," arguing that enterprise value is migrating from traditional systems of record to systems that capture decision traces: the "why" behind every action.&lt;/p&gt;

&lt;p&gt;But here's the problem: agents deployed without proper context infrastructure are failing at scale, with customers reporting "1,000+ AI instances with no way to govern them" and "all kinds of agentic tools that none talk to each other," as stated in &lt;a href="https://metadataweekly.substack.com/p/context-graphs-are-a-trillion-dollar" rel="noopener noreferrer"&gt;Metadata Weekly&lt;/a&gt;. The issue isn't the AI models themselves; it's that agents lack the structured knowledge foundation they need to reason reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Missing Infrastructure: Relationship-Based Context
&lt;/h3&gt;

&lt;p&gt;47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024, according to the &lt;a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/" rel="noopener noreferrer"&gt;MIT Sloan Management Review&lt;/a&gt;. Even when agents don't hallucinate outright, they struggle with multi-step reasoning that requires connecting distant facts across systems. An agent might know a customer filed a complaint and know about a recent product defect and know the refund policy, but fail to connect these relationships to understand why an exception should be granted.&lt;/p&gt;

&lt;p&gt;As Prukalpa Sankar, co-founder of Atlan, frames it: "In 2025, in the dawn of the AI era, context is king" in her &lt;a href="https://atlan.com/know/closing-the-context-gap/" rel="noopener noreferrer"&gt;article&lt;/a&gt;. Context Graphs provide this missing infrastructure by organizing information as an interconnected network of entities and relationships, enabling &lt;a href="https://dev.to/ai-solutions"&gt;AI agents&lt;/a&gt; to traverse meaningful connections, reason across multiple facts, and deliver explainable decisions.&lt;/p&gt;

&lt;p&gt;This comprehensive guide explains what Context Graphs are, how they work, and why they're becoming essential infrastructure for enterprise AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Context Graph? Definition, Use Cases &amp;amp; Implementation Guide
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" alt="Context Graph" width="1536" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Context Graphs Work
&lt;/h3&gt;

&lt;p&gt;Context Graphs transform raw data into a semantic network of nodes (entities like people or projects), directed edges (relationships such as "worked_on" or "depends_on"), and properties (key-value details on both). This structure enables AI agents to perform graph traversals, starting from a query node and following relevant edges, for dynamic context assembly and multi-hop reasoning, unlike rigid keyword or vector searches.&lt;/p&gt;

&lt;h4&gt;
  
  
  Core Components:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes:&lt;/strong&gt; Represent real-world entities (e.g. "ProjectX"). Each holds properties like name, type, or timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges:&lt;/strong&gt; Directed connections with types (e.g. → "worked_on" →) and properties (e.g. role: "lead", duration: "6 months"). Directions indicate flow, like cause-effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Properties:&lt;/strong&gt; Metadata attached to nodes/edges (e.g., confidence score on an edge), enabling filtered traversals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Traversal Process:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query Entry:&lt;/strong&gt; Input like "API security projects" matches starting nodes via properties or embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neighbor Expansion:&lt;/strong&gt; Fetch adjacent nodes/edges, prioritizing by relevance (e.g., recency, strength).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Hop Pathfinding:&lt;/strong&gt; Traverse 2-4 hops (e.g. Project → worked_on → Engineer → similar_to → AuthSystem), using algorithms like BFS or HNSW-inspired graphs for efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Assembly:&lt;/strong&gt; Aggregate paths into a subgraph, feeding it to LLMs for grounded reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Log the path for auditing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mirrors vector DB indexing (e.g. HNSW in Pinecone) but emphasizes relational paths over pure similarity.&lt;/p&gt;
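
&lt;p&gt;The traversal process above can be sketched in a few lines. This is a deliberately simplified in-memory version (the graph contents, entity names, and relations are illustrative); a production system would issue the equivalent query against a graph database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

# Tiny illustrative graph: adjacency list of (relation, neighbor) pairs.
GRAPH = {
    "ProjectX":   [("worked_on_by", "Alice")],
    "Alice":      [("also_worked_on", "AuthSystem")],
    "AuthSystem": [("depends_on", "OAuth2")],
}

def traverse(start, max_hops=3):
    """Breadth-first expansion up to max_hops, recording each path for explainability."""
    paths = []
    queue = deque([(start, [start], 0)])
    seen = {start}
    while queue:
        node, path, depth = queue.popleft()
        if depth == max_hops:
            continue  # cap traversal depth
        for relation, neighbor in GRAPH.get(node, []):
            new_path = path + [relation, neighbor]
            paths.append(new_path)  # every path is an auditable reasoning trace
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, new_path, depth + 1))
    return paths

paths = traverse("ProjectX")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned paths double as the audit trail for Step 5: the longest one reads ProjectX → worked_on_by → Alice → also_worked_on → AuthSystem → depends_on → OAuth2.&lt;/p&gt;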

&lt;h4&gt;
  
  
  Example in Action:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Traditional Vector Search (e.g., Pinecone nearest-neighbor):&lt;/strong&gt; "API security projects" → Returns docs with similar embeddings (e.g. 3 keyword matches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Graph Traversal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="n"&gt;cypher&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;RELATED_TO&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Topic&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'API Security'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;related&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start:&lt;/strong&gt; Projects tagged "API Security".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 1:&lt;/strong&gt; → worked_on_by → Engineers (properties: skills="OAuth").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 2:&lt;/strong&gt; Engineers → also_worked_on → AuthSystems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 3:&lt;/strong&gt; AuthSystems → depends_on → OAuthProtocols (properties: version="2.0").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Subgraph with projects, team, deps, contributors—plus path visualization for explainability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Characteristics of Context Graphs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship-Centric Design:&lt;/strong&gt; Context Graphs prioritize connections over isolated records. This makes it natural to understand how concepts relate, not just what they contain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Hop Reasoning:&lt;/strong&gt; The graph structure enables AI to connect distant concepts through intermediate relationships, reasoning across multiple steps just as humans do. Example: Connecting "customer complaint" → "product defect" → "supplier issue" → "quality control process" in three hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Context Assembly:&lt;/strong&gt; Rather than retrieving fixed search results, Context Graphs assemble context on the fly by traversing only the relationships relevant to your specific query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in Explainability:&lt;/strong&gt; Every AI decision can be traced back through its relationship path. You can see exactly how the system reached a conclusion, critical for enterprise and regulated environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Intelligence:&lt;/strong&gt; Context Graphs model sequences, dependencies, and cause-and-effect relationships over time, making them ideal for understanding evolving processes and events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Scalability:&lt;/strong&gt; Modern graph databases handle millions of entities while maintaining fast traversal and query performance at scale.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graph vs Knowledge Graph vs Vector Database
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Context Graph&lt;/th&gt;
&lt;th&gt;Knowledge Graph&lt;/th&gt;
&lt;th&gt;Vector Database&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Focus&lt;/td&gt;
&lt;td&gt;Contextual relationships for AI reasoning&lt;/td&gt;
&lt;td&gt;General knowledge representation&lt;/td&gt;
&lt;td&gt;Semantic similarity matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Type&lt;/td&gt;
&lt;td&gt;Multi-hop traversal&lt;/td&gt;
&lt;td&gt;Structured queries&lt;/td&gt;
&lt;td&gt;Nearest neighbor search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Dynamic AI context assembly&lt;/td&gt;
&lt;td&gt;Structured domain knowledge&lt;/td&gt;
&lt;td&gt;Semantic search, &lt;a href="https://dev.to/what-is/retrieval-augmented-generation"&gt;RAG&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explainability&lt;/td&gt;
&lt;td&gt;High (shows relationship paths)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (similarity scores only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Complexity&lt;/td&gt;
&lt;td&gt;Complex multi-step reasoning&lt;/td&gt;
&lt;td&gt;Medium complexity&lt;/td&gt;
&lt;td&gt;Simple similarity queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These technologies complement each other. Many advanced AI systems use Context Graphs for reasoning combined with &lt;a href="https://dev.to/blog/top-5-vector-databases"&gt;vector databases&lt;/a&gt; for semantic search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Context Graph Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Knowledge Management:&lt;/strong&gt; Connect projects, people, decisions, and outcomes across your organization. Instead of finding where files live, trace how work evolved, what decisions shaped results, and who has relevant expertise. This will reduce your knowledge discovery time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Customer Support:&lt;/strong&gt; Go beyond keyword matching. Connect customer history, product configurations, known issues, and documented resolutions to provide contextually accurate answers. This will reduce your ticket resolution time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scientific Research &amp;amp; Discovery:&lt;/strong&gt; Connect millions of research papers, creating networks of studies, methodologies, findings, and citations. Discover unexpected connections between seemingly unrelated fields. You can identify underexplored research areas by analyzing relationship patterns and citation gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance &amp;amp; Risk Management:&lt;/strong&gt; Map relationships between regulations, internal policies, business processes, and controls. When requirements change, trace exactly where those changes affect systems and workflows. This will reduce your compliance audit preparation time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthcare Diagnostics:&lt;/strong&gt; Connect symptoms, medical history, medications, genetic factors, and research findings. Enable diagnostic systems to reason across these relationships and identify conditions that isolated analysis might miss. This will improve diagnostic accuracy by surfacing relevant but non-obvious connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supply Chain Optimization:&lt;/strong&gt; Model your entire supply network, suppliers, components, products, logistics partners, enabling sophisticated scenario analysis and rapid disruption response. For example, when supply issues arise, it will quickly identify alternative suppliers by traversing compatibility, certification, and performance relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal Research &amp;amp; Analysis:&lt;/strong&gt; Map relationships between cases, statutes, legal principles, and precedents. Trace how legal concepts evolved across jurisdictions and time periods. This would reduce legal research time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Recommendations:&lt;/strong&gt; Go beyond "customers who bought this also bought that." Understand topical relationships, creator connections, and contextual relevance to deliver truly personalized recommendations. This would increase engagement through unexpected but relevant discoveries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial Risk Assessment:&lt;/strong&gt; Model relationships between entities, transactions, accounts, and market factors. Detect complex fraud patterns spanning multiple accounts and understand how risks cascade through connected entities. This would detect more fraud patterns than traditional rule-based systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software Development Intelligence:&lt;/strong&gt; Map relationships between functions, modules, dependencies, documentation, and issues. Understand how code changes ripple through your system before making modifications. This would reduce breaking changes through comprehensive impact analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Context Graphs for AI Agents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce AI Hallucinations:&lt;/strong&gt; Ground AI outputs in explicit, verifiable relationships rather than probabilistic pattern matching alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve Reasoning Accuracy:&lt;/strong&gt; When answers require connecting multiple facts across domains, Context Graphs significantly outperform retrieval-only approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable Explainable AI:&lt;/strong&gt; Expose the exact path the AI took through your knowledge graph, making decisions transparent and auditable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale Without Schema Rigidity:&lt;/strong&gt; Add new entity types and relationships without forcing disruptive schema migrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surface Hidden Insights:&lt;/strong&gt; Discover patterns and connections that are nearly impossible to detect in traditional table or document structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintain Context Across Interactions:&lt;/strong&gt; Preserve relationship context throughout multi-turn conversations, enabling more sophisticated AI interactions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Implement Context Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Select Your Graph Database
&lt;/h3&gt;

&lt;p&gt;Choose based on scale, query patterns, and infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Popular Options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j:&lt;/strong&gt; Most mature, enterprise-ready, excellent query language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Neptune:&lt;/strong&gt; Managed AWS service, good for existing AWS infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TigerGraph:&lt;/strong&gt; Best for massive scale and complex analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArangoDB:&lt;/strong&gt; Multi-model database with graph capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FalkorDB:&lt;/strong&gt; Ultra-fast in-memory graph database built on Redis, best for low-latency real-time applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision Factors:&lt;/strong&gt; Query complexity, data volume, team expertise, budget&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Design Your Relationship Schema
&lt;/h3&gt;

&lt;p&gt;The value of a Context Graph depends on modeling the right entities and relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt; Collaborate closely with domain experts who understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What entities matter in your domain&lt;/li&gt;
&lt;li&gt;Which relationships drive important decisions&lt;/li&gt;
&lt;li&gt;How information flows through your processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Schema (Customer Support):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entities:&lt;/strong&gt; Customer, Ticket, Product, Issue, Resolution, Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships:&lt;/strong&gt; reported_by, relates_to, resolved_with, escalated_to, similar_to&lt;/li&gt;
&lt;/ul&gt;
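
&lt;p&gt;A schema like this can also be enforced in code. The sketch below uses the entity and relationship names from the example above (the validation helper itself is hypothetical) to reject edges that do not match the declared types:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative schema check for the customer-support example above.
ENTITY_TYPES = {"Customer", "Ticket", "Product", "Issue", "Resolution", "Agent"}
RELATIONSHIPS = {
    "reported_by":   ("Ticket", "Customer"),
    "relates_to":    ("Ticket", "Product"),
    "resolved_with": ("Ticket", "Resolution"),
    "escalated_to":  ("Ticket", "Agent"),
    "similar_to":    ("Ticket", "Ticket"),
}

def valid_edge(src_type, relation, dst_type):
    """Reject edges that do not match the declared schema."""
    return RELATIONSHIPS.get(relation) == (src_type, dst_type)

ok = valid_edge("Ticket", "reported_by", "Customer")   # matches the schema
bad = valid_edge("Customer", "reported_by", "Ticket")  # direction is reversed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;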

&lt;h3&gt;
  
  
  Step 3: Build Entity Extraction
&lt;/h3&gt;

&lt;p&gt;Identify entities in your source data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Unstructured Text:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NLP pipelines&lt;/li&gt;
&lt;li&gt;Fine-tune LLMs for domain-specific entity recognition&lt;/li&gt;
&lt;li&gt;Implement human-in-the-loop validation for critical entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Structured Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Map existing database fields directly to graph entities&lt;/li&gt;
&lt;li&gt;Normalize entity references across systems&lt;/li&gt;
&lt;/ul&gt;
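
&lt;p&gt;A minimal sketch of the structured-data path, assuming a hypothetical &lt;code&gt;tickets&lt;/code&gt; table: rows map directly to entities and relationships, with a normalization step so different spellings resolve to the same entity:&lt;/p&gt;

```python
# Sketch: mapping rows from a (hypothetical) relational "tickets" table
# directly to graph entities and relationships.
rows = [
    {"ticket_id": 101, "customer_email": "Ada@Example.com", "product": "billing"},
    {"ticket_id": 102, "customer_email": "ada@example.com", "product": "billing"},
]

def normalize_customer(email):
    # Normalize references so both spellings resolve to the same Customer entity.
    return "customer:" + email.strip().lower()

entities, relationships = set(), []
for row in rows:
    ticket = f"ticket:{row['ticket_id']}"
    customer = normalize_customer(row["customer_email"])
    entities.update([ticket, customer, f"product:{row['product']}"])
    relationships.append((ticket, "reported_by", customer))
    relationships.append((ticket, "relates_to", f"product:{row['product']}"))

print(len(entities))  # 4 -- both email variants collapse into one Customer node
```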

&lt;h3&gt;
  
  
  Step 4: Develop Relationship Extraction
&lt;/h3&gt;

&lt;p&gt;Beyond identifying entities, determine how they relate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based:&lt;/strong&gt; Define explicit patterns (if X mentions Y in context Z, create relationship R)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML-based:&lt;/strong&gt; Train models to identify relationship types from text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-based:&lt;/strong&gt; Use large language models for sophisticated relationship inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human validation:&lt;/strong&gt; Review critical relationship paths&lt;/li&gt;
&lt;/ul&gt;
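
&lt;p&gt;The rule-based approach can be sketched with a few explicit patterns (the patterns and relationship types below are illustrative, not from any library):&lt;/p&gt;

```python
import re

# Rule-based relationship extraction: explicit patterns that map text
# mentions to typed relationships.
RULES = [
    (re.compile(r"(\w+) depends on (\w+)"), "depends_on"),
    (re.compile(r"(\w+) was caused by (\w+)"), "caused_by"),
]

def extract_relationships(text):
    found = []
    for pattern, rel_type in RULES:
        for source, target in pattern.findall(text):
            found.append((source, rel_type, target))
    return found

print(extract_relationships("checkout depends on payments. outage was caused by deploy"))
# [('checkout', 'depends_on', 'payments'), ('outage', 'caused_by', 'deploy')]
```

In practice, rule-based extraction like this handles the predictable patterns cheaply, with ML- or LLM-based extraction layered on top for the long tail.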

&lt;h3&gt;
  
  
  Step 5: Enable Real-Time Updates
&lt;/h3&gt;

&lt;p&gt;Context Graphs are living systems requiring continuous updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement event-driven architecture for data changes&lt;/li&gt;
&lt;li&gt;Design incremental update patterns (don't rebuild everything)&lt;/li&gt;
&lt;li&gt;Maintain data lineage for troubleshooting&lt;/li&gt;
&lt;li&gt;Build conflict resolution for concurrent updates&lt;/li&gt;
&lt;/ul&gt;
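
&lt;p&gt;The incremental-update pattern can be sketched as an event handler that mutates only the affected edges and records lineage for troubleshooting (the event fields and origins below are illustrative):&lt;/p&gt;

```python
# Event-driven incremental updates: each change event mutates only the
# affected edges instead of rebuilding the graph, and records its origin.
graph = {("tick-42", "relates_to", "prod-api")}
lineage = []

def apply_event(event):
    """Apply one change event; record where it came from for data lineage."""
    edge = (event["source"], event["relationship"], event["target"])
    if event["op"] == "add":
        graph.add(edge)
    elif event["op"] == "remove":
        graph.discard(edge)
    lineage.append((event["op"], edge, event["origin"]))

apply_event({"op": "add", "source": "tick-42", "relationship": "resolved_with",
             "target": "res-7", "origin": "crm-webhook"})
apply_event({"op": "remove", "source": "tick-42", "relationship": "relates_to",
             "target": "prod-api", "origin": "manual-correction"})
print(len(graph), len(lineage))  # 1 2
```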

&lt;h3&gt;
  
  
  Step 6: Optimize Query Performance
&lt;/h3&gt;

&lt;p&gt;Keep multi-hop queries responsive at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index critical properties used in traversals&lt;/li&gt;
&lt;li&gt;Cache frequent query patterns&lt;/li&gt;
&lt;li&gt;Limit traversal depth for expensive queries&lt;/li&gt;
&lt;li&gt;Denormalize selectively for performance-critical paths&lt;/li&gt;
&lt;li&gt;Use query profiling to identify bottlenecks&lt;/li&gt;
&lt;/ul&gt;
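
&lt;p&gt;The depth-limit guideline can be sketched as a breadth-first traversal that stops expanding past a hop cap (the toy graph below is hypothetical):&lt;/p&gt;

```python
from collections import deque

# A depth-limited traversal over a toy adjacency-list graph -- the
# "limit traversal depth" guideline made concrete.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": [],
    "F": [],
}

def traverse(start, max_hops):
    """Breadth-first traversal that refuses to expand beyond max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # depth cap reached: do not expand further
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

print(sorted(traverse("A", 2)))  # ['A', 'B', 'C', 'D', 'E'] -- 'F' is 3 hops away
```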

&lt;h3&gt;
  
  
  Step 7: Integrate Graph Analytics
&lt;/h3&gt;

&lt;p&gt;Enhance your Context Graph with advanced algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PageRank:&lt;/strong&gt; Identify influential nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Detection:&lt;/strong&gt; Find clusters of related entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path Finding:&lt;/strong&gt; Discover optimal routes through relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Embeddings:&lt;/strong&gt; Enable similarity calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link Prediction:&lt;/strong&gt; Suggest missing relationships&lt;/li&gt;
&lt;/ul&gt;
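
&lt;p&gt;To make the PageRank idea concrete, here is a from-scratch power-iteration sketch on a toy three-node graph (in practice you would use your graph database's built-in algorithms):&lt;/p&gt;

```python
# PageRank by power iteration on a toy graph: a -> b, b -> c, c -> a and c -> b.
# Each node repeatedly distributes its rank to its link targets.
links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
damping, n = 0.85, len(links)
rank = {node: 1.0 / n for node in links}

for _ in range(50):  # iterate until approximately converged
    incoming = {node: 0.0 for node in links}
    for src, targets in links.items():
        share = rank[src] / len(targets)
        for dst in targets:
            incoming[dst] += share
    rank = {node: (1 - damping) / n + damping * incoming[node]
            for node in links}

best = max(rank, key=rank.get)
print(best)  # 'b' -- it receives links from both 'a' and 'c'
```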

&lt;h2&gt;
  
  
  Implementation Challenges &amp;amp; Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;Practical Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graph Construction Complexity&lt;/td&gt;
&lt;td&gt;Building comprehensive graphs requires sophisticated entity and relationship extraction from unstructured data&lt;/td&gt;
&lt;td&gt;Start with a focused domain where you have high-quality structured data. Expand gradually as you build extraction capabilities.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Design Expertise&lt;/td&gt;
&lt;td&gt;Effective schemas demand deep domain understanding; poor design leads to unusable graphs&lt;/td&gt;
&lt;td&gt;Run workshops with subject matter experts. Build iteratively: start simple, refine based on actual query patterns.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance at Scale&lt;/td&gt;
&lt;td&gt;Graph traversals become expensive for complex multi-hop queries as data grows&lt;/td&gt;
&lt;td&gt;Invest in proper indexing, implement query optimization, use caching strategically, and set traversal depth limits (2-4 hops).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity Resolution&lt;/td&gt;
&lt;td&gt;Identifying that different mentions refer to the same entity is difficult but critical for accuracy&lt;/td&gt;
&lt;td&gt;Implement fuzzy matching, leverage unique identifiers where available, use ML-based entity resolution tools, maintain a golden record system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality Maintenance&lt;/td&gt;
&lt;td&gt;As graphs grow to millions of relationships, maintaining accuracy becomes challenging&lt;/td&gt;
&lt;td&gt;Implement automated validation rules, schedule periodic audits, track data lineage, enable user feedback loops for corrections.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration Complexity&lt;/td&gt;
&lt;td&gt;Incorporating Context Graphs into existing systems requires architectural changes and API design&lt;/td&gt;
&lt;td&gt;Build a graph API layer that existing systems can call. Start with read-only integration, add write capabilities once proven.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill Gap&lt;/td&gt;
&lt;td&gt;Shortage of professionals experienced in graph technologies and query languages like Cypher&lt;/td&gt;
&lt;td&gt;Train existing team members (graph databases are learnable, similar to SQL), hire contractors for initial setup, or partner with CloudRaft for implementation guidance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Management&lt;/td&gt;
&lt;td&gt;Context Graphs add infrastructure costs for databases, extraction pipelines, and real-time analytics&lt;/td&gt;
&lt;td&gt;Start with a high-value use case to demonstrate ROI. Scale infrastructure based on actual usage patterns. Monitor cost per query and optimize expensive operations.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Context Graph Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model relationships that drive decisions:&lt;/strong&gt; Don't create relationships just because you can. Focus on connections that enable valuable reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep entity types focused:&lt;/strong&gt; Avoid creating overly granular entity types. Each entity type should represent a meaningful concept in your domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make relationships meaningful:&lt;/strong&gt; Generic relationships like "related_to" provide little value. Use specific relationship types: "depends_on," "caused_by," "replaces."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance normalization and performance:&lt;/strong&gt; Highly normalized graphs are elegant but can be slow. Denormalize strategically for frequently traversed paths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version your schema:&lt;/strong&gt; Graph schemas evolve. Maintain version history and migration paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limit traversal depth:&lt;/strong&gt; Set maximum hops to prevent runaway queries. Most valuable relationships are within 2-4 hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter early:&lt;/strong&gt; Apply constraints as early as possible in your traversal to reduce the working set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use indexed properties:&lt;/strong&gt; Index properties you filter on frequently. This dramatically improves query performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache common patterns:&lt;/strong&gt; Identify frequently executed query patterns and cache results with appropriate TTLs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement validation rules:&lt;/strong&gt; Define constraints on entity properties and relationship validity to maintain quality automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track provenance:&lt;/strong&gt; Know where each entity and relationship came from. This enables troubleshooting and quality assessment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable feedback loops:&lt;/strong&gt; Allow users to report incorrect relationships. Use this feedback to improve extraction pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schedule audits:&lt;/strong&gt; Periodically review graph quality, especially for critical relationship paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graphs + LLMs: A Powerful Combination
&lt;/h2&gt;

&lt;p&gt;Context Graphs and Large Language Models (LLMs) complement each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-Augmented Generation (GAG):&lt;/strong&gt; Retrieve relevant subgraphs from your Context Graph and provide them as structured context to LLMs. This reduces hallucinations and grounds responses in your actual knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-Assisted Graph Construction:&lt;/strong&gt; Use LLMs to extract entities and relationships from unstructured text, building your Context Graph more quickly than rule-based approaches alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainable LLM Reasoning:&lt;/strong&gt; When LLMs generate responses based on graph context, you can trace exactly which relationships influenced the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Retrieval:&lt;/strong&gt; Combine vector search (for semantic similarity) with graph traversal (for relationship reasoning) to get the best of both approaches.&lt;/p&gt;
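
&lt;p&gt;A minimal sketch of hybrid retrieval, assuming toy two-dimensional embeddings and a hand-built edge list: a vector step ranks entities by semantic similarity, then a graph step expands the hits with directly related entities the embedding alone would miss:&lt;/p&gt;

```python
import math

# Hybrid retrieval sketch: cosine similarity over toy embeddings, followed
# by a one-hop graph expansion. All data here is illustrative.
embeddings = {
    "refund policy": [0.9, 0.1],
    "billing error": [0.8, 0.3],
    "login issue": [0.1, 0.9],
}
edges = {"billing error": ["invoice system"], "login issue": ["sso provider"]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def hybrid_retrieve(query_vec, top_k=1):
    # 1) vector step: rank entities by semantic similarity to the query
    ranked = sorted(embeddings, key=lambda e: cosine(query_vec, embeddings[e]),
                    reverse=True)[:top_k]
    # 2) graph step: expand each hit with its directly related entities
    expanded = set(ranked)
    for entity in ranked:
        expanded.update(edges.get(entity, []))
    return expanded

print(sorted(hybrid_retrieve([0.8, 0.3])))  # ['billing error', 'invoice system']
```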

&lt;h2&gt;
  
  
  Measuring Context Graph Success
&lt;/h2&gt;

&lt;p&gt;Track these metrics to assess your Context Graph implementation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response time:&lt;/strong&gt; Median and 95th percentile query latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Queries per second at peak usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate:&lt;/strong&gt; Percentage of queries served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity accuracy:&lt;/strong&gt; Percentage of correctly identified entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationship precision:&lt;/strong&gt; Percentage of relationships that are actually valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage:&lt;/strong&gt; Percentage of domain knowledge captured in the graph&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time saved:&lt;/strong&gt; Reduction in research/discovery time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy improvement:&lt;/strong&gt; Better decision quality from enhanced reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reduction:&lt;/strong&gt; Decreased manual effort for knowledge work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User satisfaction:&lt;/strong&gt; NPS or satisfaction scores for graph-powered features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate:&lt;/strong&gt; Reduction in factually incorrect AI outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning accuracy:&lt;/strong&gt; Percentage of multi-hop questions answered correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Percentage of AI decisions with traceable reasoning paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future of Context Graphs
&lt;/h2&gt;

&lt;p&gt;Context Graphs are evolving rapidly:&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Trends
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph + Vector Hybrid Systems:&lt;/strong&gt; Combining semantic vector search with graph reasoning for more sophisticated AI systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Schema Evolution:&lt;/strong&gt; ML systems that automatically suggest new entity types and relationships based on usage patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Graph Analytics:&lt;/strong&gt; Stream processing for graph updates and real-time pattern detection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Modal Graphs:&lt;/strong&gt; Incorporating images, audio, and video as first-class entities with rich relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Graphs:&lt;/strong&gt; Connecting knowledge graphs across organizational boundaries while maintaining privacy and security.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started with Context Graphs
&lt;/h2&gt;

&lt;p&gt;Ready to implement Context Graphs in your AI systems?&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Small, Think Big
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Identify a high-value use case where relationship reasoning matters&lt;/li&gt;
&lt;li&gt;Map your initial schema with domain experts (10-20 entity types is plenty to start)&lt;/li&gt;
&lt;li&gt;Build a proof of concept with a subset of your data&lt;/li&gt;
&lt;li&gt;Measure impact against your baseline approach&lt;/li&gt;
&lt;li&gt;Iterate and expand based on what you learn&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Common Starting Points
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support:&lt;/strong&gt; Connect tickets, customers, products, and resolutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal knowledge:&lt;/strong&gt; Link documents, projects, people, and decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Map regulations, policies, processes, and controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product development:&lt;/strong&gt; Connect features, dependencies, bugs, and releases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Context Graphs represent a fundamental shift in how AI systems understand and reason about information. By capturing not just data, but the rich network of relationships that gives data meaning, they unlock AI capabilities that were previously unattainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More accurate reasoning through multi-hop traversal&lt;/li&gt;
&lt;li&gt;Explainable decisions via traceable relationship paths&lt;/li&gt;
&lt;li&gt;Reduced hallucinations by grounding in verifiable connections&lt;/li&gt;
&lt;li&gt;Scalable knowledge management without rigid schema constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As AI becomes increasingly central to enterprise operations, Context Graphs will evolve from competitive advantage to foundational infrastructure. Organizations that build graph-based AI capabilities now will be well-positioned to lead in an AI-driven future.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt Context Graphs; it's when and where to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expert Help with Context Graph Implementation
&lt;/h2&gt;

&lt;p&gt;Building Context Graphs requires specialized expertise in graph databases, knowledge representation, and AI integration. CloudRaft provides complimentary AI consultations to help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assess feasibility for your specific use cases&lt;/li&gt;
&lt;li&gt;Design optimal schemas for your domain&lt;/li&gt;
&lt;li&gt;Architect scalable infrastructure that grows with your needs&lt;/li&gt;
&lt;li&gt;Integrate with existing AI systems seamlessly&lt;/li&gt;
&lt;li&gt;Train your team on graph technologies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between a Context Graph and a Knowledge Graph?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context Graphs are specialized knowledge graphs optimized for dynamic context assembly in AI systems. While knowledge graphs broadly represent domain knowledge, Context Graphs focus specifically on enabling AI reasoning through relationship traversal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use Context Graphs with vector databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolutely. Many advanced AI systems use both: vector databases for semantic similarity search and Context Graphs for relationship reasoning. This hybrid approach provides the best of both worlds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much data do I need to start?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can start small. Even a few thousand entities with well-modeled relationships can demonstrate value. Focus on quality relationships over quantity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the typical implementation timeline?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a focused proof of concept: 4-8 weeks. For production-ready implementation: 3-6 months. Timeline depends on data complexity, schema design, and integration requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need specialized graph database skills?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While helpful, they're not mandatory. Graph query languages like Cypher (Neo4j) are learnable, similar to SQL. Consider training existing team members or partnering with experts for initial setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do Context Graphs reduce AI hallucinations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By grounding AI responses in explicit, verifiable relationships rather than relying solely on probabilistic pattern matching from training data. The AI can only traverse relationships that actually exist in your graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the ROI of implementing Context Graphs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It varies by use case, but organizations typically see reductions in knowledge discovery time, improvements in AI reasoning accuracy, and less manual research effort. ROI is highest for knowledge-intensive workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Context Graphs work with my existing databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Context Graphs complement existing databases. You can keep transactional data in relational databases and build Context Graphs for relationship reasoning, syncing data between systems.&lt;/p&gt;

</description>
      <category>contextgraph</category>
    </item>
    <item>
      <title>Real-Time Postgres to ClickHouse CDC: Supercharge Analytics with PeerDB</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 27 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</link>
      <guid>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</guid>
      <description>&lt;p&gt;If you are running a heavy SaaS platform, you eventually hit a wall with PostgreSQL. It's fantastic for transactional data (OLTP), but when you try to run complex analytical queries on millions of rows, things slow down.&lt;/p&gt;

&lt;p&gt;We recently tackled this exact problem for a client handling high-volume messaging operations. Their analytics dashboards ran complex queries against an AWS Aurora PostgreSQL setup, and they needed a solution that was fast, reliable, and real-time.&lt;/p&gt;

&lt;p&gt;Here is how we solved it by building a high-performance replication pipeline from Postgres to ClickHouse using PeerDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" alt="Analytics" width="1918" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ClickHouse?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is the superior choice for analytics because it is a purpose-built OLAP database designed for high-performance data processing, unlike PostgreSQL, which is a row-based OLTP system better suited for transactional workloads. Its columnar storage architecture lets it query massive datasets with sub-second latency where standard Postgres deployments often hit performance walls. By switching to ClickHouse, you gain the ability to ingest millions of rows and execute complex analytical queries almost instantly, overcoming the performance limitations inherent in using PostgreSQL for analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CDC Landscape: Why We Chose PeerDB
&lt;/h2&gt;

&lt;p&gt;Real-time Change Data Capture (CDC) is the standard for moving data without slowing down your primary database. But how do you implement it? Here are the primary CDC options for replicating data from PostgreSQL to ClickHouse that we considered in our implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PeerDB
&lt;/h3&gt;

&lt;p&gt;PeerDB is a specialised tool designed specifically for PostgreSQL to ClickHouse replication. We chose it for its balance of performance and simplicity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It can run as a Docker container stack (PeerDB Server, UI, etc.) and connects directly to the Postgres logical replication slot.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Performance:&lt;/strong&gt; PeerDB was &lt;a href="https://docs.peerdb.io/why-peerdb" rel="noopener noreferrer"&gt;found&lt;/a&gt; to be very performant compared to other solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialised Features:&lt;/strong&gt; It handles initial snapshots (bulk loads) and real-time streaming (CDC) seamlessly. It also supports specific optimisations, such as dividing tables into multiple "mirrors" to speed up initial loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; It avoids the complexity of managing a full Kafka cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Edition Limits:&lt;/strong&gt; The community edition lacks built-in authentication for the UI, so it must be restricted to a private network or VPN, or fronted with an external authentication layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Altinity Sink Connector for ClickHouse
&lt;/h3&gt;

&lt;p&gt;This is a lightweight, single-executable solution often used to avoid the complexity of Kafka. It is developed by Altinity, a major ClickHouse contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It runs as a standalone binary or within a Kafka Connect environment. It connects to Postgres and replicates data to ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Simplicity:&lt;/strong&gt; It eliminates the need for a Kafka Connect cluster or ZooKeeper, running as a single executable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Replication:&lt;/strong&gt; Offers a direct path from Postgres to ClickHouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Schema:&lt;/strong&gt; Can automatically read the Postgres schema and create equivalent ClickHouse tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; We tested this option but rejected it because it did not meet our performance requirements compared to PeerDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Debezium and Kafka
&lt;/h3&gt;

&lt;p&gt;This is the industry-standard approach for general-purpose CDC, involving a chain of distinct, complex components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; Postgres → Debezium (Kafka Connect) → Kafka Broker → ClickHouse Sink → ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; The message broker (Kafka) decouples the source from the destination, allowing multiple consumers to read the same stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Extremely robust for guaranteed message delivery and exactly-once processing (if configured correctly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Complexity:&lt;/strong&gt; Requires managing ZooKeeper, Kafka brokers, and schema registries. Avoiding "Kafka Connect framework complexity" was an explicit goal for us.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; Significant infrastructure footprint compared to direct replication tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why PeerDB?
&lt;/h3&gt;

&lt;p&gt;We initially tested the Altinity connector but ultimately chose PeerDB, mainly for the following reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; In our testing, PeerDB offered superior performance for our specific workload compared to other connectors we tried.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialisation:&lt;/strong&gt; It is purpose-built for Postgres-to-ClickHouse replication, handling data type mapping and initial snapshots smoothly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We opted for a "Keep It Simple" approach to infrastructure. While Kubernetes (EKS) is great, we deployed this on Amazon EC2 to maintain full control over the infrastructure and cost. If you have a team that can handle EKS for you, then that might be a better option. Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;discuss with our team&lt;/a&gt; to find the right solutions for your workload and team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; AWS Aurora (PostgreSQL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Pipeline:&lt;/strong&gt; PeerDB running via Docker Compose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; A ClickHouse cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability Design
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" alt="High Availability Design" width="1920" height="1080"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Altinity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To ensure we never lost data, we configured a ClickHouse cluster with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 Keeper Nodes:&lt;/strong&gt; Using m6i.large instances. These replace ZooKeeper for coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 ClickHouse Server Nodes:&lt;/strong&gt; Using r6i.2xlarge instances for heavy lifting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication:&lt;/strong&gt; We used ReplicatedMergeTree to ensure data exists on multiple nodes for safety&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ClickHouse Cluster
&lt;/h3&gt;

&lt;p&gt;We automated the deployment using Ansible to configure the hardware-aware settings. A cool feature of our setup is that the configuration automatically calculates memory limits and cache sizes based on the EC2 instance's RAM (e.g., leaving 25% for the OS and giving 75% to ClickHouse). We wrote about this earlier in our &lt;a href="https://www.cloudraft.io/blog/building-enterprise-grade-clickhouse-with-ansible" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing PeerDB
&lt;/h3&gt;

&lt;p&gt;We used Docker Compose to spin up the PeerDB stack. One specific nuance we encountered was configuring the storage abstraction. PeerDB uses MinIO (S3 compatible) for intermediate storage. We had to explicitly set the &lt;code&gt;PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3&lt;/code&gt; environment variable in our &lt;code&gt;docker-compose.yml&lt;/code&gt; to point to our MinIO host IP.&lt;/p&gt;
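
&lt;p&gt;A minimal sketch of that override (the service name and MinIO IP are placeholders and may differ in your version of the stack; only the variable name comes from our setup):&lt;/p&gt;

```yaml
# docker-compose.yml (fragment) -- illustrative override, not the full PeerDB stack
services:
  flow-worker:            # placeholder: use the worker service name from your stack
    environment:
      # Point PeerDB's S3-compatible intermediate storage at the MinIO host.
      # 10.0.0.5 is a placeholder for your MinIO host's private IP.
      PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3: "http://10.0.0.5:9000"
```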

&lt;p&gt;Set up the peers to connect with the source and destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the "Mirror"
&lt;/h3&gt;

&lt;p&gt;PeerDB uses a concept called &lt;strong&gt;Mirrors&lt;/strong&gt; to handle the CDC pipeline. We set up the connection by defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Source):&lt;/strong&gt; Our Aurora Postgres instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Destination):&lt;/strong&gt; Our ClickHouse cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mirror:&lt;/strong&gt; The actual replication job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PeerDB supports different streaming modes: log-based (CDC), cursor-based (timestamp or integer), and XMIN-based. In our implementation, we used log-based (CDC) replication.&lt;/p&gt;

&lt;p&gt;To optimise the initial data load, we didn't just dump everything at once. We divided our tables into multiple "batches" (mirrors) that ran in parallel, staggering their start times to avoid putting high load on the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Gotchas" From the Trenches
&lt;/h2&gt;

&lt;p&gt;No migration is perfect. Here are three issues we faced so you can avoid them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The "Too Many Parts" Error in ClickHouse&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ClickHouse loves big batches of data. If PeerDB syncs records one by one or in tiny groups too quickly, ClickHouse can't merge the data parts fast enough in the background. We saw errors like &lt;code&gt;Too many parts... Merges are processing significantly slower than inserts&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You may need to tune the batch size or frequency to slow down the inserts slightly, allowing ClickHouse's merge process to catch up.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Aurora Failovers Break Things&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If AWS Aurora triggers a failover, the IP/DNS resolution might shift. We found that this can break the peering connection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You have to edit the peer configuration to point to the new primary host and resync the mirror.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Security on Community Edition&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We used the community edition of PeerDB. Be aware that it does not have built-in authentication for the UI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; Do not expose the UI to the public internet. We access it over a private IP/VPN; alternatively, add an authentication layer using a third-party product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;By successfully moving analytical queries off the primary Postgres instance and into ClickHouse, we achieved the sub-second query performance our client required. PeerDB provided us with a robust, real-time CDC solution without the operational headache of managing a Kafka cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways on the Postgres + ClickHouse + PeerDB Combination:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; You get the best of both worlds: PostgreSQL handles fast, reliable transactional (OLTP) workloads, while ClickHouse takes on complex analytical (OLAP) queries with unmatched speed. This separation prevents slow analytical queries from impacting your core application database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Simplicity:&lt;/strong&gt; PeerDB acts as a purpose-built, high-performance bridge. It removes the need to deploy and manage a complex, multi-component CDC stack like Debezium and Kafka, significantly reducing infrastructure complexity and operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; This architecture allows your analytics layer (ClickHouse) to scale independently from your transactional layer (Postgres), ensuring that as your data volumes grow, you maintain both OLTP stability and OLAP speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; By offloading analytical processing, you can often run a smaller, more cost-effective Postgres instance dedicated to its core function, while leveraging ClickHouse's efficiency for massive-scale querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Are you looking to improve your analytics pipeline? Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;book a call&lt;/a&gt; with us to discuss your case.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>analytics</category>
      <category>clickhouse</category>
      <category>peerdb</category>
    </item>
    <item>
      <title>Why high performance storage is important for AI Cloud Build</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</link>
      <guid>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</guid>
<description>&lt;p&gt;The AI cloud market is experiencing exceptionally rapid growth worldwide, with the latest reports projecting annual growth rates between 28% and 40% over the next five years and a market size of up to $647 billion by 2030. The surge in AI Cloud adoption, GPU-as-a-service platforms, and enterprise interest in AI “factories” has created new pressures and opportunities for product engineering and IT leaders. Regardless of which public cloud or private cluster you choose, one key differentiator sets each AI and HPC solution apart: the &lt;strong&gt;performance of storage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While leading clouds often use the same GPUs and servers, the way data flows—between compute, network, storage, and persistent layers—determines everything from training speed to scalability. Understanding storage fundamentals will help you architect or select the right solution. We have previously covered &lt;a href="https://www.cloudraft.io/blog/how-to-build-ai-cloud" rel="noopener noreferrer"&gt;how to build AI cloud&lt;/a&gt; solutions, and in this article we draw on our hands-on experience in this space to share our thoughts on storage.&lt;/p&gt;

&lt;p&gt;Business and technology leaders now recognize that real-world AI breakthroughs require infrastructure with high bandwidth, low latency, and extreme parallelism. As deep learning and data-intensive analytics move from labs to production, GPU clusters run ever-larger models on ever-growing datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Storage Matter in AI Workloads?
&lt;/h2&gt;

&lt;p&gt;Storage plays an important role across the entire AI lifecycle. Let’s look at the three major areas: data preparation, training &amp;amp; tuning, and inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Scalable and performant storage to support transforming data for AI use&lt;/li&gt;
&lt;li&gt;Protecting valuable raw and derived training data sets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing large structured and unstructured datasets in many formats&lt;/li&gt;
&lt;li&gt;Scaling under the pressure of map-reduce like distributed processing often used for transforming data for AI&lt;/li&gt;
&lt;li&gt;Support for file and object access protocols to ease integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training &amp;amp; Tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Providing training data to keep expensive GPUs fully utilized&lt;/li&gt;
&lt;li&gt;Saving and restoring model checkpoints to protect training investments&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep training GPU resources busy&lt;/li&gt;
&lt;li&gt;Minimizing time to save checkpoint data to limit training pauses&lt;/li&gt;
&lt;li&gt;Scaling to meet demands of data parallel training in large clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Safely storing and quickly delivering model artifacts for inference services&lt;/li&gt;
&lt;li&gt;Providing data for batch inferencing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reliably storing expensive-to-produce model artifacts&lt;/li&gt;
&lt;li&gt;Minimizing model artifact read latency for quick inference deployment&lt;/li&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep inference GPU resources busy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High-Performance Storage Is Critical to the Checkpointing Process in AI Training
&lt;/h3&gt;

&lt;p&gt;Checkpointing is a critical process in large-scale AI training, enabling models to periodically save and restore their state as training progresses. As model and dataset sizes expand into the billions of parameters and petabytes of data, this operation becomes increasingly demanding for storage infrastructure. Efficient checkpointing helps safeguard training progress against inevitable hardware failures and disruptions, while also allowing for fine-tuning, experimentation, and rapid recovery. However, frequent checkpointing can introduce performance overhead due to pauses in computation and intensive reads/writes to persistent storage, especially when distributed clusters grow to thousands of accelerators.&lt;/p&gt;

&lt;p&gt;To address these challenges, modern AI storage architecture leverages strategies such as asynchronous checkpointing—where checkpoints are saved in the background, minimizing idle time—and hierarchical distribution, reducing bottlenecks by having leader nodes manage data transfers within clusters. The result is faster training throughput, lower risk of lost work, and more efficient use of compute resources. Optimizing for checkpoint size, frequency, and concurrent access patterns is vital to ensure high throughput and low latency, making high-performance scalable storage systems an indispensable foundation for reliable, cost-effective AI model training at scale. You can read more about it in this &lt;a href="https://aws.amazon.com/blogs/storage/architecting-scalable-checkpoint-storage-for-large-scale-ml-training-on-aws/" rel="noopener noreferrer"&gt;AWS article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kind of Storage Is Needed for AI and HPC Workloads?
&lt;/h2&gt;

&lt;p&gt;For AI and HPC workloads, the demands extend well beyond ordinary enterprise storage. Key requirements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel File Systems:&lt;/strong&gt; Multiple servers and GPUs need to access datasets at the same time. Systems such as Lustre, WEKA, VAST Data, CephFS, and DDN Infinia enable concurrent access, avoiding bottlenecks and improving throughput for distributed workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Throughput and Low Latency:&lt;/strong&gt; Training GPT-like models or running simulations generates millions of read/write operations per second. Storage must deliver bandwidth in the tens to hundreds of GB/s and latency below 1ms, so that GPUs remain fed and productive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSIX Compliance:&lt;/strong&gt; Many AI frameworks and HPC applications expect a traditional POSIX interface for seamless operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability and Elasticity:&lt;/strong&gt; Petabyte-scale capacity is the norm. Modern solutions allow you to scale horizontally, adding performance and capacity as demand grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity and Reliability:&lt;/strong&gt; Enterprise-grade AI and HPC workloads need uninterrupted access to their data. Redundancy, fault tolerance, and robust disaster recovery features matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Typical Storage Specifications and Requirements
&lt;/h2&gt;

&lt;p&gt;For modern AI cloud, AI factory, or GPU cloud infrastructure, expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth:&lt;/strong&gt; 15–512 GB/s (or higher for top-tier solutions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOPS:&lt;/strong&gt; From 20,000 (entry) up to 800,000+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Sub-1ms to 2ms for parallel file systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity:&lt;/strong&gt; 100TB to multi-petabyte scale, often with tiering to object storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocols:&lt;/strong&gt; NFSv3/v4.1, SMB, Lustre, S3 (for hybrid and archival storage), HDFS, and native REST APIs&lt;/li&gt;
&lt;/ul&gt;
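&lt;p&gt;The upper end of that bandwidth range can be sanity-checked with simple cluster math; the GPU count and per-GPU read rate below are assumed figures:&lt;/p&gt;

```shell
# Aggregate read bandwidth needed to keep a training cluster fed.
# 256 GPUs and 2 GB/s per GPU are illustrative assumptions, not vendor specs.
GPUS=256
GBPS_PER_GPU=2
echo "required aggregate read bandwidth: $(( GPUS * GBPS_PER_GPU )) GB/s"
```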

&lt;p&gt;On-premises or hybrid deployments may include NVMe storage, CXL-enabled expansion, and advanced cooling for supporting high-density GPU clusters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;AI Lifecycle Stage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Considerations&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reading Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Accommodate wide range of read BW requirements and IO access patterns across different AI models &lt;br&gt; - Deliver large amounts of read BW to single GPU servers for most demanding models&lt;/td&gt;
&lt;td&gt;- Use high performance, all-flash storage to meet needs &lt;br&gt; - Leverage RDMA capable storage protocols, when possible, for most demanding requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saving Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential write bandwidth for quickly saving checkpoints&lt;br&gt; - Handle multiple large sequential write streams to separate files, especially in same directory&lt;/td&gt;
&lt;td&gt;- Understand checkpoint implementation details and behaviors for expected AI workloads&lt;br&gt; - Determine time limits for completing checkpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Restoring Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential read bandwidth for quickly restoring checkpoints &lt;br&gt; - Handle multiple large sequential read streams to same checkpoint file&lt;/td&gt;
&lt;td&gt;- Understand how often checkpoint restoration will be required &lt;br&gt; - Determine acceptable time limits for restoration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Servicing GPU Clusters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Meet performance requirements for mixed storage workloads from multiple simultaneous AI jobs &lt;br&gt; - Scale capacity and performance as GPU clusters grow with business needs&lt;/td&gt;
&lt;td&gt;- Consider scale-out storage platforms that can increase performance and capacity while providing shared access to data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: snia.org - John Cardente Talk&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Options for AI Cloud and HPC Workloads
&lt;/h2&gt;

&lt;p&gt;To achieve next-generation AI and HPC results, enterprises and product teams should evaluate both commercial vendors and open source platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Source Parallel File Systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (CephFS):&lt;/strong&gt; Highly flexible, POSIX-compliant, scales from small clusters to exabytes. Used in academic and commercial AI labs for robust file and object storage. Many early-stage AI factories use solutions built on top of Ceph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lustre / DDN Lustre:&lt;/strong&gt; Optimized for large-scale HPC and AI workloads. Used in many supercomputing and enterprise environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBM Spectrum Scale (GPFS):&lt;/strong&gt; High-performing parallel file system, widely used in science and industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Commercial AI and HPC Storage Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Delivers extreme performance for AI storage, marrying parallel file system performance with the economics of NAS and archive. VAST has been widely adopted by AI Cloud players like &lt;a href="https://www.vastdata.com/customers/coreweave" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt; and Lambda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Highly optimized metadata and file access for AI and multi-tenant clusters; helps overcome bottlenecks experienced in legacy systems. Similar to VAST, WEKA has customers such as Yotta, Cohere, and &lt;a href="http://Together.ai" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Industry leader for research, hybrid file-object storage, and scalable data intelligence for model training and analytics. DDN’s solutions, like Infinia and xFusionAI, focus on both performance and efficiency for GPU workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure Storage, Cloudian, IBM, Dell:&lt;/strong&gt; Also recognized for delivering enterprise-grade AI/HPC storage platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many solutions integrate natively with popular public clouds (AWS S3, Google Cloud Storage, Azure Blob)—enabling hybrid architectures and seamless data movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Examples and Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (Open Source):&lt;/strong&gt; Used by research labs and private cloud teams to build petabyte-scale, resilient storage for AI and HPC clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Enterprise deployments often leverage WEKA for AI factories—a system with hundreds of GPUs running concurrent training jobs—thanks to its elastic scaling and metadata performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Designed to deliver high throughput for both small and large file operations, increasingly chosen for generative AI workloads and data-intensive analytics in fintech, healthcare, and media.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Supports hybrid deployment strategies; offers both parallel file system and object storage in a unified stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems such as Lustre and Spectrum Scale facilitate near-instant recovery, zero-data loss architectures, and compliance for regulated sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying the Best Storage for Your Needs
&lt;/h2&gt;

&lt;p&gt;Because every cloud environment is unique, the first step in designing the right solution is to establish a baseline through hardware benchmarking. MLCommons' benchmarking tools can be run directly on your hardware to gather reliable performance data.&lt;/p&gt;
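&lt;p&gt;Before a full MLPerf Storage run, a quick sequential-read baseline can be taken with a small fio job file; the block size, file size, job count, and mount point below are starting-point assumptions to tune for your hardware:&lt;/p&gt;

```ini
; seq-read.fio - rough single-client sequential-read baseline.
; /mnt/ai-storage is a placeholder for your storage mount point.
[global]
ioengine=libaio
direct=1
bs=1M
size=10G
numjobs=8
iodepth=32
group_reporting
directory=/mnt/ai-storage

[seqread]
rw=read
```

&lt;p&gt;Run it with fio seq-read.fio and compare the reported bandwidth against vendor claims before scaling out to cluster-level benchmarks.&lt;/p&gt;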

&lt;p&gt;The latest MLPerf Storage v2.0 &lt;a href="https://mlcommons.org/benchmarks/storage/" rel="noopener noreferrer"&gt;benchmark results&lt;/a&gt; from MLCommons highlight the increasingly critical role of storage performance in the scalability of AI training systems. With participation nearly doubling compared to the previous v1.0 round, the industry’s rapid innovation is evident—storage solutions now support around twice the number of accelerators as before. The new iteration includes checkpointing benchmarks, which address real-world scenarios faced by large AI clusters, where frequent hardware failures can disrupt training jobs. By simulating such events and evaluating storage recovery speeds, MLPerf Storage v2.0 offers valuable insights into how checkpointing helps ensure uninterrupted performance in sprawling datacenter environments.&lt;/p&gt;

&lt;p&gt;A broad spectrum of storage technologies took part in the benchmark—ranging from local storage, in-storage accelerators, to object stores—reflecting the diversity of approaches in AI infrastructure. Over 200 results were submitted by 26 organizations worldwide, many participating for the first time, which showcases the growing global momentum behind the MLPerf initiative. The benchmarking framework—open-source and rigorously peer-reviewed—provides unbiased, actionable data for system architects, datacenter managers, and software vendors. MLPerf Storage is a go-to resource for designing resilient, high-performance AI training systems in a rapidly evolving technology landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Building Your AI Cloud and HPC Strategy
&lt;/h2&gt;

&lt;p&gt;As the AI Cloud, GPU-as-a-service, and HPC landscape evolves, storage is no longer a background detail—it is the core differentiator for speed, scale, and future innovation. Vendor neutrality empowers you to architect best-of-breed systems, leveraging open-source foundations and integrating commercial solutions where they fit your needs. Every cloud or on-prem cluster will benefit from storage designed for AI and HPC, not just traditional workloads.&lt;/p&gt;

&lt;p&gt;Ready for the next step? If you want to explore options, benchmark solutions, or design an optimized AI/HPC cloud, &lt;a href="https://cal.com/cloudraft/consulting" rel="noopener noreferrer"&gt;book a meeting&lt;/a&gt; with the CloudRaft team. Our experts bring hands-on experience from enterprise projects, migration strategies, and multi-vendor deployments, helping you maximize both infrastructure and business outcomes. Read more about our &lt;a href="https://www.cloudraft.io/ai-cloud-consulting" rel="noopener noreferrer"&gt;offering&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aicloud</category>
      <category>ai</category>
      <category>infrastructure</category>
      <category>storage</category>
    </item>
    <item>
      <title>GitOps: ArgoCD vs FluxCD</title>
      <dc:creator>Unnati Mishra</dc:creator>
      <pubDate>Fri, 02 Aug 2024 10:12:24 +0000</pubDate>
      <link>https://forem.com/cloudraft/gitops-argocd-vs-fluxcd-20a4</link>
      <guid>https://forem.com/cloudraft/gitops-argocd-vs-fluxcd-20a4</guid>
      <description>&lt;h2&gt;
  
  
  Getting Started with GitOps
&lt;/h2&gt;

&lt;p&gt;In the fast-paced world of software development, organizations are constantly seeking ways to streamline processes and improve efficiency through automation. The shift from waterfall models to hyper-agile methodologies, coupled with the adoption of microservices architecture, has led to much faster software releases. GitOps has emerged as a powerful approach to enable this rapid deployment cycle, implementing a &lt;a href="https://kubernetes.io/docs/concepts/architecture/controller/" rel="noopener noreferrer"&gt;control-loop&lt;/a&gt; pattern often seen in Kubernetes.&lt;/p&gt;

&lt;p&gt;GitOps offers a more consistent and reliable way to handle infrastructure and deployment. In this blog, we'll explore what GitOps is, why it's gaining popularity among DevOps teams, and take a closer look at popular GitOps tools like Argo CD and Flux CD.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GitOps?
&lt;/h2&gt;

&lt;p&gt;GitOps, a combination of 'Git' and 'Operations', is an approach to continuous deployment for cloud-native applications. It uses Git as the single source of truth for declarative infrastructure and applications. This means storing and managing all configuration files that describe how our application should be deployed and run in Git repositories.&lt;/p&gt;

&lt;p&gt;The core principle of GitOps is treating everything - from application code to infrastructure - as code that can be version-controlled and managed using Git. When changes are needed, instead of manually executing commands or scripts, we make changes to our Git repository. A controller then detects these changes and applies them to our infrastructure.&lt;/p&gt;
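&lt;p&gt;The control loop can be caricatured in a few lines of shell; this sketch only prints the commands a naive poll-based reconciler would run (real controllers such as Argo CD and Flux use Kubernetes watches and drift detection rather than polling), and the repository path is a placeholder:&lt;/p&gt;

```shell
# Naive GitOps reconcile step - illustration only, not production code.
# Prints, rather than runs, the commands a poll-based reconciler would issue.
reconcile() {
  local REPO_DIR=$1
  echo "git -C ${REPO_DIR} pull --quiet"          # refresh the source of truth
  echo "kubectl apply -f ${REPO_DIR}/manifests/"  # converge toward desired state
}

# A controller would invoke this in a loop or on a watch/webhook event:
reconcile /repo
```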

&lt;h2&gt;
  
  
  Benefits of GitOps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency and Reliability&lt;/strong&gt;: With GitOps, the entire system configuration is stored in version control, providing a clear, auditable record of what should be deployed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster Recovery and Easier Rollbacks&lt;/strong&gt;: In case of issues, rolling back to a previous state is as simple as reverting to a previous commit in the Git history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;: Git's central point of control allows for strict access controls and enforced code reviews before changes are applied.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved Developer Experience&lt;/strong&gt;: Developers can use familiar Git workflows to manage infrastructure, bridging the gap between development and operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visibility and Traceability&lt;/strong&gt;: All changes are recorded in Git, providing a clear record of who changed what and when.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Automation&lt;/strong&gt;: Pushing changes to Git can automatically trigger deployments, reducing manual work and speeding up processes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment Consistency&lt;/strong&gt;: GitOps makes it easier to maintain consistency between different environments (development, staging, production).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Productivity&lt;/strong&gt;: DORA's research suggests teams can ship 30-100 times more changes per day, increasing overall development output by 2-3 times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability&lt;/strong&gt;: With all configuration data in Git, organizations can easily deploy the same Kubernetes platform across different environments, leading to better availability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Argo CD vs Flux CD
&lt;/h2&gt;

&lt;p&gt;When implementing GitOps for Kubernetes, two popular tools stand out: Argo CD and Flux CD. Both are excellent choices, but they have some differences. Here's a comparison of their features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Argo CD&lt;/th&gt;
&lt;th&gt;Flux CD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes-native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI&lt;/td&gt;
&lt;td&gt;Rich web-based UI&lt;/td&gt;
&lt;td&gt;Capacitor GUI dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-tenancy&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Via Helm Operator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kustomize support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync Mechanism&lt;/td&gt;
&lt;td&gt;Automatic sync&lt;/td&gt;
&lt;td&gt;Controller-based sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback capabilities&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health status&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Relies on Kubernetes status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image Updater&lt;/td&gt;
&lt;td&gt;Add-on&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced Deployment Strategies&lt;/td&gt;
&lt;td&gt;Integrated with Argo rollouts&lt;/td&gt;
&lt;td&gt;Supported via Flagger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Comparison between ArgoCD and FluxCD&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the next section, we will have a look at a quick demo of Argo CD and Flux CD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Argo CD
&lt;/h2&gt;

&lt;p&gt;In this quick demo of Argo CD, we will go through the step-by-step process of installing Argo CD on a Kubernetes cluster. We will use Argo CD to deploy a sample guestbook application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubectl installed and configured&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A configured Git repository&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Argo CD Installation
&lt;/h3&gt;

&lt;p&gt;To install Argo CD, we need to have a Kubernetes cluster and kubectl installed and configured. You can check out the guide to install kubectl &lt;a href="https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Create a namespace for Argo CD
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Install Argo CD
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Access the Argo CD API server
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Port-forward the Argo CD server service
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/argocd-server &lt;span class="nt"&gt;-n&lt;/span&gt; argocd 8080:443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get the initial password of the admin user to authenticate
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get secret argocd-initial-admin-secret &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"{.data.password}"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this password to log into the Argo CD UI with the username admin at the forwarded port on localhost; in this example, it is &lt;a href="https://localhost:8080" rel="noopener noreferrer"&gt;https://localhost:8080&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hvao7j5i4hi500xnnfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hvao7j5i4hi500xnnfj.png" alt="Argo CD UI" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy a sample application - guestbook
&lt;/h3&gt;

&lt;p&gt;To deploy an app, we need to create an Application object. The spec includes the source of the Kubernetes manifests, the destination Kubernetes cluster and namespace, and the sync policy. You can also provide image updater configuration via annotations; in this example, we are not using an image updater.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;  
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;  
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;  
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;  
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;  
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/argoproj/argocd-example-apps.git&lt;/span&gt;  
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;  
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;  
    &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;  
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;  
    &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
    &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="pi"&gt;-&lt;/span&gt;  &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create the application
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; application.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futebr1mfm8wovweajz2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futebr1mfm8wovweajz2o.png" alt="Argo CD UI" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After applying the Argo CD application, the Argo CD controller will automatically monitor and apply the changes in the cluster. You can monitor this from the UI or by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get apps &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Flux CD
&lt;/h2&gt;

&lt;p&gt;In this demo of Flux CD, we will walk through its installation. We will use Flux CD to deploy the ‘fleet-infra’ repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes Cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub Personal Access Token. If you need help generating GitHub token check out this &lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens" rel="noopener noreferrer"&gt;guide&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Objectives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bootstrap Flux CD on a Kubernetes cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy a sample application using Flux.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customize the application configuration through Kustomize patches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Install the Flux CLI
&lt;/h4&gt;

&lt;p&gt;The Flux command-line interface (CLI) is used to bootstrap and interact with Flux CD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://fluxcd.io/install.sh | &lt;span class="nb"&gt;sudo &lt;/span&gt;bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Export Your Credentials
&lt;/h4&gt;

&lt;p&gt;Export your GitHub personal access token and username.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export  &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-token&amp;gt;
&lt;span class="nb"&gt;export  &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check Your Kubernetes Cluster
&lt;/h4&gt;

&lt;p&gt;Ensure your cluster is ready for Flux by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux check &lt;span class="nt"&gt;--pre&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Flux Installation
&lt;/h3&gt;

&lt;p&gt;To bootstrap using a GitHub repository, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    flux bootstrap github &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GITHUB_USER&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--repository&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;fleet-infra &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./clusters/my-cluster &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--personal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Clone the Git Repository
&lt;/h3&gt;

&lt;p&gt;Clone the fleet-infra repository to your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    git clone https://github.com/&lt;span class="nv"&gt;$GITHUB_USER&lt;/span&gt;/fleet-infra  
    &lt;span class="nb"&gt;cd &lt;/span&gt;fleet-infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add podinfo Repository to Flux
&lt;/h3&gt;

&lt;p&gt;Create a GitRepository manifest pointing to the master branch of the podinfo repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    flux create &lt;span class="nb"&gt;source &lt;/span&gt;git podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://github.com/stefanprodan/podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;master &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--export&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./clusters/my-cluster/podinfo-source.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
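
&lt;p&gt;For reference, the exported podinfo-source.yaml should look roughly like this (the exact apiVersion may vary with your Flux version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: master
  url: https://github.com/stefanprodan/podinfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;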



&lt;p&gt;Commit and push the podinfo-source.yaml file to the fleet-infra repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    git add &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add podinfo GitRepository"&lt;/span&gt;  
    git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy podinfo Application
&lt;/h3&gt;

&lt;p&gt;Create a Kustomization manifest to deploy the podinfo application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    flux create kustomization podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--target-namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;podinfo &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./kustomize"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--wait&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--retry-interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--health-check-timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3m &lt;span class="se"&gt;\ &lt;/span&gt; 
    &lt;span class="nt"&gt;--export&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./clusters/my-cluster/podinfo-kustomization.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
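
&lt;p&gt;The exported podinfo-kustomization.yaml should look roughly like this (again, the exact apiVersion may vary with your Flux version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 30m0s
  path: ./kustomize
  prune: true
  retryInterval: 2m0s
  sourceRef:
    kind: GitRepository
    name: podinfo
  targetNamespace: default
  timeout: 3m0s
  wait: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;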



&lt;p&gt;Commit and push the podinfo-kustomization.yaml file to the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add podinfo Kustomization"&lt;/span&gt;  
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Watch Flux Sync the Application
&lt;/h3&gt;

&lt;p&gt;Use the flux get command to watch the podinfo app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux get kustomizations &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify the Deployment
&lt;/h3&gt;

&lt;p&gt;Check if podinfo has been deployed on your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get deployments,services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GitOps best practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Git Workflows: Keep application code repositories separate from GitOps configuration repositories, and avoid long-lived branches for different environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplify your Kubernetes files: Use tools like Kustomize and Helm to make your Kubernetes manifests simpler and easier to manage; used together, they help you avoid repetition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handle secrets carefully: Do not store passwords or secrets directly in Git, even in encrypted form. Instead, use tools that fetch secrets when needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separate Build and Deployment Processes: Let your CI system build and test your application, and let GitOps handle deploying the resulting artifact to your clusters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
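
&lt;p&gt;To illustrate the secrets guidance above, here is a minimal sketch of an ExternalSecret resource (it assumes the External Secrets Operator is installed and a SecretStore named &lt;code&gt;vault-backend&lt;/code&gt; already exists); only this reference lives in Git, and the actual value is fetched from the backing store at sync time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-credentials
  data:
    - secretKey: db-password
      remoteRef:
        key: prod/app
        property: db-password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;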

&lt;h2&gt;
  
  
  Ephemeral Environments using GitOps
&lt;/h2&gt;

&lt;p&gt;Ephemeral environments, also known as preview environments, are short-lived environments that allow developers to test and preview changes in a production-like environment before merging them into the main branch.&lt;/p&gt;

&lt;p&gt;These environments are typically created automatically when a pull request is opened and destroyed when the pull request is closed.&lt;/p&gt;

&lt;p&gt;In the context of Kubernetes, tools like Argo CD and Flux CD can automate the creation and management of ephemeral environments, making it easier to implement this practice in a GitOps workflow. For more information on how to implement preview environments on Kubernetes with Argo CD, check out this guide by &lt;a href="https://piotrminkowski.com/2023/06/19/preview-environments-on-kubernetes-with-argocd/" rel="noopener noreferrer"&gt;Piotr Minkowski&lt;/a&gt;.&lt;/p&gt;
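
&lt;p&gt;With Argo CD, one common implementation is an ApplicationSet using the pull request generator; the sketch below (owner, repo, and path are placeholders) creates an Application per open pull request and removes it when the PR closes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-envs
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: my-org
          repo: my-app
        requeueAfterSeconds: 300
  template:
    metadata:
      name: 'preview-{{number}}'
    spec:
      project: default
      source:
        repoURL: 'https://github.com/my-org/my-app.git'
        targetRevision: '{{head_sha}}'
        path: manifests
      destination:
        server: 'https://kubernetes.default.svc'
        namespace: 'preview-{{number}}'
      syncPolicy:
        automated: {}
        syncOptions:
          - CreateNamespace=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;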

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GitOps is a game-changer for managing infrastructure and applications. It boosts consistency, reliability, collaboration, and workflow. Tools like Argo CD and Flux CD exemplify how GitOps streamlines deployment and enhances efficiency. Our comparison shows the strengths and specific use cases of both tools, highlighting how they make GitOps implementation seamless and effective.&lt;/p&gt;

</description>
      <category>gitops</category>
      <category>devops</category>
    </item>
    <item>
      <title>K3s vs Talos Linux</title>
      <dc:creator>Unnati Mishra</dc:creator>
      <pubDate>Mon, 22 Jul 2024 10:40:46 +0000</pubDate>
      <link>https://forem.com/cloudraft/k3s-vs-talos-linux-2dg1</link>
      <guid>https://forem.com/cloudraft/k3s-vs-talos-linux-2dg1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the world of Kubernetes, choosing the right technology can make a big difference in how smoothly and efficiently our applications run. This is where focused Kubernetes distributions like K3s and Talos Linux stand out.&lt;/p&gt;

&lt;p&gt;From large data centers to smaller devices on the edge, Kubernetes plays an important role in managing applications across various environments. As many businesses now run AI workloads on Kubernetes at the edge, specialized distributions like K3s and Talos have emerged to tackle the resulting operational challenges.&lt;/p&gt;

&lt;p&gt;K3s is known for being lightweight and easy to install, which makes it great for places with limited resources like edge computing and IoT. Meanwhile, Talos provides a more secure environment and is used for large-scale setups.&lt;/p&gt;

&lt;p&gt;In this blog, we will discuss how K3s and Talos fit into Kubernetes deployments and the differences between the two. This will help you make the perfect choice based on your needs and goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is K3s?
&lt;/h2&gt;

&lt;p&gt;K3s was developed by Rancher Labs and donated to the CNCF. It is packaged as a single binary of less than 40 MB, which reduces the dependencies and steps needed to install, run, and auto-update a production Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;It is a lightweight yet powerful Kubernetes distribution designed for production workloads across IoT devices or resource-restrained remote locations. The main aim of K3s is to streamline the installation and management of Kubernetes clusters. It is easy to install and highly available.&lt;/p&gt;
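
&lt;p&gt;For instance, a single-node K3s server can be installed with the official convenience script (run on the node; requires root):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -sfL https://get.k3s.io | sh -
# verify that the node has registered
sudo k3s kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;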

&lt;h3&gt;
  
  
  How is K3s different from Kubernetes?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;K3s is lightweight compared to the full distribution of Kubernetes.&lt;/li&gt;
&lt;li&gt;It has fewer dependencies.&lt;/li&gt;
&lt;li&gt;It is easier to deploy and manage.&lt;/li&gt;
&lt;li&gt;It uses fewer resources (i.e. CPU, RAM, etc).&lt;/li&gt;
&lt;li&gt;It has fewer built-in features and extensions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;K3s is ideal for smaller resource-constrained deployments, edge computing, and IoT while Kubernetes is more suited for large, complex deployments that have high resource requirements such as big data, machine learning, and high-performance computing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Talos Linux?
&lt;/h2&gt;

&lt;p&gt;Talos Linux is a modern Linux operating system distribution, written in Golang, that has specifically been built for the purpose of Kubernetes infrastructure. It has been designed to serve as the foundation for Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;In Talos, the cluster is accessed through APIs, which removes the need for SSH access and thereby shrinks the attack surface. It also helps avoid unexpected issues by creating an immutable layer on top of physical servers, ensuring that all servers are identical and share the same setup. Since it is API-managed, operations are automated, straightforward, and scalable.&lt;/p&gt;

&lt;p&gt;You can read more about Talos in &lt;a href="https://dev.to/blog/making-kubernetes-simple-with-talos"&gt;this post&lt;/a&gt;.&lt;/p&gt;
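
&lt;p&gt;To make the API-driven model concrete, a typical Talos bootstrap looks roughly like this (node IPs are placeholders; the machines must already be booted from a Talos image):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# generate machine configuration for a cluster named 'demo'
talosctl gen config demo https://10.0.0.10:6443
# apply the config to the first control-plane node
talosctl apply-config --insecure --nodes 10.0.0.10 --file controlplane.yaml
# bootstrap etcd, then retrieve a kubeconfig for kubectl access
talosctl bootstrap --nodes 10.0.0.10 --endpoints 10.0.0.10
talosctl kubeconfig --nodes 10.0.0.10 --endpoints 10.0.0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;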

&lt;h2&gt;
  
  
  The Differences between K3s and Talos Linux
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Talos Linux&lt;/th&gt;
&lt;th&gt;K3s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Small in size&lt;/td&gt;
&lt;td&gt;Medium in size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OS For running the Kubernetes cluster&lt;/td&gt;
&lt;td&gt;Lightweight Kubernetes distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Installation and Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More complex setup, though it can be simplified&lt;/td&gt;
&lt;td&gt;Simple setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal, immutable OS; no SSH access or shell; API-driven configuration and management&lt;/td&gt;
&lt;td&gt;Lightweight, single-binary; integrates container runtime, networking, and storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Has a strong focus on security with an immutable file system, no interactive login (SSH), and API-driven interactions&lt;/td&gt;
&lt;td&gt;Follows essential security practices like RBAC, TLS encryption, automatic updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Requirements&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires sufficient resources for effective Kubernetes operation; not for resource-constrained environments&lt;/td&gt;
&lt;td&gt;Low resource requirements; suitable for low-power devices like IoT and edge devices.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports scalable Kubernetes clusters in production environments; handles large-scale deployments&lt;/td&gt;
&lt;td&gt;Supports clustering and high availability; generally used for smaller-scale deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Management and Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed through APIs; automated management with minimal manual intervention; less frequent maintenance and patching due to immutable infrastructure&lt;/td&gt;
&lt;td&gt;Simplified management with standard Kubernetes tools and interfaces; easy to update and maintain; suitable for environments requiring ease of management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Community and Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growing community focused on security and production-grade deployments; strong documentation, community forums, and resources.&lt;/td&gt;
&lt;td&gt;Active community backed by Rancher Labs (part of SUSE); extensive documentation, community support, and commercial support options available through Rancher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Usage of K3s and Talos Linux
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used for lightweight and resource-constrained environments.&lt;/li&gt;
&lt;li&gt;Perfect for edge computing, IoT, and development and testing scenarios.&lt;/li&gt;
&lt;li&gt;Helps in easy management and faster deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Talos Linux&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good fit for edge devices due to its security, reliability, and immutable design.&lt;/li&gt;
&lt;li&gt;An excellent option for deploying Kubernetes on bare-metal servers.&lt;/li&gt;
&lt;li&gt;Highly suitable for enterprise-level Kubernetes clusters.&lt;/li&gt;
&lt;li&gt;Supports cloud and virtualization platforms as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between K3s and Talos Linux hinges on their specific use cases and future needs. It can be observed that the demand for lightweight Kubernetes is rising significantly. Industries have started to embrace edge computing, IoT, and other resource-constrained environments, making the ability to efficiently manage applications with minimal infrastructure of extreme importance.&lt;/p&gt;

&lt;p&gt;As the demand for lightweight and efficient Kubernetes solutions grows, K3s is set to play a crucial role in enabling seamless, scalable application management in resource-limited environments. Meanwhile, Talos Linux will continue to be a robust choice for enterprises prioritizing security and reliability.&lt;/p&gt;

&lt;p&gt;To conclude, the choice between K3s and Talos Linux should be guided by specific deployment needs, resource availability, and security considerations. Organizations can effectively meet their Kubernetes deployment goals by understanding the strengths of each and choosing accordingly.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Expert Guide on Selecting Observability Products</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 13 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</link>
      <guid>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</guid>
      <description>&lt;h2&gt;
  
  
  Guide to select Observability tools and products
&lt;/h2&gt;

&lt;p&gt;In today's digital landscape, businesses are constantly striving to stay ahead of the curve. The ability to deliver exceptional customer experiences, maintain system reliability, and optimize performance has become a crucial differentiator. Enter observability – the linchpin of modern IT operations that empowers organizations to achieve operational excellence, drive cost-efficiency, and continuously enhance their services.&lt;/p&gt;

&lt;p&gt;The rise of cloud-native architectures has revolutionized the way applications are built and deployed. These modern systems leverage dynamic, virtualized infrastructure to provide unparalleled flexibility and automation. By enabling on-demand scaling and global accessibility, cloud-native approaches have become a catalyst for innovation and agility in the business world.&lt;/p&gt;

&lt;p&gt;However, this shift brings new challenges. Unlike traditional monolithic systems, cloud-native applications are composed of numerous microservices distributed across various teams, platforms, and geographic locations. This decentralized nature makes it increasingly complex to monitor and maintain system health effectively.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the essential characteristics of a robust observability solution and provide guidance on selecting the right tools to meet your organization's unique needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evolution in Observability Space
&lt;/h2&gt;

&lt;p&gt;The evolution of observability over the last two decades has been characterized by significant technological advancements and changing industry needs. Let's explore this journey in more detail:&lt;/p&gt;

&lt;p&gt;In the early 2000s, observability faced its first major challenge with the explosion of log data. Organizations struggled with a lack of comprehensive solutions for instrumenting, generating, collecting, and visualizing this information. This gap in the market led to the rise of Splunk, which quickly became a dominant player by offering robust log management capabilities. As the decade progressed, the rapid growth of internet-based services and distributed systems introduced new complexities. This shift necessitated more sophisticated Application Performance Management (APM) solutions, paving the way for industry leaders like DynaTrace, New Relic, and AppDynamics to emerge and address these evolving needs.&lt;/p&gt;

&lt;p&gt;The dawn of the 2010s brought about a paradigm shift with the advent of microservices architecture and cloud computing. These technologies dramatically increased the complexity of IT environments, creating a demand for observability solutions that prioritized developer experience. This wave saw the birth of innovative platforms such as DataDog, Grafana, Sentry, and &lt;a href="https://www.cloudraft.io/prometheus-consulting" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, each offering unique approaches to monitoring and visualizing system performance. As we moved into the latter half of the decade, the industry faced a new challenge: skyrocketing observability costs due to the massive ingestion of Metrics, Events, Logs, and Traces (MELT). While monitoring capabilities had greatly improved, debugging remained a largely manual and time-consuming process, especially in the face of increasingly complex Kubernetes and serverless architectures. Some products like Datadog, Grafana, SigNoz, &lt;a href="https://www.cloudraft.io/blog/cloudraft-kloudmate-partnership" rel="noopener noreferrer"&gt;KloudMate&lt;/a&gt;, Honeycomb, Kloudfuse, &lt;a href="https://www.cloudraft.io/thanos-support" rel="noopener noreferrer"&gt;Thanos&lt;/a&gt;, Coroot, and VictoriaMetrics tackled these new challenges head-on.&lt;/p&gt;

&lt;p&gt;The early to mid-2020s have ushered in a new era of observability, characterized by innovative approaches to data storage and analysis. Industry standards like OpenTelemetry have gained widespread adoption, and products are now aligning with this standard. To optimize costs, observability pipelines are being used to filter and route data to various backends, automatically handling high cardinality data that was often a pain point at scale. We've also seen the adoption of high-performance databases like &lt;a href="https://www.cloudraft.io/clickhouse-consulting" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt; for &lt;a href="https://clickhouse.com/use-cases/logging-and-metrics" rel="noopener noreferrer"&gt;monitoring purposes&lt;/a&gt;, often becoming the backend of choice for observability products. The emergence of eBPF technology has provided deep insights into system performance and inter-entity relationships. Due to the increased adoption of the Rust programming language for its high performance, some observability tools such as Vector and various agents have become lightweight and more efficient, allowing for further scalability. Products like Quickwit (&lt;a href="https://quickwit.io/blog/quickwit-binance-story" rel="noopener noreferrer"&gt;see how Binance is storing 100PB logs&lt;/a&gt;) have introduced cost-effective and scalable solutions for storing logs and metrics directly on object storage. Perhaps most significantly, we're witnessing the integration of artificial intelligence into observability tools, enabling causal analysis and faster problem resolution. This AI-driven approach is helping organizations quickly narrow down issues in their increasingly complex environments, marking a new frontier in the observability landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systems are getting Complex
&lt;/h2&gt;

&lt;p&gt;In the realm of modern, distributed systems, traditional monitoring approaches fall short. These conventional methods rely on predetermined failure scenarios, which prove inadequate when dealing with the intricate, interconnected nature of today's cloud-based architectures. The unpredictability of these complex systems demands a more sophisticated approach to observability.&lt;/p&gt;

&lt;p&gt;Enter the new generation of cloud monitoring tools. These advanced solutions are designed to navigate the labyrinth of distributed systems, drawing connections between seemingly disparate data points without the need for explicit configuration. Their power lies in their ability to uncover hidden issues and correlate information across various contexts, providing a holistic view of system health.&lt;/p&gt;

&lt;p&gt;Consider this scenario: a user reports an error in a mobile application. In a world of microservices, pinpointing the root cause can be like finding a needle in a haystack. However, with these cutting-edge monitoring tools, engineers can swiftly trace the issue back to its origin, even if it's buried deep within one of countless backend services. This capability not only accelerates root cause analysis but also significantly reduces mean time to resolution (MTTR).&lt;/p&gt;

&lt;p&gt;But the benefits don't stop at troubleshooting. These tools can play a crucial role in refining deployment strategies. By providing real-time feedback on new rollouts, they enable more sophisticated deployment techniques such as canary releases or blue-green deployments. This proactive approach allows for automatic rollbacks of problematic changes, mitigating potential issues before they impact end-users.&lt;/p&gt;

&lt;p&gt;As the cloud-native landscape continues to evolve, selecting the right monitoring stack becomes paramount. To maximize the benefits of modern observability, it's crucial to choose a solution that not only meets your current needs but also aligns with your future goals and the ever-changing demands of cloud-based architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Essential Features of Robust Observability Solutions
&lt;/h2&gt;

&lt;p&gt;In today's complex digital landscapes, selecting the right observability tools is crucial. Let's explore the key attributes that make an observability solution truly effective and aligned with observability best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Holistic Monitoring Capabilities
&lt;/h3&gt;

&lt;p&gt;A comprehensive observability platform should adeptly handle the four pillars of telemetry data, collectively known as MELT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: Quantitative indicators of system health, such as CPU utilization&lt;/li&gt;
&lt;li&gt;Events: Significant system occurrences or state changes&lt;/li&gt;
&lt;li&gt;Logs: Detailed records of system activities and operations&lt;/li&gt;
&lt;li&gt;Traces: Request pathways through the system, illuminating performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ideal solution seamlessly integrates these data types, providing a cohesive view of your system's health.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Data Analysis and Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Modern systems often exhibit unpredictable behavior patterns, rendering static alert thresholds ineffective. Advanced observability tools employ machine learning to detect anomalies without explicit configuration, while still allowing for customization. By correlating anomalies across various telemetry types, these systems can perform automated root cause analysis, significantly reducing troubleshooting time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sophisticated Alerting and Incident Management
&lt;/h3&gt;

&lt;p&gt;Real-time alerting is the backbone of effective observability. A top-tier solution should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert on both customizable thresholds and AI-detected anomalies&lt;/li&gt;
&lt;li&gt;Consolidate related alerts into actionable incidents&lt;/li&gt;
&lt;li&gt;Enrich incidents with contextual data, runbooks, and team information&lt;/li&gt;
&lt;li&gt;Intelligently route incidents to appropriate personnel&lt;/li&gt;
&lt;li&gt;Trigger automated remediation workflows when applicable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To combat alert fatigue, the system should employ intelligent alert suppression, prioritization, and escalation mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Driven Insights
&lt;/h3&gt;

&lt;p&gt;Analytics derived from telemetry data drive continuous improvement. Key metrics to track include Mean Time to Repair (MTTR), Mean Time to Acknowledge (MTTA), and various Service Level Objectives (SLOs). These insights facilitate post-incident analysis, helping teams prevent future issues and optimize system performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extensive Integration Ecosystem
&lt;/h3&gt;

&lt;p&gt;A versatile observability solution should seamlessly integrate with your entire tech stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular programming languages and frameworks&lt;/li&gt;
&lt;li&gt;Open-source standards (OpenTelemetry, OpenMetrics, StatsD)&lt;/li&gt;
&lt;li&gt;Container orchestration platforms (Docker, Kubernetes)&lt;/li&gt;
&lt;li&gt;Security tools for vulnerability scanning&lt;/li&gt;
&lt;li&gt;Incident management systems&lt;/li&gt;
&lt;li&gt;CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Major cloud platforms&lt;/li&gt;
&lt;li&gt;Team collaboration tools&lt;/li&gt;
&lt;li&gt;Business intelligence platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Optimization
&lt;/h3&gt;

&lt;p&gt;As applications grow in scale and complexity, managing observability costs becomes challenging. Look for tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify underutilized resources and forecast future needs&lt;/li&gt;
&lt;li&gt;Employ intelligent data sampling and retention policies&lt;/li&gt;
&lt;li&gt;Efficiently handle high-cardinality data&lt;/li&gt;
&lt;li&gt;Utilize cutting-edge technologies like eBPF for improved performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Intuitive User Experience
&lt;/h3&gt;

&lt;p&gt;An observability platform's UI/UX is critical for efficient debugging and insight gathering. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear visualizations of system components and their relationships&lt;/li&gt;
&lt;li&gt;Pre-configured dashboards for common scenarios&lt;/li&gt;
&lt;li&gt;Easy integration with your existing stack&lt;/li&gt;
&lt;li&gt;Comprehensive, user-friendly documentation&lt;/li&gt;
&lt;li&gt;Ability to slice and dice visualizations and fast response time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Simplicity
&lt;/h3&gt;

&lt;p&gt;Scaling observability across an organization can be daunting. Look for platforms that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support "everything-as-code" for standardization and version control&lt;/li&gt;
&lt;li&gt;Integrate smoothly with modern application platforms&lt;/li&gt;
&lt;li&gt;Offer automation-friendly interfaces&lt;/li&gt;
&lt;li&gt;Provide tools for managing observability at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost-Effective Data Management
&lt;/h3&gt;

&lt;p&gt;As data volumes grow, intelligent data lifecycle management becomes crucial. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tiered storage for different data types&lt;/li&gt;
&lt;li&gt;Advanced compression and deduplication techniques&lt;/li&gt;
&lt;li&gt;Intelligent data sampling strategies&lt;/li&gt;
&lt;li&gt;Efficient handling of high-cardinality data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Alignment with Industry Standards
&lt;/h3&gt;

&lt;p&gt;Choosing tools that support industry-standard protocols and frameworks (like OpenTelemetry, PromQL, and Grafana) ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier integration with existing systems&lt;/li&gt;
&lt;li&gt;Vendor-independent implementations&lt;/li&gt;
&lt;li&gt;Flexibility to change backends without code modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Organizational Fit
&lt;/h3&gt;

&lt;p&gt;When selecting an observability solution, consider your organization's unique needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System complexity and scale&lt;/li&gt;
&lt;li&gt;User base characteristics&lt;/li&gt;
&lt;li&gt;Budget constraints&lt;/li&gt;
&lt;li&gt;Team skills and expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prioritize platforms that cover your full stack, tying surface-level symptoms to root causes. Ensure the chosen solution integrates seamlessly with your current tech stack, DevSecOps processes, and team workflows. The ideal observability solution balances comprehensive insights with practical considerations, providing a powerful yet feasible tool for your organization's needs. Ideally, you want one tool, or a small set of tools, effective enough to justify its cost while minimizing context switching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Selecting the ideal observability solution is a nuanced process that demands a deep understanding of your organization's unique ecosystem. It's not just about collecting data; it's about gaining actionable insights that drive meaningful improvements in your systems and processes.&lt;/p&gt;

&lt;p&gt;The journey to effective observability requires a careful balance between comprehensive coverage and practical implementation. Your chosen solution should seamlessly integrate with your existing tech stack, enhancing rather than disrupting your current workflows. It's crucial to find a tool that not only provides rich, full-stack visibility but also aligns with your team's skills, your budget constraints, and your overall operational goals.&lt;/p&gt;

&lt;p&gt;Remember, observability is a double-edged sword. When implemented effectively, it can provide unprecedented insights into your systems, enabling proactive problem-solving and continuous improvement. However, if not approached thoughtfully, it can lead to unnecessary complexity, spiraling costs, and a false sense of security. The risk of "running half blind" with suboptimal observability practices is real and can have significant implications for your operations and bottom line.&lt;/p&gt;

&lt;p&gt;In this complex landscape, partnering with experts can make all the difference. CloudRaft, with &lt;a href="https://www.cloudraft.io/observability-consulting" rel="noopener noreferrer"&gt;its deep expertise in observability&lt;/a&gt; and extensive partnerships in the field, stands ready to guide you through this journey. Our experience can help you rapidly adopt and optimize modern observability practices, ensuring you reap the full benefits of these powerful tools without falling into common pitfalls.&lt;/p&gt;

&lt;p&gt;By choosing the right observability solution and implementation approach, you're not just collecting data – you're empowering your team with the insights they need to drive innovation, enhance performance, and deliver exceptional user experiences. In today's fast-paced digital environment, that's not just an advantage – it's a necessity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anjul Sahu&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/anjul" rel="noopener noreferrer"&gt;Anjul&lt;/a&gt; is a leading expert and thought leader in observability. Over the last decade and a half, he has seen every wave of how observability and monitoring have evolved in large-scale organizations such as telcos, banks, and Internet startups. He also advises investors and product companies on current trends in observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Madhukar Mishra&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/madhukar-mishra-b55593b8/" rel="noopener noreferrer"&gt;Madhukar&lt;/a&gt; has over a decade of experience building platforms, from a leading e-commerce company in India to companies building Internet-scale products. He is interested in large-scale distributed systems and is a thought leader in developer productivity and SRE.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>cloudraft</category>
      <category>opentelemetry</category>
      <category>thanos</category>
    </item>
    <item>
      <title>Linux Troubleshooting For SREs</title>
      <dc:creator>Madhuri Malviya</dc:creator>
      <pubDate>Fri, 10 Nov 2023 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/linux-troubleshooting-for-sres-44fn</link>
      <guid>https://forem.com/cloudraft/linux-troubleshooting-for-sres-44fn</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;As a Linux user or administrator, mastering the art of troubleshooting is crucial. Regardless of how well-designed and optimized your systems may be, issues are bound to arise from time to time. These can range from minor hiccups to critical problems that hinder the performance and availability of your Linux machines or containers. In this article, we will explore real-life examples of performance issues and provide a collection of useful Linux commands to troubleshoot everything from CPU and I/O to network and errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common causes of performance issues
&lt;/h3&gt;

&lt;p&gt;Performance issues can be caused by a variety of factors. Common causes include insufficient memory or CPU resources, disk I/O bottlenecks, network congestion, inefficient code, and bugs. Misconfigurations, outdated software, and runaway or zombie processes can also impact performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  The importance of Linux troubleshooting
&lt;/h3&gt;

&lt;p&gt;By diligently troubleshooting and resolving the issues, you not only ensure the smooth operation of your systems but also minimize the MTTR (Mean Time to Repair) – the average time it takes to fix a problem.&lt;/p&gt;

&lt;p&gt;Mastering Linux troubleshooting allows you to swiftly diagnose and resolve performance bottlenecks, errors, and other issues that can potentially disrupt your operations.&lt;/p&gt;

&lt;p&gt;To diagnose and resolve problems on your Linux machines efficiently, there are two widely used approaches: the RED and USE methodologies.&lt;/p&gt;

&lt;h4&gt;
  
  
  RED Methodology
&lt;/h4&gt;

&lt;p&gt;The RED methodology focuses on three indicators: Rate, Errors, and Duration. It is aimed at request-driven systems such as modern web applications, with the goal of resolving performance issues and keeping the application running smoothly. Suppose your service is becoming unresponsive; to resolve the issue, you first look into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate&lt;/strong&gt;: It measures the number of requests that the service receives per unit of time. An unexpectedly high request rate could be indicative of an increased load on the server, causing performance issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error&lt;/strong&gt;: This metric tracks the number of errors that occur during the processing of requests. When dealing with a slow server, monitoring for errors is crucial in identifying any issues or bugs within the server's processing logic. It would pinpoint the root cause of inefficiency and you will be able to resolve the issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt;: It measures the time taken by the server to process each request. For a slow service, analyzing the duration helps identify the specific requests or processes that are taking longer than usual to complete. By identifying the slow-performing components, you can focus on optimizing those areas.&lt;/p&gt;

&lt;h4&gt;
  
  
  USE Methodology
&lt;/h4&gt;

&lt;p&gt;The USE methodology focuses on identifying problems with system resources using three criteria: Utilization, Saturation, and Errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utilization:&lt;/strong&gt; Here we monitor how resources are used and whether they are being pushed to their limits. High utilization leads to slower performance because no more work can be accepted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturation&lt;/strong&gt;: When processes wait a long time for a resource, that resource is saturated. When dealing with a slow server, monitoring saturation helps identify backlogs or queues that are delaying request processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors&lt;/strong&gt;: This looks for error events, such as failed operations or system warnings, that can cause your system to hang, slow down, or crash. By examining error rates and types, you can pinpoint specific areas where errors are prevalent, helping you identify the root causes of the slowdown.&lt;/p&gt;

&lt;p&gt;In the next section, we will explore real-life examples of common problems in Linux systems and walk through step-by-step solutions using powerful commands and techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential troubleshooting commands and techniques
&lt;/h3&gt;

&lt;p&gt;Let's dive into the different scenarios in which we can use these troubleshooting mechanisms. In this article, we are using Ubuntu and an Intel processor; if you are on a different system or architecture, the output will be slightly different.&lt;br&gt;
Suppose you are on call and have an incident that requires troubleshooting a performance issue on a Linux machine or container. Don't worry, we've got you covered! Here are some commands to come to your rescue.&lt;/p&gt;
&lt;h4&gt;
  
  
  top
&lt;/h4&gt;

&lt;p&gt;This command provides real-time information about system resource usage, including CPU, memory, and running processes. It is easy to get overwhelmed by this output, so here is what to look for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;top - 20:31:34 up 1 day,  6:05,  1 user,  load average: 0.50, 0.65, 0.57
      Tasks:  88 total,   1 running,  87 sleeping,   0 stopped,   0 zombie
%Cpu&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:  0.0 us,  0.0 sy,  0.0 ni, 99.6 &lt;span class="nb"&gt;id&lt;/span&gt;,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :    941.6 total,    214.6 free,    197.8 used,    529.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    577.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2180 ubuntu    20   0   17208   5580   3144 S   0.3   0.6   0:01.75 sshd
   6387 ubuntu    20   0   10776   3860   3288 R   0.3   0.4   0:00.01 top
      1 root      20   0  102004  10812   6096 S   0.0   1.1   0:05.79 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.03 kthreadd
      3 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      5 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq
      6 root       0 &lt;span class="nt"&gt;-20&lt;/span&gt;       0      0      0 I   0.0   0.0   0:00.00 netns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Load Average&lt;/code&gt;- indicates the average system load over the past 1, 5, and 15 minutes, respectively. A load average of 1 represents a fully utilized single-core CPU. Higher values indicate an increasingly overloaded system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Zombie Processes&lt;/code&gt;- processes that have finished execution but still occupy an entry in the process table. If any are present, a parent process is failing to reap its children, which points to a process management issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;%CPU&lt;/code&gt;- shows the share of CPU time each process is using. Sustained values near 100% indicate a CPU-bound process that may be starving others.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;S&lt;/code&gt;- shows the process state. If your web server is not responding at all, you may see the "D" (uninterruptible sleep) state, indicating the process is stuck, typically waiting on I/O.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RES&lt;/code&gt; (resident memory usage)- indicates the amount of physical memory being used by the application. If the system is actively paging, this value can be exceptionally high, indicating that the process is demanding more memory than is physically available. That can lead to frequent swapping between RAM and swap space, causing a performance bottleneck.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SHR&lt;/code&gt; (shared memory)- if this value is high, it could suggest that the application relies heavily on shared libraries or engages in unnecessary data sharing, leading to resource contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a process consuming large amounts of virtual memory is slowing your system down, you can find it through the &lt;code&gt;VIRT&lt;/code&gt; column.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
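&lt;p&gt;If &lt;code&gt;top&lt;/code&gt; reports zombies, you can enumerate them even on a minimal system with no extra tools installed. A small sketch that walks &lt;code&gt;/proc&lt;/code&gt; directly; the &lt;code&gt;State:&lt;/code&gt; line in &lt;code&gt;/proc/PID/status&lt;/code&gt; holds the same process state shown in top's &lt;code&gt;S&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Walk every numeric entry in /proc and print PIDs whose state is Z (zombie)
for pid in /proc/[0-9]*; do
    state=$(awk '/^State:/ {print $2}' "$pid/status" 2&amp;gt;/dev/null)
    if [ "$state" = "Z" ]; then
        echo "zombie pid: ${pid##*/}"
    fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the loop prints anything, check the parent (the &lt;code&gt;PPid:&lt;/code&gt; line in the same file) to see which process is failing to reap its children.&lt;/p&gt;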

&lt;h4&gt;
  
  
  sar
&lt;/h4&gt;

&lt;p&gt;This command collects, reports, and saves system activity information over a period of time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;sar &lt;span class="nt"&gt;-n&lt;/span&gt; TCP,ETCP 1
Linux 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;top-gerbil&lt;span class="o"&gt;)&lt;/span&gt;       11/07/23        _x86_64_        &lt;span class="o"&gt;(&lt;/span&gt;1 CPU&lt;span class="o"&gt;)&lt;/span&gt;

21:29:02     active/s  passive/s  iseg/s    oseg/s
21:29:03         0.00      0.00       0.00         0.00

21:29:02     atmptf/s  estres/s  retrans/s   isegerr/s       orsts/s
21:29:03         0.00      0.00      0.00         0.00              0.00

21:29:03     active/s  passive/s    iseg/s    oseg/s
21:29:04         0.00      0.00       11.00        11.00

21:29:03     atmptf/s  estres/s   retrans/s isegerr/s       orsts/s
21:29:04         0.00      0.00      0.00         0.00             0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can check the network interface statistics, which provide metrics such as bytes transmitted and received, packet counts, errors, and drops, all useful for monitoring network performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also check information about process queues and scheduling activity (for example with &lt;code&gt;sar -q&lt;/code&gt;), which helps resolve issues related to process management and scheduling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;active/s&lt;/code&gt; and &lt;code&gt;passive/s&lt;/code&gt;- help identify why your web server is unresponsive. &lt;code&gt;active/s&lt;/code&gt; counts TCP connections the machine initiates, while &lt;code&gt;passive/s&lt;/code&gt; counts incoming connections. A higher-than-usual &lt;code&gt;passive/s&lt;/code&gt; can indicate a sudden spike in traffic, or even a DoS attack, with incoming requests not being processed efficiently due to a lack of resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;retrans/s&lt;/code&gt;&lt;/em&gt;- helps identify network congestion or unreliable links. For example, if file transfers are slow because the network suffers high packet loss, fixing the loss reduces retransmissions and increases transfer speed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;estres/s&lt;/code&gt;- the number of established TCP connections reset per second; a consistently high value suggests connections are being dropped unexpectedly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;orsts/s&lt;/code&gt;- the number of TCP RST segments sent per second. A high value means connections are being aborted, which can point to application errors or unreliable links and suggests low quality of service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;iseg/s&lt;/code&gt;- the number of TCP segments received per second. A high value indicates a surge in incoming requests, which can stress your network infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
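&lt;p&gt;&lt;code&gt;sar&lt;/code&gt; comes from the sysstat package. When it is not installed, the cumulative kernel counters it derives its TCP rates from are still readable in &lt;code&gt;/proc/net/snmp&lt;/code&gt;. A rough sketch that pairs the header row with the value row and pulls the totals behind &lt;code&gt;retrans/s&lt;/code&gt; and &lt;code&gt;oseg/s&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The two Tcp: lines hold counter names and cumulative values since boot
$ grep '^Tcp:' /proc/net/snmp | awk '
    NR==1 { for (i=2; i&amp;lt;=NF; i++) name[i] = $i }
    NR==2 { for (i=2; i&amp;lt;=NF; i++)
              if (name[i] == "OutSegs" || name[i] == "RetransSegs")
                  print name[i], $i }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sampling the same counters twice and subtracting gives the per-second rates sar reports.&lt;/p&gt;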

&lt;h4&gt;
  
  
  free
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;free&lt;/code&gt; command is used to display the amount of free and used memory in the system, including both physical and swap memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~&lt;span class="nv"&gt;$ &lt;/span&gt;free &lt;span class="nt"&gt;-m&lt;/span&gt;
               total        used        free      shared  buff/cache   available
Mem:            7828        1896        2996        1010        2935        4382
Swap:           16023           0       16023
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;shared&lt;/code&gt;&lt;/em&gt;- memory used by tmpfs filesystems and shared memory segments (the &lt;code&gt;Shmem&lt;/code&gt; value from &lt;code&gt;/proc/meminfo&lt;/code&gt;). If it is high, in-memory filesystems or applications making heavy use of shared memory may be responsible for high memory usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;buff/cache&lt;/code&gt;-&lt;/em&gt; memory used by kernel buffers and the page cache for frequently accessed files. A large value here is normal and generally improves I/O performance, since this memory is reclaimed when applications need it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
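&lt;p&gt;&lt;code&gt;free&lt;/code&gt; itself reads &lt;code&gt;/proc/meminfo&lt;/code&gt;, so the same numbers can be checked directly. A quick sketch; &lt;code&gt;MemAvailable&lt;/code&gt; is the kernel's estimate of how much memory new workloads can use without swapping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# /proc/meminfo reports kB; divide by 1024 to match free -m's MiB
$ awk '/^MemTotal:|^MemAvailable:/ {printf "%s %d MiB\n", $1, $2/1024}' /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;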

&lt;h4&gt;
  
  
  vmstat
&lt;/h4&gt;

&lt;p&gt;This command is used to report virtual memory statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;vmstat 1
procs &lt;span class="nt"&gt;-----------memory----------&lt;/span&gt; &lt;span class="nt"&gt;---swap--&lt;/span&gt; &lt;span class="nt"&gt;-----io----&lt;/span&gt; &lt;span class="nt"&gt;-system--&lt;/span&gt; &lt;span class="nt"&gt;------cpu-----&lt;/span&gt;
 r  b   swpd   free   buff  cache   si   so    bi    bo   &lt;span class="k"&gt;in   &lt;/span&gt;cs us sy &lt;span class="nb"&gt;id &lt;/span&gt;wa st
 0  0      0 219264  20164 524172    0    0    17    28  352   55  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  346   94  0  1 99  0  0
 0  0      0 219264  20164 524176    0    0     0     0  296   82  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  321   71  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  348   83  0  1 99  0  0
 0  0      0 219264  20164 524176    0    0     0     0  324   75  0  0 100  0  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;r&lt;/code&gt;-&lt;/em&gt; indicates the number of runnable processes (running or waiting for CPU). If your web server is slow and &lt;code&gt;r&lt;/code&gt; is consistently higher than the number of CPU cores, many processes are competing for CPU time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;swpd&lt;/code&gt;, &lt;code&gt;free&lt;/code&gt;, &lt;code&gt;buff&lt;/code&gt;, &lt;code&gt;cache&lt;/code&gt;, &lt;code&gt;si&lt;/code&gt;, &lt;code&gt;so&lt;/code&gt;&lt;/em&gt;- describe memory: how much is free or cached, and how much is being swapped in and out from disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;in&lt;/code&gt; and &lt;code&gt;cs&lt;/code&gt;-&lt;/em&gt; indicate the number of interrupts and context switches per second. A high &lt;code&gt;cs&lt;/code&gt; value points to frequent context switching, which can reduce CPU performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;id&lt;/code&gt; and &lt;code&gt;wa&lt;/code&gt;&lt;/em&gt;- indicate the percentage of time the CPU is idle and the time spent waiting for I/O operations. A high &lt;code&gt;wa&lt;/code&gt; value signals an I/O bottleneck slowing the CPU down, while a high &lt;code&gt;id&lt;/code&gt; value means the CPU has headroom for more work.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
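&lt;p&gt;The &lt;code&gt;cs&lt;/code&gt; column is a rate computed from a cumulative kernel counter, so you can reproduce it by sampling &lt;code&gt;/proc/stat&lt;/code&gt; yourself. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The ctxt line in /proc/stat is total context switches since boot;
# sample it twice, one second apart, to get a per-second rate
a=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
b=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches per second: $((b - a))"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;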

&lt;h4&gt;
  
  
  mpstat
&lt;/h4&gt;

&lt;p&gt;You are working on a server running multiple applications that heavily rely on CPU resources. You noticed that some services are not responding as quickly as they should, and that there are occasional service disruptions.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;mpstat&lt;/code&gt; command which will display CPU usage statistics for all available processors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;mpstat &lt;span class="nt"&gt;-P&lt;/span&gt; ALL
Linux 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;top-gerbil&lt;span class="o"&gt;)&lt;/span&gt;    11/07/23        _x86_64_        &lt;span class="o"&gt;(&lt;/span&gt;1 CPU&lt;span class="o"&gt;)&lt;/span&gt;

23:44:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
23:44:28     all    0.04    0.01    0.09    0.04    0.00    0.19    0.00    0.00    0.00   99.63
23:44:28       0    0.04    0.01    0.09    0.04    0.00    0.19    0.00    0.00    0.00   99.63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;%usr&lt;/code&gt;&lt;/em&gt;- percentage of CPU time spent on user-level processes. If this is unusually high, it would indicate that certain user applications or processes are consuming excessive CPU resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;%sys&lt;/code&gt; -&lt;/em&gt; percentage of CPU time spent on system processes. If this parameter is high, it suggests that the kernel or system services are utilizing a substantial amount of CPU time, which might point to a system-level issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;%iowait&lt;/code&gt;- percentage of time CPU spends waiting for I/O operations. An increased value in this might imply that the system is experiencing I/O bottlenecks or storage-related problems, resulting in the CPU waiting for I/O operations to complete.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
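&lt;p&gt;&lt;code&gt;mpstat&lt;/code&gt;'s per-CPU percentages are computed from the &lt;code&gt;cpu&lt;/code&gt; lines in &lt;code&gt;/proc/stat&lt;/code&gt;, so even without sysstat installed you can eyeball the raw counters. A sketch; the fields are cumulative jiffies per category:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Fields after the CPU name: user, nice, system, idle, iowait, irq, softirq
$ awk '/^cpu/ {print $1, "user="$2, "system="$4, "idle="$5, "iowait="$6}' /proc/stat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A steadily growing &lt;code&gt;iowait&lt;/code&gt; counter on one CPU is the same signal as a high &lt;code&gt;%iowait&lt;/code&gt; in mpstat.&lt;/p&gt;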

&lt;h4&gt;
  
  
  iostat
&lt;/h4&gt;

&lt;p&gt;You are experiencing slow disk performance, resulting in delayed read/write operations and increased latency for applications reliant on disk access.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;iostat&lt;/code&gt; to monitor the I/O performance of the system's storage devices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;iostat &lt;span class="nt"&gt;-dx&lt;/span&gt; 5
Linux 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;top-gerbil&lt;span class="o"&gt;)&lt;/span&gt;    11/08/23        _x86_64_        &lt;span class="o"&gt;(&lt;/span&gt;1 CPU&lt;span class="o"&gt;)&lt;/span&gt;

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
loop0            0.00      0.02     0.00   0.00    0.65     8.97    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop1            0.00      0.01     0.00   0.00    1.95    16.89    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop2            0.01      0.47     0.00   0.00    0.83    43.30    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     1.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sda              0.25     15.21     0.04  14.47    1.67    60.24    0.31     25.99     0.47  60.09    5.01    83.61    0.00      0.00     0.00   0.00    0.00     0.00    0.07    3.38    0.00   0.12
sr0              0.00      0.00     0.00   0.00    0.70     2.92    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Columns such as &lt;code&gt;r/s&lt;/code&gt;, &lt;code&gt;w/s&lt;/code&gt;, &lt;code&gt;rkB/s&lt;/code&gt;, and &lt;code&gt;wkB/s&lt;/code&gt; record reads and writes per second and the corresponding throughput for each device.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;aqu-sz&lt;/code&gt;&lt;/em&gt;- indicates the average number of requests queued for the device. A value consistently greater than 1 suggests the device is saturated.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
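&lt;p&gt;Like the other sysstat tools, &lt;code&gt;iostat&lt;/code&gt; computes its rates from kernel counters, in this case &lt;code&gt;/proc/diskstats&lt;/code&gt;. A rough sketch of reading them directly; the device-name pattern is just an example and should match your disks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Field 3 is the device name, field 4 reads completed, field 8 writes completed
# (cumulative since boot; sample twice to turn them into rates)
$ awk '$3 ~ /^(sd|vd|nvme|xvd)/ {print $3, "reads="$4, "writes="$8}' /proc/diskstats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;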

&lt;h4&gt;
  
  
  df
&lt;/h4&gt;

&lt;p&gt;You received an alert that your disk partition is full and the system is becoming unresponsive.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;df&lt;/code&gt; command to display information about the disk space usage of file systems. It provides an overview of available, used, and total disk space, as well as the mounted file systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-kh&lt;/span&gt;
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            97M  1.2M   96M   2% /run
/dev/sda1       4.7G  2.3G  2.4G  50% /
tmpfs           482M     0  482M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15       98M  5.1M   93M   6% /boot/efi
tmpfs            97M  4.0K   97M   1% /run/user/1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Filesystem&lt;/code&gt;- identifies the device or remote share behind each mount point. If a user is unable to save a file on a network share, this column shows which filesystem that share is mounted from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;Used&lt;/code&gt;&lt;/em&gt;- indicate how much space is currently in use, if it’s close to the storage capacity you need to free up space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
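&lt;p&gt;Once &lt;code&gt;df&lt;/code&gt; tells you which filesystem is full, &lt;code&gt;du&lt;/code&gt; shows where the space went. A typical sketch; the path is an example, and &lt;code&gt;-x&lt;/code&gt; keeps du from crossing into other mounted filesystems:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Ten largest directories directly under /, human-readable, largest first
$ du -xh --max-depth=1 / 2&amp;gt;/dev/null | sort -rh | head -n 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Repeat the same command inside the biggest directory to drill down to the offending files.&lt;/p&gt;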

&lt;h4&gt;
  
  
  ifconfig
&lt;/h4&gt;

&lt;p&gt;You need to troubleshoot network connectivity issues on a Linux server.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ifconfig&lt;/code&gt; command (from the legacy net-tools package; &lt;code&gt;ip addr&lt;/code&gt; is its modern replacement) shows all configured network interfaces and their IP addresses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ifconfig
docker0: &lt;span class="nv"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4099&amp;lt;UP,BROADCAST,MULTICAST&amp;gt;  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:96ff:fed3:d4d2  prefixlen 64  scopeid 0x20&amp;lt;&lt;span class="nb"&gt;link&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        ether 02:42:96:d3:d4:d2  txqueuelen 0  &lt;span class="o"&gt;(&lt;/span&gt;Ethernet&lt;span class="o"&gt;)&lt;/span&gt;
        RX packets 4128  bytes 183296 &lt;span class="o"&gt;(&lt;/span&gt;183.2 KB&lt;span class="o"&gt;)&lt;/span&gt;
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6243  bytes 85480503 &lt;span class="o"&gt;(&lt;/span&gt;85.4 MB&lt;span class="o"&gt;)&lt;/span&gt;
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;inet&lt;/code&gt;&lt;/em&gt; and &lt;em&gt;&lt;code&gt;inet6&lt;/code&gt;&lt;/em&gt;- show the IPv4 and IPv6 addresses attached to the interface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;broadcast&lt;/code&gt;&lt;/em&gt;- helps identify broadcast-related issues; a misconfigured broadcast address can break features such as network discovery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;netmask&lt;/code&gt;- helps diagnose subnet-related problems that can cause communication issues between devices on different subnets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;mtu&lt;/code&gt;&lt;/em&gt;- the maximum transmission unit, i.e. the largest packet size that can be sent on the interface. Mismatched MTUs along a path can lead to fragmentation, ultimately hurting performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
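&lt;p&gt;The RX/TX statistics that ifconfig prints are exposed per interface under &lt;code&gt;/sys/class/net&lt;/code&gt;, which is handy on minimal systems where net-tools is not installed. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print byte and error counters for every network interface
for dev in /sys/class/net/*; do
    printf '%s rx_bytes=%s tx_bytes=%s rx_errors=%s\n' \
        "${dev##*/}" \
        "$(cat "$dev/statistics/rx_bytes")" \
        "$(cat "$dev/statistics/tx_bytes")" \
        "$(cat "$dev/statistics/rx_errors")"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A steadily climbing &lt;code&gt;rx_errors&lt;/code&gt; counter on one interface is the same signal as a non-zero "RX errors" line in ifconfig.&lt;/p&gt;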

&lt;h4&gt;
  
  
  dmesg
&lt;/h4&gt;

&lt;p&gt;A Linux server is experiencing hardware issues, such as disk errors or network interface failures. You must analyze the system logs to identify any potential hardware-related errors or warnings.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dmesg&lt;/code&gt; command provides information about hardware devices, system events, and potential issues encountered during system operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dmesg
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] Linux version 5.15.0-88-generic &lt;span class="o"&gt;(&lt;/span&gt;buildd@lcy02-amd64-058&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;gcc &lt;span class="o"&gt;(&lt;/span&gt;Ubuntu 11.4.0-1ubuntu1~22.04&lt;span class="o"&gt;)&lt;/span&gt; 11.4.0, GNU ld &lt;span class="o"&gt;(&lt;/span&gt;GNU Binutils &lt;span class="k"&gt;for &lt;/span&gt;Ubuntu&lt;span class="o"&gt;)&lt;/span&gt; 2.38&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c"&gt;#98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 (Ubuntu 5.15.0-88.98-generic 5.15.126)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] Command line: &lt;span class="nv"&gt;BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/boot/vmlinuz-5.15.0-88-generic &lt;span class="nv"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;UUID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5a569d86-b935-46dd-ae79-7a72a25b6a4c ro &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tty1 &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ttyS0
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] KERNEL supported cpus:
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000]   Intel GenuineIntel
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000]   AMD AuthenticAMD
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-provided physical RAM map:
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x0000000000000000-0x000000000009ffff] usable
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x0000000000100000-0x000000003e1b6fff] usable
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x000000003e1b7000-0x000000003e1fffff] reserved
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x000000003e200000-0x000000003eceefff] &lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x000000003f36b000-0x000000003ffeffff] reserved
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] BIOS-e820: &lt;span class="o"&gt;[&lt;/span&gt;mem 0x00000000ffc00000-0x00000000ffffffff] reserved
&lt;span class="o"&gt;[&lt;/span&gt;    0.000000] NX &lt;span class="o"&gt;(&lt;/span&gt;Execute Disable&lt;span class="o"&gt;)&lt;/span&gt; protection: active
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By analyzing the logs, you can identify any hardware issues or error messages that could be affecting the server's performance and stability.&lt;/p&gt;

&lt;h4&gt;
  
  
  journalctl
&lt;/h4&gt;

&lt;p&gt;This command displays the logs collected by the systemd journal, including kernel messages and the logs of system services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;journalctl
Nov 01 17:15:42 ubuntu kernel: Linux version 5.15.0-87-generic &lt;span class="o"&gt;(&lt;/span&gt;buildd@lcy02-amd64-011&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;gcc &lt;span class="o"&gt;(&lt;/span&gt;Ubuntu 11.4.0-1ubuntu1~22.04&lt;span class="o"&gt;)&lt;/span&gt; 11.4.0, GNU ld &lt;span class="o"&gt;(&lt;/span&gt;GNU Binutils&amp;gt;
Nov 01 17:15:42 ubuntu kernel: Command line: &lt;span class="nv"&gt;BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/boot/vmlinuz-5.15.0-87-generic &lt;span class="nv"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;LABEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cloudimg-rootfs ro &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tty1 &lt;span class="nv"&gt;console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ttyS0
Nov 01 17:15:42 ubuntu kernel: KERNEL supported cpus:
Nov 01 17:15:42 ubuntu kernel:   Intel GenuineIntel
Nov 01 17:15:42 ubuntu kernel:   AMD AuthenticAMD
Nov 01 17:15:42 ubuntu kernel:   Hygon HygonGenuine
Nov 01 17:15:42 ubuntu kernel: secureboot: Secure boot disabled
Nov 01 17:15:42 ubuntu kernel: SMBIOS 2.5 present.
Nov 01 17:15:42 ubuntu kernel: DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Nov 01 17:15:42 ubuntu kernel: Hypervisor detected: KVM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose a service is failing to start on your Linux system. Using the &lt;code&gt;journalctl&lt;/code&gt; command, you can get detailed log information about the service's attempts to start and any associated error messages.&lt;/p&gt;
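&lt;p&gt;For example, you can filter the journal by unit and severity. The flags below are standard &lt;code&gt;journalctl&lt;/code&gt; options; &lt;code&gt;nginx.service&lt;/code&gt; is a placeholder for whichever unit is failing:&lt;/p&gt;

```shell
# Show logs for one unit since the current boot (nginx.service is a placeholder)
journalctl -u nginx.service -b

# Follow new entries live, showing only messages of priority "err" or worse
journalctl -u nginx.service -f -p err
```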

&lt;h4&gt;
  
  
  nicstat
&lt;/h4&gt;

&lt;p&gt;It offers comprehensive statistics on network interfaces, including data on failures, packets, and bandwidth usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;nicstat
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
01:52:51       lo    0.00    0.00    0.01    0.01   93.01   93.01  0.00   0.00
01:52:51   enp0s3    4.72    0.05    3.33    0.29  1451.0   175.4  0.00   0.00
01:52:51  docker0    0.00    0.68    0.03    0.05   44.40 13692.2  0.00   0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;Time&lt;/code&gt;&lt;/em&gt; - the timestamp of each sample. If a network issue appears suddenly, the timestamps help you correlate it with patterns or potential triggers that led to the problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;Int&lt;/code&gt;&lt;/em&gt; - the name of each network interface. In a multi-interface environment, this lets you pinpoint which interface is causing the problem and resolve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;rKB/s, wKB/s, rPk/s, wPk/s&lt;/code&gt; - the kilobytes and packets read (received) and written (transmitted) per second, showing how much data each interface is actually moving.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;rAvs, wAvs&lt;/code&gt; - the average sizes of the packets read and written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;code&gt;%Util, Sat&lt;/code&gt;&lt;/em&gt; - interface utilization and saturation (errors such as drops per second). Values higher than usual indicate an overloaded interface, which leads to queuing, latency, and lag in the network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
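&lt;p&gt;For continuous monitoring, &lt;code&gt;nicstat&lt;/code&gt; can sample at an interval. The flags below are standard nicstat options; &lt;code&gt;enp0s3&lt;/code&gt; is the interface from the sample output above, so substitute your own:&lt;/p&gt;

```shell
# Print statistics for one interface every second, five times
nicstat -i enp0s3 1 5

# Include TCP and UDP protocol statistics as well
nicstat -t -u 1 5
```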

&lt;h4&gt;
  
  
  lsof
&lt;/h4&gt;

&lt;p&gt;A file is continuously growing in size which was not expected. You need to identify which process is writing into the file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;lsof&lt;/code&gt; command gives a list of files that are opened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lsof

ubuntu@top-gerbil:/&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lsof &lt;span class="nt"&gt;-R&lt;/span&gt;
COMMAND    PID  TID TASKCMD   PPID       USER   FD      TYPE     DEVICE SIZE/OFF   NODE NAME
systemd      1                  0       root  cwd       DIR      8,1     4096       2    /
systemd      1                  0       root  rtd       DIR      8,1     4096       2    /
systemd      1                  0      root  txt       REG      8,1    1849992    3335 /usr/lib/systemd/system
container 3243                  1       root  txt       REG      8,1    52632728   39545 /usr/bin/containerd
container 3243                  1       root  mem-W     REG      8,1    32768      73792 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PID&lt;/code&gt;- the process ID associated with each open file. Once you know which process holds the file open, you can monitor, inspect, or stop it to resolve the issue.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USER&lt;/code&gt;- indicates which user owns the process that has the file open.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FD&lt;/code&gt;- the file descriptor; a trailing &lt;code&gt;w&lt;/code&gt; (write) or &lt;code&gt;u&lt;/code&gt; (read and write) marks processes that can be writing to the file.&lt;/li&gt;
&lt;/ul&gt;
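&lt;p&gt;To answer the original question of which process is writing to the growing file, you can point &lt;code&gt;lsof&lt;/code&gt; directly at it (the path below is a placeholder):&lt;/p&gt;

```shell
# List processes holding the file open; an FD ending in 'w' or 'u' can write to it
sudo lsof /var/log/app/growing.log

# Repeat the listing every 2 seconds to watch the writer over time
sudo lsof -r 2 /var/log/app/growing.log
```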

&lt;h4&gt;
  
  
  pstack
&lt;/h4&gt;

&lt;p&gt;You have an application running on your Linux system that suddenly becomes unresponsive or experiences a segmentation fault. The issue might be related to the application’s call stack.&lt;/p&gt;

&lt;p&gt;Use the &lt;code&gt;pstack&lt;/code&gt; command along with the process ID of the running process or the core dump file generated during the crash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pstack 432

Thread 1 &lt;span class="o"&gt;(&lt;/span&gt;Thread 0x7f7f03600700 &lt;span class="o"&gt;(&lt;/span&gt;LWP 6516&lt;span class="o"&gt;))&lt;/span&gt;:
&lt;span class="c"&gt;#0  0x00007f7f0576b9d5 in poll () from /lib64/libc.so.6&lt;/span&gt;
&lt;span class="c"&gt;#1  0x00007f7f06f47b36 in ?? () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#2  0x00007f7f06f47c1a in g_main_context_iteration () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#3  0x00007f7f073d587d in ?? () from /usr/lib64/libgio-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#4  0x00007f7f06f6f16d in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#5  0x00007f7f07471d7a in ?? () from /usr/lib64/libgio-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#6  0x00007f7f06f1e82f in ?? () from /usr/lib64/libglib-2.0.so.0&lt;/span&gt;
&lt;span class="c"&gt;#7  0x00007f7f0692fdd5 in start_thread () from /lib64/libpthread.so.0&lt;/span&gt;
&lt;span class="c"&gt;#8  0x00007f7f0577703d in clone () from /lib64/libc.so.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It provides you with the stack trace of the application, displaying the function calls and corresponding memory addresses at the time of the crash. By analyzing the stack trace, you can identify the specific function or module causing the issue and gain insights into the application's behavior leading up to the crash.&lt;/p&gt;
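&lt;p&gt;Note that &lt;code&gt;pstack&lt;/code&gt; is not installed by default on many distributions. Where it is unavailable, an equivalent stack dump can be obtained with &lt;code&gt;gdb&lt;/code&gt; (PID 432 is the example process from above):&lt;/p&gt;

```shell
# Attach non-interactively, print backtraces for all threads, then detach
gdb -p 432 -batch -ex 'thread apply all bt'
```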

&lt;h4&gt;
  
  
  strace
&lt;/h4&gt;

&lt;p&gt;It intercepts and records all system calls made by a process and the signals the process receives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;strace &lt;span class="nt"&gt;-p&lt;/span&gt; 5647

openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/proc/self/mountinfo"&lt;/span&gt;, O_RDONLY&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;""&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFREG|0444, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, ...&lt;span class="o"&gt;}&lt;/span&gt;, AT_EMPTY_PATH&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;"23 29 0:21 / /sys rw,nosuid,node"&lt;/span&gt;..., 1024&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1024
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;"rmware/efi/efivars rw,nosuid,nod"&lt;/span&gt;..., 1024&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1024
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;"re20/2015 ro,nodev,relatime shar"&lt;/span&gt;..., 1024&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 973
&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;3, &lt;span class="s2"&gt;""&lt;/span&gt;, 1024&lt;span class="o"&gt;)&lt;/span&gt;                       &lt;span class="o"&gt;=&lt;/span&gt; 0
lseek&lt;span class="o"&gt;(&lt;/span&gt;3, 0, SEEK_CUR&lt;span class="o"&gt;)&lt;/span&gt;                   &lt;span class="o"&gt;=&lt;/span&gt; 3021
close&lt;span class="o"&gt;(&lt;/span&gt;3&lt;span class="o"&gt;)&lt;/span&gt;                                &lt;span class="o"&gt;=&lt;/span&gt; 0
ioctl&lt;span class="o"&gt;(&lt;/span&gt;1, TCGETS, &lt;span class="o"&gt;{&lt;/span&gt;B38400 opost isig icanon &lt;span class="nb"&gt;echo&lt;/span&gt; ...&lt;span class="o"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/run"&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFDIR|0755, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;980, ...&lt;span class="o"&gt;}&lt;/span&gt;, 0&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/"&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFDIR|0755, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4096, ...&lt;span class="o"&gt;}&lt;/span&gt;, 0&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
newfstatat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/sys/kernel/security"&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;st_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;S_IFDIR|0755, &lt;span class="nv"&gt;st_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, ...&lt;span class="o"&gt;}&lt;/span&gt;, 0&lt;span class="o"&gt;)=&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;open&lt;/code&gt;/&lt;code&gt;openat&lt;/code&gt; system calls - identify problems related to the accessibility of a file. By examining the file paths referenced in these calls, you can determine whether any file-related issues are contributing to the application's failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt; system calls - show the data read from or written to specific files or resources. If there are issues with reading or writing data, these calls can pinpoint where the problem lies, such as incorrect data handling or file manipulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;errno&lt;/code&gt;- the error code a failed system call returns, which strace prints alongside the return value. Analyze the type of error encountered, such as &lt;code&gt;ENOENT&lt;/code&gt; (file not found), &lt;code&gt;EACCES&lt;/code&gt; (permission denied), or &lt;code&gt;EINVAL&lt;/code&gt; (invalid argument).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
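&lt;p&gt;In practice, you usually narrow &lt;code&gt;strace&lt;/code&gt; to the calls of interest instead of capturing everything. For instance (PID 5647 is the example process from above):&lt;/p&gt;

```shell
# Trace only file-related system calls of a running process
strace -p 5647 -e trace=openat,read,write,close

# Summarize call counts, errors, and time spent instead of printing a raw stream
strace -c -p 5647
```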

&lt;h4&gt;
  
  
  eBPF Performance Tools
&lt;/h4&gt;

&lt;p&gt;There's so much buzz in the market about BPF. Big companies like Meta and Amazon are using it. But what exactly is BPF? It stands for Berkeley Packet Filter, a mechanism originally used for filtering and monitoring network traffic. It has since evolved into extended BPF (eBPF), which is like a magic wand for Linux: it lets you safely run small programs inside the kernel to trace complex issues.&lt;/p&gt;

&lt;p&gt;Brendan Gregg, formerly a performance engineer at Netflix, shares his troubleshooting expertise in his book &lt;strong&gt;&lt;a href="https://www.brendangregg.com/bpf-performance-tools-book.html" rel="noopener noreferrer"&gt;BPF Performance Tools&lt;/a&gt;&lt;/strong&gt;, which is a must-read. He breaks performance down into different domains, offering practical examples of BPF in action using BCC and bpftrace.&lt;/p&gt;
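&lt;p&gt;As a small taste, a classic &lt;code&gt;bpftrace&lt;/code&gt; one-liner counts system calls per process until you press Ctrl-C. This assumes bpftrace is installed and requires root on a reasonably recent kernel:&lt;/p&gt;

```shell
# Count syscalls by process name; the map is printed when the program exits
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```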

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, we saw that it is imperative to troubleshoot using a methodical approach, as it offers an organized and systematic means of locating issues and resolving them quickly and effectively. Without a systematic method, troubleshooting becomes disorganized and time-consuming, frequently resulting in frustration, trial-and-error fixes, and, in some situations, worse problems.&lt;/p&gt;

&lt;p&gt;Tired of sifting through convoluted outputs? I highly recommend exploring the alternative tools suggested by Julia Evans in her article &lt;a href="https://jvns.ca/blog/2022/04/12/a-list-of-new-ish--command-line-tools/" rel="noopener noreferrer"&gt;A list of new(ish) command line tools&lt;/a&gt; to optimize your workflow. For instance, &lt;code&gt;angle-grinder&lt;/code&gt; can outperform traditional log-analysis pipelines built on &lt;code&gt;grep&lt;/code&gt;, with precise and efficient results.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=Wb_vD3XZYOA" rel="noopener noreferrer"&gt;official documentary&lt;/a&gt; on the groundbreaking &lt;strong&gt;eBPF technology&lt;/strong&gt;, highlighting its impact on the Linux Kernel and its journey of development with key industry players, including Meta, Intel, Isovalent, Google, Red Hat, and Netflix.&lt;/p&gt;

&lt;p&gt;If you are stuck with the Linux issue and looking for SREs to troubleshoot, &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; for quick support.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Multi-tenancy in Kubernetes using Vcluster</title>
      <dc:creator>Pavan Shiraguppi</dc:creator>
      <pubDate>Thu, 24 Aug 2023 09:40:18 +0000</pubDate>
      <link>https://forem.com/cloudraft/multi-tenancy-in-kubernetes-using-vcluster-2ib9</link>
      <guid>https://forem.com/cloudraft/multi-tenancy-in-kubernetes-using-vcluster-2ib9</guid>
      <description>&lt;p&gt;Kubernetes has revolutionized how organizations deploy and manage containerized applications, making it easier to orchestrate and scale applications across clusters. However, running multiple heterogeneous workloads on a shared Kubernetes cluster comes with challenges like resource contention, security risks, lack of customization, and complex management.&lt;/p&gt;

&lt;p&gt;There are several approaches to implementing isolation and multi-tenancy within Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes namespaces&lt;/strong&gt;: Namespaces allow some isolation by dividing cluster resources between different users. However, namespaces share the same physical infrastructure and kernel resources. So there are limits to isolation and customization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes distributions&lt;/strong&gt;: Popular Kubernetes distributions like &lt;a href="https://www.redhat.com/en/technologies/cloud-computing/openshift" rel="noopener noreferrer"&gt;Red Hat OpenShift&lt;/a&gt; and &lt;a href="https://www.rancher.com/" rel="noopener noreferrer"&gt;Rancher&lt;/a&gt; support virtual clusters. These leverage Kubernetes-native capabilities like namespaces, RBAC, and network policies more efficiently. Other benefits include centralized control planes, pre-configured cluster templates, and easy-to-use management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical namespaces&lt;/strong&gt;: In a traditional Kubernetes cluster, each namespace is independent of the others. This means that users and applications in one namespace cannot access resources in another namespace unless they have explicit permissions. Hierarchical namespaces solve this problem by allowing you to define a parent-child relationship between namespaces. This means that a user or application with permissions in the parent namespace will automatically have permissions in all of the child namespaces. This makes it much easier to manage permissions across multiple namespaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vcluster project&lt;/strong&gt;: The virtual cluster (vcluster) project addresses these pain points by dividing a physical Kubernetes cluster into multiple isolated software-defined clusters. vcluster allows organizations to provide development teams, applications, and customers with dedicated Kubernetes environments with guaranteed resources, security policies, and custom configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post will dive deep into vcluster: its capabilities, different implementation options, use cases, and challenges. We will also look into best practices for maximizing utilization and simplifying the management of vcluster.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Vcluster?
&lt;/h1&gt;

&lt;p&gt;vcluster is an open-source tool that allows you to create and manage virtual Kubernetes clusters. A virtual Kubernetes cluster is a fully functional Kubernetes cluster that runs on top of another Kubernetes cluster. vcluster works by creating a virtual cluster inside a namespace of the underlying Kubernetes cluster. The virtual cluster has its own control plane, but it shares the worker nodes and networking of the underlying cluster. This makes vcluster a lightweight solution that can be deployed on any Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;When you create a vcluster, the vcluster CLI deploys the virtual cluster's control plane as pods inside a namespace of the host cluster. You can then deploy workloads to the virtual cluster using the kubectl CLI; vcluster's syncer copies the resulting pods down to the underlying cluster, where they run on the shared worker nodes.&lt;/p&gt;

&lt;p&gt;You can learn more about vcluster on the vcluster &lt;a href="https://vcluster.com" rel="noopener noreferrer"&gt;website&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benefits of Using Vcluster
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Resource Isolation
&lt;/h2&gt;

&lt;p&gt;vcluster allows you to allocate a portion of the central cluster's resources like CPU, memory, and storage to individual virtual clusters. This prevents noisy neighbor issues when multiple teams share the same physical cluster. Critical workloads can be assured of the resources they need without interference.&lt;/p&gt;
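&lt;p&gt;One way to sketch this at the host level is a standard Kubernetes ResourceQuota on the namespace that hosts the vcluster. The example below is illustrative: the quota name and limits are made up, and the namespace assumes vcluster's default naming for a vcluster called my-first-vcluster:&lt;/p&gt;

```shell
# Cap the CPU and memory available to everything synced into the vcluster's
# host namespace (quota name, namespace, and numbers are illustrative)
kubectl create quota team-a-quota \
  --hard=requests.cpu=4,requests.memory=8Gi,limits.cpu=8,limits.memory=16Gi \
  -n vcluster-my-first-vcluster
```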

&lt;h2&gt;
  
  
  Access Control
&lt;/h2&gt;

&lt;p&gt;With vcluster, access policies can be implemented at the virtual cluster level, ensuring only authorized users have access. For example, sensitive workloads like financial applications can run in an isolated vcluster. Restricting access is much simpler compared to namespace-level policies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F839d230e5e7af9a310459ea7ae559f9bf81dcef4%2Fded1f%2Fdocs%2Fmedia%2Fdiagrams%2Fvcluster-architecture.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F839d230e5e7af9a310459ea7ae559f9bf81dcef4%2Fded1f%2Fdocs%2Fmedia%2Fdiagrams%2Fvcluster-architecture.svg" alt="vcluster architecture"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.vcluster.com/docs/architecture/basics" rel="noopener noreferrer"&gt;Basics | vcluster docs | Virtual Clusters for&lt;br&gt;
Kubernetes&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Customization
&lt;/h2&gt;

&lt;p&gt;vcluster allows extensive customization for individual teams' needs - different Kubernetes versions, network policies, ingress rules, and resource quotas can be defined. Developers can have permission to modify their vcluster without impacting others.&lt;/p&gt;
&lt;h2&gt;
  
  
  Multitenancy
&lt;/h2&gt;

&lt;p&gt;Organizations often need to provide Kubernetes access to multiple internal teams or external customers. vcluster makes multi-tenancy easy to implement by creating separate isolated environments in the same physical cluster. Refer to this article for more information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frafay.co%2Fwp-content%2Fuploads%2F2023%2F03%2F1674585426944-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frafay.co%2Fwp-content%2Fuploads%2F2023%2F03%2F1674585426944-1.png" alt="vcluster architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://rafay.co/the-kubernetes-current/key-considerations-when-implementing-virtual-kubernetes-clusters/" rel="noopener noreferrer"&gt;Implementing Virtual Kubernetes Clusters | Rafay&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Easy Scaling
&lt;/h2&gt;

&lt;p&gt;Additional vcluster can be quickly spun up or down to handle dynamic workloads and scale requirements. New development and testing environments can be provisioned instantly without having to scale the entire physical cluster.&lt;/p&gt;
&lt;h1&gt;
  
  
  Workload Isolation Approaches Before vcluster
&lt;/h1&gt;

&lt;p&gt;Organizations have leveraged various Kubernetes native features to enable some workload isolation before virtual clusters emerged as a solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespaces&lt;/strong&gt; - Namespaces segregate cluster resources between different teams or applications. They provide basic isolation via resource quotas and network policies. However, there is no hypervisor-level isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Policies&lt;/strong&gt; - Granular network policies restrict communication between pods and namespaces. This creates network segmentation between workloads. However, resource contention can still occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Taints and Tolerations&lt;/strong&gt; - Applying taints to nodes prevents specified pods from scheduling onto them. Pods must have matching tolerations to be scheduled on tainted nodes. This enables restricting pods to certain nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Virtual Networks&lt;/strong&gt; - On public clouds, using multiple virtual networks helps isolate Kubernetes cluster traffic. But pods within a cluster can still communicate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-Party Network Plugins&lt;/strong&gt; - CNI plugins like Calico, Weave, and Cilium enable building overlay networks and fine-grained network policies to segregate traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Controllers&lt;/strong&gt; - Developing custom Kubernetes controllers allows programmatically isolating resources. But this requires significant programming expertise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Demo of vcluster
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Install vcluster CLI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl (check via kubectl version)&lt;/li&gt;
&lt;li&gt;helm v3 (check with helm version)&lt;/li&gt;
&lt;li&gt;a working kube-context with access to a Kubernetes cluster (check with kubectl get namespaces)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the following command to download the vcluster CLI binary for arm64-based Ubuntu machines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; vcluster &lt;span class="s2"&gt;"https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-arm64"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 0755 vcluster /usr/local/bin &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To confirm that vcluster CLI is successfully installed, test via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For installations on other machines, please refer to the following link.&lt;br&gt;
&lt;a href="https://www.vcluster.com/docs/getting-started/setup" rel="noopener noreferrer"&gt;Install vcluster CLI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy vcluster
&lt;/h2&gt;

&lt;p&gt;Let's create a virtual cluster my-first-vcluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster create my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connect to the vcluster
&lt;/h2&gt;

&lt;p&gt;To connect to the vcluster, enter the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster connect my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the kubectl command to get the namespaces in the connected vcluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy an application to the vcluster
&lt;/h2&gt;

&lt;p&gt;Now let's deploy a sample nginx deployment inside the vcluster. To create a deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace demo-nginx
kubectl create deployment nginx-deployment &lt;span class="nt"&gt;-n&lt;/span&gt; demo-nginx &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will isolate the application in a namespace demo-nginx inside the vcluster.&lt;/p&gt;

&lt;p&gt;You can check that this demo deployment will create pods inside the vcluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; demo-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check deployments from the host cluster
&lt;/h2&gt;

&lt;p&gt;Now that we have confirmed the deployments in the vcluster, let us now try to check the deployments from the host cluster.&lt;/p&gt;

&lt;p&gt;To disconnect from the vcluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster disconnect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will move the kube context back to the host cluster. Now let us check if there are any deployments available in the host cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get deployments &lt;span class="nt"&gt;-n&lt;/span&gt; vcluster-my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There will be no deployments found in the &lt;em&gt;vcluster-my-first-vcluster&lt;/em&gt; namespace. This is because the deployment object exists only inside the vcluster's own API server and is not visible from the host cluster.&lt;/p&gt;

&lt;p&gt;Now let us check whether any pods are running in the vcluster's host namespace using the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; vcluster-my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We can now see that the nginx pod is running in the &lt;em&gt;vcluster-my-first-vcluster&lt;/em&gt; namespace. While higher-level objects such as deployments stay inside the virtual cluster, the pods themselves are synced to the host cluster so its scheduler can place them on the shared worker nodes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Vcluster Use Cases
&lt;/h1&gt;

&lt;p&gt;Virtual clusters enable several important use cases by providing isolated and customizable Kubernetes environments within a single physical cluster. Let's explore some of these in more detail:&lt;/p&gt;

&lt;h2&gt;
  
  
  Development and Testing Environments
&lt;/h2&gt;

&lt;p&gt;Allocating dedicated virtual clusters for developer teams allows them to fully control the configuration without affecting production workloads or other developers.&lt;br&gt;
Teams can customize their vclusters with required Kubernetes versions, network policies, resource quotas, and access controls. Development teams can rapidly spin up and tear down vclusters to test different configurations.&lt;br&gt;
Since vclusters provide guaranteed compute and storage resources, developers don't have to compete. They also won't impact the performance of applications running in other vclusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Application Isolation
&lt;/h2&gt;

&lt;p&gt;Enterprise applications like ERP, CRM, and financial systems require predictable performance, high availability, and strict security. Dedicated vclusters allow these production workloads to operate unaffected by other applications.&lt;br&gt;
Mission-critical applications can be allocated reserved capacity to avoid resource contention. Custom network policies guarantee isolation. Vclusters also allow granular role-based access control to meet regulatory compliance needs.&lt;br&gt;
Rather than overprovisioning large clusters to avoid interference, vclusters provide guaranteed resources at a lower cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multitenancy
&lt;/h2&gt;

&lt;p&gt;Service providers and enterprises with multiple business units often need to securely provide Kubernetes access to different internal teams or external customers.&lt;br&gt;
vclusters simplify multi-tenancy by creating separate self-service environments for each tenant with appropriate resource limits and access policies applied. Providers can easily onboard new customers by spinning up additional vclusters.&lt;br&gt;
This removes noisy neighbor issues and allows a high density of workloads by packing vclusters according to actual usage rather than peak needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regulatory Compliance
&lt;/h2&gt;

&lt;p&gt;Heavily regulated industries like finance and healthcare have strict security and compliance requirements around data privacy, geography, and access controls.&lt;br&gt;
Dedicated vclusters with internal network segmentation, role-based access control, and resource isolation make it easier to host compliant workloads safely alongside other applications in the same cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporary Resources
&lt;/h2&gt;

&lt;p&gt;vclusters allow instantly spinning up temporary Kubernetes environments to handle use cases like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testing cluster upgrades&lt;/strong&gt; - New Kubernetes versions can be deployed to lower environments with no downtime or impact on production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluating new applications&lt;/strong&gt; - Applications can be deployed into disposable vclusters instead of shared dev clusters to prevent conflicts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity spikes&lt;/strong&gt; - New vclusters provide burst capacity for traffic spikes versus overprovisioning the entire cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special events&lt;/strong&gt; - vClusters can be created temporarily for workshops, conferences, and other events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the need is over, these vclusters can simply be deleted with no lasting footprint on the cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workload Consolidation
&lt;/h2&gt;

&lt;p&gt;As organizations scale their Kubernetes footprint, there is a need to consolidate multiple clusters onto shared infrastructure without interfering with existing applications.&lt;br&gt;
Migrating applications into vclusters provides logical isolation and customization allowing them to run seamlessly alongside other workloads. This improves utilization and reduces operational overhead.&lt;br&gt;
vclusters allow enterprise IT to provide a consistent Kubernetes platform across the organization while preserving isolation.&lt;br&gt;
In summary, vclusters are an essential tool for optimizing Kubernetes environments via workload isolation, customization, security, and density. The use cases highlight how they benefit diverse needs from developers to Ops to business units within an organization.&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenges with vclusters
&lt;/h1&gt;

&lt;p&gt;While vclusters deliver significant benefits, there are some downsides to weigh:&lt;/p&gt;

&lt;h2&gt;
  
  
  Complexity
&lt;/h2&gt;

&lt;p&gt;Managing multiple virtual clusters, albeit smaller ones, introduces more operational overhead compared to a single large Kubernetes cluster.&lt;br&gt;
Additional tasks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provisioning and configuring multiple control planes&lt;/li&gt;
&lt;li&gt;Applying security policies and access controls consistently across vclusters&lt;/li&gt;
&lt;li&gt;Monitoring and logging across vclusters&lt;/li&gt;
&lt;li&gt;Maintaining designated resources and capacity for each vcluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a cluster administrator has to configure and update RBAC policies across 20 vclusters rather than a single cluster. This takes more effort than the centralized management of a single cluster. Statically assigned IP addresses and ports can also conflict across vclusters if they are not managed carefully.&lt;/p&gt;
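&lt;p&gt;As a concrete illustration, a read-only role like the one below (names hypothetical) would have to be created and kept in sync inside every vcluster:&lt;/p&gt;

```yaml
# Hypothetical read-only role that must be replicated per vcluster
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-viewer
  namespace: default
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-viewer-binding
  namespace: default
subjects:
  - kind: Group
    name: dev-team              # assumed group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-viewer
  apiGroup: rbac.authorization.k8s.io
```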

&lt;h2&gt;
  
  
  Resource allocation and management
&lt;/h2&gt;

&lt;p&gt;Balancing the resource consumption and performance of vclusters can be tricky, as they may have different demands or expectations.&lt;/p&gt;

&lt;p&gt;For example, vclusters may need to scale up or down depending on the workload or share resources with other vclusters or namespaces. A vcluster sized for an application's peak demand may have excess unused capacity during non-peak periods that sits idle and cannot be leveraged by other vclusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limited Customization
&lt;/h2&gt;

&lt;p&gt;The ability to customize vclusters varies across implementations. Namespaces offer the least flexibility, while Cluster API provides the most. Tools like OpenShift balance customization with simplicity.&lt;br&gt;
For example, namespaces cannot run different Kubernetes versions or network plugins. The Cluster API allows full customization but with more complexity.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Vcluster empowers Kubernetes users to customize, isolate and scale workloads within a shared physical cluster. By allocating dedicated control plane resources and access policies, vclusters provide strong technical isolation. For use cases like multitenancy, vclusters deliver simplified and more secure Kubernetes management.&lt;/p&gt;

&lt;p&gt;Vclusters can also reduce Kubernetes cost overhead and power ephemeral environments.&lt;br&gt;
Tools like OpenShift, Rancher, and Kubernetes Cluster API make deploying and managing vclusters much easier. As adoption increases, we can expect more innovations in the vcluster space to further simplify operations and maximize utilization. While vclusters have some drawbacks, for many organizations the benefits outweigh the added complexity.&lt;/p&gt;

&lt;p&gt;We are working on some exciting projects using vcluster to build a large-scale system. Feel free to &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; to discuss how to use vcluster for your use case.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>vcluster</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Deploy LLM on Kubernetes using OpenLLM</title>
      <dc:creator>Pavan Shiraguppi</dc:creator>
      <pubDate>Wed, 16 Aug 2023 06:32:17 +0000</pubDate>
      <link>https://forem.com/cloudraft/deploy-llms-on-kubernetes-using-openllm-3g9c</link>
      <guid>https://forem.com/cloudraft/deploy-llms-on-kubernetes-using-openllm-3g9c</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Natural Language Processing (NLP) has evolved significantly, with Large Language Models (LLMs) at the forefront of cutting-edge applications. Their ability to understand and generate human-like text has revolutionized various industries. Deploying and testing these LLMs effectively is crucial for harnessing their capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bentoml/OpenLLM" rel="noopener noreferrer"&gt;OpenLLM&lt;/a&gt; is an open-source platform for operating large language models (LLMs) in production. It allows you to run inference on any open-source LLMs, fine-tune them, deploy, and build powerful AI apps with ease.&lt;/p&gt;

&lt;p&gt;This blog post explores the deployment of LLM models using the OpenLLM framework on a Kubernetes infrastructure. For the demo, I am using a hardware setup consisting of an RTX 3060 GPU and an Intel i7 12700K processor, and we delve into the technical aspects of achieving optimal performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Environment Setup and Kubernetes Configuration
&lt;/h1&gt;

&lt;p&gt;Before diving into LLM deployment on Kubernetes, we need to ensure the environment is set up correctly and the Kubernetes cluster is ready for action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the Kubernetes Cluster
&lt;/h2&gt;

&lt;p&gt;Setting up a Kubernetes cluster requires defining worker nodes, networking, and orchestrators. Ensure you have Kubernetes installed and a cluster configured. This can be achieved through tools like &lt;code&gt;kubeadm&lt;/code&gt;, &lt;code&gt;minikube&lt;/code&gt;, kind or managed services such as Google Kubernetes Engine (GKE) and Amazon EKS.&lt;/p&gt;

&lt;p&gt;If you are using kind, you can create a cluster as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind create cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing Dependencies and Resources
&lt;/h2&gt;

&lt;p&gt;Within the cluster, install essential dependencies such as NVIDIA GPU drivers, CUDA libraries, and Kubernetes GPU support. These components are crucial for enabling GPU acceleration and maximizing LLM performance.&lt;/p&gt;

&lt;p&gt;To use CUDA on your system, you will need the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A CUDA-capable GPU&lt;/li&gt;
&lt;li&gt;A supported version of Linux with a gcc compiler and toolchain&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/cuda-downloads" rel="noopener noreferrer"&gt;CUDA Toolkit 12.2 at NVIDIA Developer portal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Using OpenLLM to Containerize and Load Models
&lt;/h1&gt;

&lt;h2&gt;
  
  
  OpenLLM
&lt;/h2&gt;

&lt;p&gt;OpenLLM supports a wide range of state-of-the-art LLMs, including Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. It also provides flexible APIs that allow you to serve LLMs over a RESTful API or gRPC with one command, or query them via the web UI, the CLI, the built-in Python/JavaScript clients, or any HTTP client.&lt;/p&gt;

&lt;p&gt;Some of the key features of OpenLLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for a wide range of state-of-the-art LLMs&lt;/li&gt;
&lt;li&gt;Flexible APIs for serving LLMs&lt;/li&gt;
&lt;li&gt;Integration with other powerful tools&lt;/li&gt;
&lt;li&gt;Easy to use&lt;/li&gt;
&lt;li&gt;Open-source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To use OpenLLM, you need to have Python 3.8 (or newer) and &lt;code&gt;pip&lt;/code&gt; installed on your system. We highly recommend using a Virtual Environment (like conda) to prevent package conflicts.&lt;/p&gt;

&lt;p&gt;You can install OpenLLM using pip as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify if it's installed correctly, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start an LLM server, for example, to start an Open Pre-trained transformer model aka &lt;a href="https://huggingface.co/docs/transformers/model_doc/opt" rel="noopener noreferrer"&gt;OPT&lt;/a&gt; server, do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm start opt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Selecting the LLM Model
&lt;/h2&gt;

&lt;p&gt;The OpenLLM framework supports various pre-trained LLM models like GPT-3, GPT-2, and BERT. When selecting a large language model (LLM) for your application, the main factors to consider are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt; - Larger models like GPT-3 have more parameters and can handle more complex tasks, while smaller ones like GPT-2 are better for simpler use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; - Models optimized for generative AI like GPT-3 or understanding (e.g. BERT) align with different use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data&lt;/strong&gt; - More high-quality, diverse data leads to better generalization capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning&lt;/strong&gt; - Pre-trained models can be further trained on domain-specific data to improve performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment with use case&lt;/strong&gt; - Validate potential models on your specific application and data to ensure the right balance of complexity and capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal LLM matches your needs in terms of complexity, data requirements, compute resources, and overall capability. Thoroughly evaluate options to select the best fit. For this demo, we will be using the Dolly-2 model with 3B parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading the Chosen Model within a Container
&lt;/h2&gt;

&lt;p&gt;Containerization enhances reproducibility and portability. Package your LLM model, OpenLLM dependencies, and other relevant libraries within a Docker container. This ensures a consistent runtime environment across different deployments.&lt;/p&gt;

&lt;p&gt;With OpenLLM, you can easily build a Bento for a specific model, like &lt;code&gt;dolly-v2-3b&lt;/code&gt;, using the &lt;code&gt;build&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm build dolly-v2 &lt;span class="nt"&gt;--model-id&lt;/span&gt; databricks/dolly-v2-3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this demo, we are using BentoML, an MLOps platform from the organization behind the OpenLLM project. A &lt;a href="https://docs.bentoml.com/en/latest/concepts/bento.html#what-is-a-bento" rel="noopener noreferrer"&gt;Bento&lt;/a&gt;, in BentoML, is the unit of distribution. It packages your program's source code, models, files, artifacts, and dependencies.&lt;/p&gt;

&lt;p&gt;To containerize your Bento, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bentoml containerize &amp;lt;name:version&amp;gt; &lt;span class="nt"&gt;-t&lt;/span&gt; dolly-v2-3b:latest &lt;span class="nt"&gt;--opt&lt;/span&gt; &lt;span class="nv"&gt;progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates an OCI-compliant Docker image that can be deployed anywhere Docker runs.&lt;/p&gt;

&lt;p&gt;You will be able to locate the Docker image in &lt;code&gt;$BENTO_HOME/bentos/stabilityai-stablelm-tuned-alpha-3b-service/$id/env/docker&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Model Inference and High Scalability using Kubernetes
&lt;/h1&gt;

&lt;p&gt;Executing model inference efficiently and scaling up when needed are key factors in a Kubernetes-based LLM deployment. The reliability and scalability features of Kubernetes help scale the model efficiently for the production use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running LLM Model Inference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pod Communication&lt;/strong&gt;: Set up communication protocols within pods to manage model input and output. This can involve RESTful APIs or gRPC-based communication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenLLM runs an HTTP server on port 3000 by default. We can have a deployment file as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-3b:latest&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;: We assume the image is available locally as &lt;code&gt;dolly-v2-3b:latest&lt;/code&gt;. If the image has been pushed to a registry, remove the &lt;code&gt;imagePullPolicy&lt;/code&gt; line and, for a private registry, provide the registry credentials as an image pull secret.&lt;/p&gt;
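&lt;p&gt;Also note that the deployment above does not request a GPU. If the NVIDIA device plugin is installed in the cluster, the container spec can be extended with a GPU limit; the fragment below is a sketch under that assumption:&lt;/p&gt;

```yaml
# Fragment of the container spec: request one GPU
# (assumes the NVIDIA device plugin is installed in the cluster)
resources:
  limits:
    nvidia.com/gpu: 1   # schedules the pod onto a GPU node
```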

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Service&lt;/strong&gt;: Expose the deployment using services to distribute incoming inference requests evenly among multiple pods.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We set up a &lt;code&gt;LoadBalancer&lt;/code&gt;-type Service in our Kubernetes cluster, exposed on port 80. If you are using an Ingress, use &lt;code&gt;ClusterIP&lt;/code&gt; instead of &lt;code&gt;LoadBalancer&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Horizontal Scaling and Autoscaling&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling (HPA)&lt;/strong&gt;: Configure HPAs to automatically adjust the number of pods based on CPU or custom metrics. This ensures optimal resource utilization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can declare an HPA manifest for CPU-based scaling as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;targetCPUUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For GPU-based scaling, first gather GPU metrics in Kubernetes by following this blog to install the DCGM exporter: &lt;a href="https://iamajayr.medium.com/kubernetes-hpa-using-gpu-metrics-e366ddbfedb7" rel="noopener noreferrer"&gt;Kubernetes HPA using GPU metrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After installing the DCGM exporter, we can create an HPA for GPU memory utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Object&lt;/span&gt;
      &lt;span class="na"&gt;object&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt; &lt;span class="c1"&gt;# kubectl get svc | grep dcgm&lt;/span&gt;
        &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DCGM_FI_DEV_MEM_COPY_UTIL&lt;/span&gt;
        &lt;span class="na"&gt;targetValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
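&lt;p&gt;Note that the &lt;code&gt;autoscaling/v2beta1&lt;/code&gt; API used above was removed in Kubernetes 1.25. On newer clusters, roughly the same HPA can be expressed with &lt;code&gt;autoscaling/v2&lt;/code&gt;; this is a sketch using the same service and metric names:&lt;/p&gt;

```yaml
# Equivalent GPU-memory HPA using the stable autoscaling/v2 API (Kubernetes 1.23+)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dolly-v2-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dolly-v2-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: v1
          kind: Service
          name: dolly-v2-deployment   # kubectl get svc | grep dcgm
        metric:
          name: DCGM_FI_DEV_MEM_COPY_UTIL
        target:
          type: Value
          value: "80"
```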



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaling&lt;/strong&gt;: Enable cluster-level autoscaling to manage resource availability across multiple nodes, accommodating varying workloads. Here are the key steps to configure cluster autoscaling in Kubernetes:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Install the Cluster Autoscaler plugin:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes/autoscaler/releases/download/v1.20.0/cluster-autoscaler-component.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Configure auto scaling by setting min/max nodes in your cluster config.&lt;/li&gt;
&lt;li&gt;Annotate node groups you want to scale automatically:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl annotate node POOL_NAME cluster-autoscaler.kubernetes.io/safe-to-evict&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deploy an autoscaling-enabled application, such as an HPA-backed deployment. The autoscaler scales the node pool when pods are unschedulable.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure auto scaling parameters as needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjust scale-up/down delays with &lt;code&gt;--scale-down-delay&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set scale-down unneeded time with &lt;code&gt;--scale-down-unneeded-time&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Limit scale speed with &lt;code&gt;--max-node-provision-time&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Monitor your cluster autoscaling events:&lt;br&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events | &lt;span class="nb"&gt;grep &lt;/span&gt;ClusterAutoscaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Performance Analysis of LLMs in a Kubernetes Environment
&lt;/h1&gt;

&lt;p&gt;Evaluating the performance of LLM deployment within a Kubernetes environment involves latency measurement and resource utilization assessment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency Evaluation
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measuring Latency&lt;/strong&gt;: Use tools like &lt;code&gt;kubectl exec&lt;/code&gt; or custom scripts to measure the time it takes for a pod to process an input prompt and generate a response. Refer to the Python script below to measure latency on the GPU.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A Python program to measure latency and tokens per second:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;

&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;databricks/dolly-v2-3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sample text for benchmarking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enable_timing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enable_timing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Start timer
&lt;/span&gt;    &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Model inference
&lt;/span&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;
    &lt;span class="c1"&gt;# End timer
&lt;/span&gt;    &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Sync and get time
&lt;/span&gt;    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elapsed_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate TPS
&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;tps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Calculate latency
&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="c1"&gt;# in ms
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg TPS: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tps&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Comparing Latency using Aviary&lt;/strong&gt;: &lt;a href="https://aviary.anyscale.com/" rel="noopener noreferrer"&gt;Aviary&lt;/a&gt; lets you compare latency and throughput across different LLMs and serving backends. It is easy to use, which makes it a practical choice both for developers getting started with LLMs and for those tuning the performance and scalability of LLM-based applications.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resource Utilization and Scalability Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Resource Consumption&lt;/strong&gt;: Utilize Kubernetes dashboard or monitoring tools like Prometheus and Grafana to observe resource usage patterns across pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Analysis&lt;/strong&gt;: Analyze how Kubernetes dynamically adjusts resources based on demand, ensuring resource efficiency and application responsiveness.&lt;/li&gt;
&lt;/ol&gt;
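&lt;p&gt;As a rough sketch of the first point, resource usage can also be pulled programmatically from Prometheus' HTTP API rather than only viewed in Grafana. The server URL and PromQL expression below are illustrative assumptions; adjust them for your cluster, namespace, and metric names.&lt;/p&gt;

```python
# Hypothetical sketch: building an instant-query URL for Prometheus' HTTP API.
# The base URL and PromQL below are assumptions, not values from this article.
from urllib.parse import urlencode

def build_prometheus_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urlencode({"query": promql})

# Example PromQL: CPU usage per pod in an assumed 'llm' namespace over 5 minutes.
cpu_by_pod = 'sum(rate(container_cpu_usage_seconds_total{namespace="llm"}[5m])) by (pod)'
url = build_prometheus_query_url("http://prometheus.monitoring:9090", cpu_by_pod)
```

&lt;p&gt;The returned URL can then be fetched with &lt;code&gt;urllib.request.urlopen&lt;/code&gt; and the JSON result inspected, graphed, or alerted on.&lt;/p&gt;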

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This in-depth technical analysis demonstrates the value of running LLM deployments on Kubernetes. By combining GPU acceleration, specialized libraries, and Kubernetes orchestration capabilities, LLMs can be deployed at scale with significantly improved performance. In particular, GPU-enabled pods achieved over 2x lower latency and nearly double the inference throughput compared to CPU-only variants. Kubernetes autoscaling also allowed pods to scale horizontally on demand, so query volumes could grow without compromising responsiveness.&lt;/p&gt;

&lt;p&gt;Overall, these results make a strong case for Kubernetes as a platform for deploying LLMs at scale. The synergy between software and hardware optimization on Kubernetes unlocks the potential of LLMs for real-world NLP use cases.&lt;/p&gt;

&lt;p&gt;If you are looking for help running LLMs on Kubernetes, we would love to hear how you are scaling them. Please &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; to discuss your specific problem statement.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>Running Containers in Azure</title>
      <dc:creator>Asmi-KR</dc:creator>
      <pubDate>Tue, 11 Jul 2023 07:44:04 +0000</pubDate>
      <link>https://forem.com/cloudraft/running-containers-in-azure-30ag</link>
      <guid>https://forem.com/cloudraft/running-containers-in-azure-30ag</guid>
      <description>&lt;p&gt;Microservices are an architectural and organizational approach to software development in which software is composed of small, independent services that communicate over well-defined APIs. It is difficult to talk about microservices without talking about containers: these services are typically containerized and run on a container platform such as Docker.&lt;/p&gt;

&lt;p&gt;Before exploring the various services provided by Microsoft Azure, let’s quickly review containers. A container image packages software and its dependencies into an immutable artifact, and each change to the image forms a new layer. Containerization helps developers build and deploy applications faster and more securely.&lt;/p&gt;

&lt;p&gt;Microsoft Azure provides various services to run the containers in their cloud computing platform. In this article, you will learn about some of the services available in Azure for container deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Container Instances (ACI)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpzdf1dbf6e5j6nftah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpzdf1dbf6e5j6nftah.png" alt="Image description" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Container Instances (ACI) is a service for running containers in an isolated environment without worrying about orchestration. Typical use cases include data processing, event-driven applications, short-lived batch jobs, development or test environments, and running containers for immediate use with minimal effort.&lt;/p&gt;

&lt;p&gt;ACI provides excellent flexibility, allowing you to deploy individual containers or multi-container groups. It can be considered a low-level "building block" compared to Container Apps; advanced features like autoscaling, load balancing, and automatic certificates are not provided by ACI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Kubernetes Service (AKS)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0rcbqfgpvj3h17ulliv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0rcbqfgpvj3h17ulliv.png" alt="Image description" width="640" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Kubernetes Service (AKS) is a fully managed container orchestration service built on the popular Kubernetes platform. AKS provides a managed control plane, reducing the operational overhead of running a Kubernetes cluster, and it simplifies the deployment, management, and scaling of containerized applications. It offers automated scaling, load balancing, and self-healing, along with all the features of upstream Kubernetes. AKS is suitable for complex, production-grade applications that require high availability, scalability, and control. Additionally, AKS integrates natively with other Azure services, such as Azure DevOps, Azure Active Directory authentication, and Azure Monitor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Service Fabric
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bWmtYpWN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/service-fabric/media/service-fabric-cloud-services-migration-differences/topology-service-fabric.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bWmtYpWN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/service-fabric/media/service-fabric-cloud-services-migration-differences/topology-service-fabric.png" alt="Service fabric topology" width="762" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Service Fabric is Microsoft's distributed Platform-as-a-Service (PaaS) used to build and deploy microservices-based cloud applications. It supports both containerized and non-containerized workloads. With Service Fabric, you can deploy containers to a managed cluster and take advantage of its robust scalability, high availability, and automatic scaling features.&lt;/p&gt;

&lt;p&gt;Service Fabric addresses significant challenges in developing and managing cloud applications. It can deploy Docker and Windows Server containers, and it also supports arbitrary executables and direct, code-level integrations as stateful services that run alongside containerized services. Within a Service Fabric application, users can integrate with Azure Pipelines, Azure DevOps Services, Azure Monitor, and Azure Key Vault. Service Fabric is a good choice for applications with complex inter-service communication and stateful requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Functions with Containers
&lt;/h2&gt;

&lt;p&gt;Azure Functions allows you to run serverless functions deployed as containers. It combines the flexibility of containers with the event-driven, pay-per-execution model of Azure Functions. With this option, you can build and deploy serverless applications packaged as containers.&lt;/p&gt;

&lt;p&gt;Azure Functions with containers offers seamless integration with other Azure services, event sources, and triggers. It is suitable for scenarios where you want to leverage serverless capabilities while maintaining the control and portability of containerized applications. When you create a Functions project using Azure Functions Core Tools with the &lt;code&gt;--docker&lt;/code&gt; option, Core Tools also generates a Dockerfile that builds your container from the correct base image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure App Service
&lt;/h2&gt;

&lt;p&gt;App Service is a Platform as a Service (PaaS) offering from Microsoft. Typically it is used to host HTTP-based web applications, REST APIs, and backend services for mobile applications. You can write these applications in your favorite language, be it .NET, .NET Core, Java, Ruby, Node.js, PHP, or Python. It includes automatic scaling, continuous deployment, and built-in support for popular programming languages and frameworks.&lt;/p&gt;

&lt;p&gt;It enables developers to focus on creating outstanding applications rather than on infrastructure administration. App Service also offers a comprehensive range of tools for developing, deploying, and monitoring apps, as well as integration with Azure DevOps and other popular DevOps tools. It provides a simple, intuitive deployment experience and is well suited to lightweight, web-focused containerized applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to decide which one is a good fit?
&lt;/h2&gt;

&lt;p&gt;The Azure team has shared this decision tree that can be helpful to identify which service is right for your use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0_mnh0l8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/images/compute-choices.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0_mnh0l8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/images/compute-choices.png" title="Compute choices in Azure" alt="Decision tree" width="800" height="871"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Credits: Microsoft documentation&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Comparison
&lt;/h2&gt;

&lt;p&gt;The table below sums up the use cases and benefits of each service.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Azure Services&lt;/th&gt;
&lt;th&gt;Use cases&lt;/th&gt;
&lt;th&gt;Benefits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Container Instances (ACI)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Running containers instantly, batch jobs&lt;/td&gt;
&lt;td&gt;No infrastructure management, quick and easy deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Kubernetes Service (AKS)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High availability and scalability, critical services&lt;/td&gt;
&lt;td&gt;Fully managed orchestration, higher reliability, &lt;br&gt; more control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Service Fabric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Microservices-based applications with complex inter-service communication&lt;/td&gt;
&lt;td&gt;Robust scalability, high availability, and built-in support for state management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Functions with Containers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless and event-driven applications&lt;/td&gt;
&lt;td&gt;Serverless execution, seamless integration with other services, low cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure App Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web applications, lightweight services&lt;/td&gt;
&lt;td&gt;Fully managed platform, automatic scaling and load balancing, seamless integration with other Azure services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Azure offers a comprehensive set of options for deploying containerized applications, catering to a wide range of scenarios and requirements. Whether you prefer serverless execution, container orchestration, a microservices architecture, or a combination of these approaches, Azure has a solution. By leveraging these services, you can take advantage of Azure's scalability, reliability, and integration capabilities to deploy and manage your containerized applications with ease. Choose the option that best aligns with your application's needs and start realizing the benefits of containerization in Azure.&lt;br&gt;
If you need additional help in making the right choice, please don't hesitate to &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  About the Guest Author
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/smita-aglave-b6b8b454/" rel="noopener noreferrer"&gt;Smita&lt;/a&gt;, previously an IT Trainer, dedicated numerous years to assisting individuals and organizations in gaining knowledge about diverse technologies and software development methodologies. Currently, her growing fascination lies in the realm of DevOps, prompting her to delve deeper into research within this field. Smita possesses a profound passion for writing and takes pleasure in disseminating the knowledge she acquires along her journey.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Secure Coding Best Practices</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 17 Jun 2023 13:19:12 +0000</pubDate>
      <link>https://forem.com/cloudraft/secure-coding-best-practices-2c62</link>
      <guid>https://forem.com/cloudraft/secure-coding-best-practices-2c62</guid>
      <description>&lt;p&gt;Every single day, an extensive array of fresh software vulnerabilities is unearthed by diligent security researchers and analysts. A considerable portion of these vulnerabilities emerges due to the absence of secure coding practices. Exploiting such vulnerabilities can have severe consequences, as they possess the potential to severely impair the financial or physical assets of a business, erode trust, or disrupt critical services.&lt;/p&gt;

&lt;p&gt;For organisations reliant on their software for their operations, it becomes imperative for software developers to embrace secure coding practices. Secure coding entails a collection of practices that software developers adopt to fortify their code against cyberattacks and vulnerabilities. By adhering to coding standards that embody best practices, developers can incorporate safeguards that minimise the risks posed by vulnerabilities in their code.&lt;/p&gt;

&lt;p&gt;In a world brimming with cyber threats, secure coding cannot be viewed as optional if a business intends to maintain its shield of protection.&lt;/p&gt;

&lt;p&gt;In this article, we will explore some anti-patterns and best practices that we can include in our workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns
&lt;/h2&gt;

&lt;p&gt;Now, let's briefly look at some common mistakes, or anti-patterns, that lead to insecure code. The following are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insufficient validation of input data or processing inputs without proper encoding or sanitisation.&lt;/li&gt;
&lt;li&gt;Constructing SQL queries by concatenating strings, making the code vulnerable to data leaks or injection attacks.&lt;/li&gt;
&lt;li&gt;Failure to implement robust authentication, such as storing credentials in plain text without proper hashing and encryption.&lt;/li&gt;
&lt;li&gt;Poor design of password recovery mechanisms and infrequent rotation of security keys.&lt;/li&gt;
&lt;li&gt;Software planning and design lacking strong authorisation schemes.&lt;/li&gt;
&lt;li&gt;Granting excessive privileges during development or troubleshooting.&lt;/li&gt;
&lt;li&gt;Exposing sensitive information in debug logging without appropriate redaction.&lt;/li&gt;
&lt;li&gt;Utilising third-party libraries from untrusted sources or neglecting security checks.&lt;/li&gt;
&lt;li&gt;Unsafe handling of memory pointers or allowing pointer access beyond system boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these common mistakes in mind, let's explore practices and tools that can guide developers towards secure coding practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure Coding Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Shift left in software development lifecycle
&lt;/h3&gt;

&lt;p&gt;Historically, the conventional practice involved assigning the software security team to conduct security testing towards the conclusion of a software development project. The team would assess the application and compile a list of issues that require resolution. At this stage, the identified fixes would be prioritised, resulting in some vulnerabilities being addressed while others remained unattended. The reasons for leaving certain vulnerabilities unresolved could range from cost constraints and limited resources to pressing business priorities.&lt;/p&gt;

&lt;p&gt;However, this traditional approach is no longer sustainable. Security considerations must now be incorporated right from the outset—the initial stages—of the software development lifecycle. Security should be taken into account during the design phase itself. Both manual and automated testing should be conducted throughout the application's implementation as part of the Continuous Integration (CI) pipeline, ensuring that developers receive prompt feedback.&lt;/p&gt;

&lt;p&gt;To aid in this endeavour, static code analysis becomes invaluable. This technique scans code for security flaws and risks even while developers are actively writing it within an integrated development environment (IDE). For instance, SAST (static application security testing) tools can analyse code for security vulnerabilities during development, facilitating early identification and mitigation of potential risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input validation
&lt;/h3&gt;

&lt;p&gt;Ensuring the integrity of input data as it enters a system holds great significance. It is essential to validate the syntactic and semantic accuracy of all incoming data, considering it as untrusted. Employing checks and regular expressions aids in verifying the correctness, size, and syntax of the input.&lt;/p&gt;

&lt;p&gt;Performing these validations on the server side is highly recommended. In the case of web applications, it involves scrutinising various components, including HTTP headers, cookies, GET and POST parameters, as well as file uploads.&lt;/p&gt;

&lt;p&gt;Client-side validation also proves beneficial, contributing to an enhanced user experience by reducing the need for multiple network requests resulting from invalid inputs. This approach minimises back-and-forth communication and enhances efficiency.&lt;/p&gt;
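&lt;p&gt;A minimal sketch of such server-side validation, using an illustrative username field; the actual rules depend on your data model.&lt;/p&gt;

```python
# Server-side input validation sketch; the field rules are illustrative assumptions.
import re

# Syntax check: allowed characters and size bounds in a single pattern.
USERNAME_RE = re.compile(r"^[A-Za-z0-9_]{3,30}$")

def validate_username(raw: str) -> str:
    """Treat input as untrusted: check type, size, and syntax before use."""
    if not isinstance(raw, str):
        raise ValueError("username must be a string")
    if not USERNAME_RE.fullmatch(raw):
        raise ValueError("username must be 3 to 30 letters, digits, or underscores")
    return raw

print(validate_username("alice_42"))  # prints alice_42
```

&lt;p&gt;The same check would run again on the client purely for user experience; the server-side check remains the one that counts.&lt;/p&gt;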

&lt;h3&gt;
  
  
  Parameterised queries
&lt;/h3&gt;

&lt;p&gt;During the process of storing and retrieving data, developers frequently engage with datastores. However, if they overlook the use of parameterised queries, attackers can exploit widely accessible tools and manipulate inputs to extract sensitive information. SQL injection, one of the most dangerous application risks, is a common form of such attacks.&lt;/p&gt;

&lt;p&gt;By incorporating placeholders for parameters within the query, the specified parameters are treated as data rather than being considered as part of the SQL command itself. To mitigate these vulnerabilities, it is recommended to employ prepared statements or object-relational mapping (ORM) techniques. These approaches offer effective measures to safeguard against SQL injection and related threats.&lt;/p&gt;
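&lt;p&gt;A small sketch with the standard library's &lt;code&gt;sqlite3&lt;/code&gt; driver (schema and data are illustrative) showing how a placeholder defuses a classic injection payload.&lt;/p&gt;

```python
# Parameterised-query sketch with the stdlib sqlite3 driver; schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "alice@example.com"))

# The placeholder treats user input as data, never as part of the SQL command,
# so a classic injection payload matches nothing instead of altering the query.
user_input = "alice' OR '1'='1"
rows = conn.execute("SELECT email FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)  # [] - the payload is just an unmatched string literal

safe = conn.execute("SELECT email FROM users WHERE name = ?", ("alice",)).fetchone()
print(safe[0])  # alice@example.com
```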

&lt;h3&gt;
  
  
  Encoding data
&lt;/h3&gt;

&lt;p&gt;Encoding data plays a vital role in mitigating threats by transforming potentially hazardous special characters into a sanitised form. Output encoding for the target context, such as HTML entity encoding, protects against cross-site scripting (XSS) and other client-side injection attacks, while encodings like Base64 provide a safe way to transport binary data.&lt;/p&gt;

&lt;p&gt;To enhance security, it is crucial to specify appropriate character sets, such as UTF-8, and encode data into a standardised character set before further processing. Additionally, employing canonicalisation techniques proves beneficial. For instance, simplifying characters to their basic form helps address issues such as double encoding and obfuscation attacks, thereby bolstering overall security measures.&lt;/p&gt;
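&lt;p&gt;A brief standard-library sketch of two of these ideas: canonicalising input to a single normal form before validation, and using Base64 as a safe transport encoding. For HTML output, &lt;code&gt;html.escape&lt;/code&gt; performs this kind of character encoding.&lt;/p&gt;

```python
# Canonicalisation and transport-encoding sketch using only the standard library.
import base64
import unicodedata

# Canonicalise to NFKC before validation, so visually equivalent Unicode
# sequences reduce to one basic form and cannot slip past character checks.
fancy = "\uFB01le"  # the 'fi' ligature followed by 'le'
print(unicodedata.normalize("NFKC", fancy))  # file

# Base64 turns arbitrary bytes into a safe ASCII alphabet for transport.
payload = "naïve input".encode("utf-8")
encoded = base64.b64encode(payload).decode("ascii")
print(base64.b64decode(encoded).decode("utf-8"))  # naïve input
```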

&lt;h3&gt;
  
  
  Implement identity and authentication controls
&lt;/h3&gt;

&lt;p&gt;To further enhance security and minimise the risk of breaches, secure coding practices emphasise the importance of verifying a user's identity at the outset and integrating robust authentication controls into the application's code.&lt;/p&gt;

&lt;p&gt;Here are some recommended measures to achieve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employ strong authentication methods, such as multi-factor authentication, to add an additional layer of security.&lt;/li&gt;
&lt;li&gt;Consider incorporating biometric authentication methods, such as fingerprint or facial recognition, especially in mobile applications.&lt;/li&gt;
&lt;li&gt;Ensure secure storage of passwords. Typically, this involves hashing each password with a strong, salted hashing function and storing only the resulting hash in the database.&lt;/li&gt;
&lt;li&gt;Implement a secure password recovery mechanism to facilitate password resets while maintaining security.&lt;/li&gt;
&lt;li&gt;Enable session timeouts and inactivity periods to automatically terminate idle sessions.&lt;/li&gt;
&lt;li&gt;For sensitive operations like modifying account information, enforce re-authentication to validate the user's identity.&lt;/li&gt;
&lt;li&gt;Conduct regular audits of authentication transactions to detect any suspicious activities and maintain a vigilant stance against potential threats.&lt;/li&gt;
&lt;/ul&gt;
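&lt;p&gt;The password-storage points above can be sketched with the standard library alone; the iteration count is an illustrative figure to tune against current guidance and your hardware.&lt;/p&gt;

```python
# Salted password-hashing sketch with the stdlib; parameters are illustrative.
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumption: tune to hardware and current guidance

def hash_password(password: str, salt=None):
    """Return (salt, hash) for storage; a fresh random salt per password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))  # False
```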

&lt;h3&gt;
  
  
  Implement access controls
&lt;/h3&gt;

&lt;p&gt;Incorporating a well-thought-out authorisation strategy during the initial stages of application development can greatly enhance the overall security posture. Authorisation entails determining the specific resources that an authenticated user can or cannot access.&lt;/p&gt;

&lt;p&gt;Consider the following guidelines to strengthen the authorisation framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish a sequential flow of authentication followed by authorisation. Implement a mechanism where all requests undergo access control checks.&lt;/li&gt;
&lt;li&gt;Adhere to the principle of least privilege, initially denying access to any resource that has not been explicitly configured for access control.&lt;/li&gt;
&lt;li&gt;Enforce time-based limitations on user or system component actions by implementing expiration times, thereby ensuring that actions have defined timeframes for execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these practices, developers can create a robust and effective authorisation system that bolsters the overall security of the application.&lt;/p&gt;
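&lt;p&gt;The deny-by-default principle above can be illustrated with a tiny rule table; the roles and resources here are hypothetical.&lt;/p&gt;

```python
# Deny-by-default authorisation sketch; roles and resources are illustrative.
# Only combinations explicitly listed in the rules are ever granted.
ACCESS_RULES = {
    ("admin", "billing"): True,
    ("analyst", "reports"): True,
}

def is_allowed(role: str, resource: str) -> bool:
    """Least privilege: anything not explicitly allowed is denied."""
    return ACCESS_RULES.get((role, resource), False)

print(is_allowed("admin", "billing"))    # True
print(is_allowed("analyst", "billing"))  # False - no rule, so denied
```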

&lt;h3&gt;
  
  
  Protect sensitive data
&lt;/h3&gt;

&lt;p&gt;In order to comply with legal and regulatory obligations, it is the responsibility of businesses to safeguard customer data. This sensitive data encompasses various categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personally identifiable information (PII)&lt;/li&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Health records&lt;/li&gt;
&lt;li&gt;Web browser data&lt;/li&gt;
&lt;li&gt;Mobile data, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent data leakage, it is crucial to employ robust encryption methods for both data at rest and data in transit. Consider the following practices to enhance data protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilise a well-established, peer-reviewed cryptographic library and functions that have been vetted and approved by your security team.&lt;/li&gt;
&lt;li&gt;Avoid storing encryption keys alongside the encrypted data to prevent unauthorised access.&lt;/li&gt;
&lt;li&gt;Refrain from storing confidential or sensitive data in memory, temporary locations, or log files during processing.&lt;/li&gt;
&lt;li&gt;Implement redaction techniques in log forwarders to remove sensitive information.&lt;/li&gt;
&lt;li&gt;Implement mandatory re-authentication when accessing sensitive data within the application.&lt;/li&gt;
&lt;/ul&gt;
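&lt;p&gt;A sketch of the log-redaction point above; the two patterns are illustrative only, and a production forwarder would need patterns tuned to the sensitive data you actually handle.&lt;/p&gt;

```python
# Log-redaction sketch; the patterns below are illustrative, not exhaustive.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),  # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),    # card-like digit runs
]

def redact(line: str) -> str:
    """Replace sensitive fragments before the line leaves the application."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line

print(redact("payment by alice@example.com with 4111 1111 1111 1111"))
# payment by [EMAIL] with [CARD]
```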

&lt;h3&gt;
  
  
  Implement logging and intrusion detection
&lt;/h3&gt;

&lt;p&gt;Even the most meticulously designed system can be susceptible to exploitation by attackers. Therefore, it is advisable to incorporate a monitoring system that can detect and identify unusual events. It is crucial to ensure that sufficient information is logged concerning authentication, authorisation, and resource access events. This logging should include details such as timestamps, the origin of access requests, IP addresses, and information pertaining to the requested resource. It is important to store this information in a secure and protected log. Typically, these logs are transmitted in real time to a centralised system where they are analysed for any anomalies. Prior to logging, apply encoding techniques to the untrusted data to safeguard against log injection attacks.&lt;/p&gt;
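&lt;p&gt;The encoding step can be as simple as neutralising line breaks in untrusted fields, so a single request can never forge additional log entries. A minimal sketch, with an assumed log format:&lt;/p&gt;

```python
# Sketch of neutralising untrusted data before it reaches a log line, so an
# attacker cannot forge entries by embedding newlines (log injection).
import datetime

def log_auth_event(user_supplied: str, outcome: str) -> str:
    # Encode CR/LF so one request only ever produces one log line.
    safe = user_supplied.replace("\r", "\\r").replace("\n", "\\n")
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return f"{stamp} auth outcome={outcome} user={safe!r}"

# A payload that tries to forge a successful-login entry stays on one line.
print(log_auth_event("bob\n2026-01-01T00:00:00 auth outcome=success user=admin", "failure"))
```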

&lt;p&gt;In the event of a security breach, it is essential to have a well-documented playbook in place to promptly terminate system access, mitigating the risk of further data leakage. By following these practices, organisations can enhance their ability to detect and respond to potential intrusions, minimising the impact of security incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage security frameworks and libraries
&lt;/h3&gt;

&lt;p&gt;Avoid unnecessary duplication of effort. Instead, leverage established security frameworks and libraries that have been proven effective. When incorporating such components into your project, ensure they are sourced from reliable and trusted third-party repositories. It is important to regularly assess these libraries for any vulnerabilities or weaknesses and proactively keep them up to date.&lt;/p&gt;

&lt;p&gt;By adopting this approach, you can benefit from the expertise and experience embedded in these established security solutions, saving valuable time and effort while maintaining a strong security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor error and exception handling
&lt;/h3&gt;

&lt;p&gt;In line with the logging best practices above, it is advisable to adopt a centralised approach for handling and monitoring errors and exceptions, with tools like Sentry. Effective management of errors and exceptions is crucial, as mishandling them can inadvertently expose valuable information to potential attackers, enabling them to gain insights into your application and platform design.&lt;/p&gt;

&lt;p&gt;Consider the following measures to strengthen error and exception handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid logging sensitive information within error messages to prevent inadvertent disclosure.&lt;/li&gt;
&lt;li&gt;Regularly conduct code reviews to identify and address any weaknesses or vulnerabilities in the error handling implementation.&lt;/li&gt;
&lt;li&gt;Utilise negative testing techniques, such as exploratory and penetration testing, fuzzing, and fault injection, to actively identify and rectify potential issues related to error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these practices, you can ensure that error and exception handling is performed securely and with minimal risk of exposing sensitive information to potential attackers.&lt;/p&gt;
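&lt;p&gt;A minimal sketch of these points: the full detail goes to internal logs with a correlation id, while the caller sees only a generic message. The handler and response shape are illustrative.&lt;/p&gt;

```python
# Exception-handling sketch: log detail internally, return a generic message.
import logging
import uuid

logger = logging.getLogger("app")

def handle_request(payload: dict) -> dict:
    try:
        amount = int(payload["amount"])
        return {"status": "ok", "amount": amount}
    except Exception:
        ref = uuid.uuid4().hex[:8]
        # The stack trace stays in internal logs, never in the response body;
        # the short reference id lets operators correlate report and log entry.
        logger.exception("request failed, ref=%s", ref)
        return {"status": "error", "message": f"Request failed (ref {ref})"}

print(handle_request({"amount": "19"}))
print(handle_request({}))  # generic message, no KeyError details leaked
```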

&lt;h2&gt;
  
  
  Benefits of implementing secure coding practices
&lt;/h2&gt;

&lt;p&gt;At this point, the advantages of embracing secure coding practices should be evident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorporating automated checks and code analysis during the development process enhances developer productivity by promptly providing feedback to improve code security. This leads to quicker time-to-market and higher-quality code.&lt;/li&gt;
&lt;li&gt;Cost optimisation within the software development lifecycle is achieved by minimising bugs at the early stages.&lt;/li&gt;
&lt;li&gt;Static application security testing (SAST) tools offer developers of all skill levels guardrails, AppSec governance, and valuable insights through IDE plugins. These tools equip developers with the necessary knowledge and resources to bolster application security.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Throughout our examination of coding flaws that can result in vulnerabilities, we have also explored best practices to enhance the security stance of software. However, in the context of large-scale projects, it can be daunting to implement these practices while ensuring proper governance.&lt;/p&gt;

&lt;p&gt;In the realm of extensive projects, the following considerations can help navigate these challenges effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish clear governance frameworks that outline security requirements, procedures, and responsibilities.&lt;/li&gt;
&lt;li&gt;Develop comprehensive guidelines and standards that align with secure coding practices and provide actionable steps for implementation.&lt;/li&gt;
&lt;li&gt;Foster collaboration and communication among development teams, security experts, and stakeholders to ensure a shared understanding of security goals and the necessary measures to achieve them.&lt;/li&gt;
&lt;li&gt;Prioritise the implementation of security measures by identifying high-risk areas and focusing resources accordingly.&lt;/li&gt;
&lt;li&gt;Regularly assess and review the security posture of the software throughout the development lifecycle, enabling continuous improvement and adjustments as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By adopting these approaches, the process of implementing secure coding practices within large projects becomes more manageable and ensures that proper governance is in place to safeguard against vulnerabilities effectively.&lt;/p&gt;

&lt;p&gt;It is advisable to create and automate workflows using SAST tools and integrate them into CI to enforce these best practices. Feel free to schedule a no-obligation &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;call with us&lt;/a&gt; to discuss your DevSecOps strategy; we can help you improve your current practice.&lt;/p&gt;

</description>
      <category>devsecops</category>
      <category>security</category>
      <category>consulting</category>
    </item>
  </channel>
</rss>
