<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anjul Sahu</title>
    <description>The latest articles on Forem by Anjul Sahu (@anjuls).</description>
    <link>https://forem.com/anjuls</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353412%2F7868dbdb-029d-4ab4-b332-7c1cfdc1476a.jpg</url>
      <title>Forem: Anjul Sahu</title>
      <link>https://forem.com/anjuls</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anjuls"/>
    <language>en</language>
    <item>
      <title>Heroku to Kubernetes Migration: Clock is ticking</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 11 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</link>
      <guid>https://forem.com/cloudraft/heroku-to-kubernetes-migration-clock-is-ticking-13g7</guid>
      <description>&lt;p&gt;For years, Heroku has been a beloved starting point for countless high-growth companies. It was revolutionary, making the deployment of an idea almost trivial. That focus on the developer experience—on simply pushing code and having it run—is why so many successful Minimum Viable Products (MVPs) and early-stage platforms were born there. It allowed engineering leadership to focus on product-market fit (PMF) instead of infrastructure.&lt;/p&gt;

&lt;p&gt;But a platform that simplifies everything also imposes limits, and for any company that has scaled past the initial bootstrap phase, those limits eventually hit two core metrics: &lt;strong&gt;control&lt;/strong&gt; and &lt;strong&gt;cost&lt;/strong&gt;. What starts as the fastest way to market often becomes a budget bottleneck and a strategic constraint.&lt;/p&gt;

&lt;p&gt;Today, with new structural changes at Heroku, the conversation about migration is no longer a matter of "if" or "when," but "now." For any business running a production-critical, profitable service, moving to Kubernetes is no longer just an optimization—it’s a necessary step to secure the next decade of growth and maintain technical sovereignty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Shift at Heroku
&lt;/h2&gt;

&lt;p&gt;On February 6, 2026, Heroku &lt;a href="https://www.heroku.com/blog/an-update-on-heroku/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; a significant strategic realignment. The platform is now transitioning to what they call a sustaining engineering model.&lt;/p&gt;

&lt;p&gt;What does that actually mean for you as a business? It means a shift in investment priority. Heroku remains a stable, production-ready environment, with continued focus on core areas like security, stability, reliability, and support. For existing credit card-paying customers, the day-to-day operations and services remain unchanged.&lt;/p&gt;

&lt;p&gt;The critical piece of news, however, is that Enterprise Account contracts will no longer be offered to new customers. While existing enterprise contracts will be honored, this decision sends a clear strategic signal: Salesforce, the parent of Heroku, is focusing its future engineering efforts elsewhere—specifically on helping organizations build and deploy enterprise-grade AI in a secure way, rather than focusing on the core, undifferentiated platform features that many growth companies rely on.&lt;/p&gt;

&lt;p&gt;In short, the platform you relied on for your MVP is telling you, quite clearly, that its main focus is changing. For a high-growth business, relying on a platform that has decided to stop innovating in your core area of need is an unacceptable risk. The decision to migrate has now moved from a "good idea" to a strategic imperative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Kubernetes a good choice?
&lt;/h2&gt;

&lt;p&gt;The cloud landscape has matured dramatically since Heroku first took center stage. While Heroku pioneered the developer-first experience, Kubernetes is now an industry standard, and the majority of companies already run it in production. For any company that has achieved PMF, Kubernetes offers benefits that directly address the pain points of a scaled Heroku implementation. You may ask why not use alternatives such as Portainer, Render, or Fly.io. You certainly can, but with each of them you remain a tenant on someone else's platform; Kubernetes is about gaining real control over both the platform and the spending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reclaiming Sovereignty and Control
&lt;/h3&gt;

&lt;p&gt;With Heroku, you are a tenant in a strictly controlled environment. That simplicity is powerful, but it comes at the cost of ultimate control. Kubernetes flips that dynamic. It gives you the blueprint for your entire infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multicloud and Hybrid Strategy:&lt;/strong&gt; Kubernetes is a universal API for infrastructure. It provides the freedom to easily shift workloads between major cloud providers (AWS, GCP, Azure), deploy on-premise, or adopt a hybrid strategy. This ability to change providers is a powerful negotiating tool and a key piece of business continuity planning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Sales Enablement:&lt;/strong&gt; For B2B SaaS, especially those with AI-native features, enterprise customers often require strict data sovereignty. They need to self-host services on their own virtual private clouds or on-premise. Heroku's architecture simply cannot support this. A Kubernetes-based platform enables you to offer a self-deployed version of your SaaS product, unlocking massive new markets in highly regulated or security-conscious industries. The control Kubernetes offers over data residency and compliance is non-negotiable for selling to large enterprise customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Efficiency
&lt;/h3&gt;

&lt;p&gt;The Heroku pricing model is famously straightforward: it’s easy to calculate, but it is expensive as you scale. This is the trade-off for simplicity.&lt;/p&gt;

&lt;p&gt;By moving to Kubernetes, you gain fine-grained control over resource allocation. You can right-size your instances, consolidate workloads, and select the most cost-effective machine types for specific tasks. While the initial setup requires more attention, the long-term cost savings are significant, especially for services with unpredictable or high-volume usage.&lt;/p&gt;

&lt;p&gt;The ecosystem itself has worked to smooth out the initial complexity. Major cloud providers now offer "autopilot" modes in their managed Kubernetes services that handle much of the underlying operational overhead. This means you can gain the cost and control benefits of Kubernetes without the burden of building a large platform engineering team.&lt;/p&gt;

&lt;p&gt;At CloudRaft, we recognize the need to simplify this process. We’ve built an accelerator called TurboRaft that is essentially a proven playbook for the modern Kubernetes platform. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps with ArgoCD:&lt;/strong&gt; For zero-touch, automated, and auditable releases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Secured secret management, automated certificate management, SAST, SBOMs and vulnerability management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Open-source monitoring with options to choose from and alerting to keep costs low while maintaining deep insight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Clear policies enforced for compliance and cost control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to deliver the "Heroku-like" ease of use for developers, but on a platform you own and control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maturation of the Kubernetes ecosystem
&lt;/h3&gt;

&lt;p&gt;A few years ago, managing Kubernetes was a job for seasoned experts. Today, much of that complexity has been absorbed by a robust and mature ecosystem. Open-source tooling, managed cloud services, and a deep community knowledge base have all contributed to making K8s a practical and reliable choice.&lt;/p&gt;

&lt;p&gt;The old argument that "Kubernetes is too complex" is mostly obsolete for a growing company. The market has solved the hardest parts. What’s left is a highly stable platform that provides the operational rigor required to run business-critical services. The Hacker News discussion &lt;a href="https://news.ycombinator.com/item?id=37379078" rel="noopener noreferrer"&gt;thread&lt;/a&gt; on the Heroku news highlights this exact sentiment, with many leaders realizing that the ecosystem is ready for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  A structured approach to migration
&lt;/h2&gt;

&lt;p&gt;No platform migration is easy; it’s a non-trivial engineering effort that must be planned as a business-critical project. Done correctly, it is an opportunity to not just move your app, but to make it stronger and more resilient for the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Assessment and Re-Architecture
&lt;/h3&gt;

&lt;p&gt;This is the most crucial phase. A migration should also be seen as a refactoring opportunity. If your application isn't strictly following cloud-native principles or the &lt;strong&gt;Twelve-Factor App&lt;/strong&gt; methodology, now is the time to correct it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk Identification:&lt;/strong&gt; We begin with a full risk assessment, examining each service in the application. We categorize them by current stability, coupling, and size to create a phased migration plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sizing and Cost Modeling:&lt;/strong&gt; Understanding the true resource needs of each service allows us to create accurate Kubernetes deployment specifications and a detailed cost projection for the new platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Simplifying the Developer Experience
&lt;/h3&gt;

&lt;p&gt;The biggest win of Heroku was the abstraction of infrastructure. We need to replicate that ease of use on Kubernetes. Developers should not need to become Kubernetes experts overnight.&lt;/p&gt;

&lt;p&gt;We convert services into Kubernetes deployments using Helm charts, then we abstract the low-level Kubernetes constructs. The goal is a simplified interface—whether it’s a basic YAML or JSON configuration—that lets developers manage their application settings without worrying about the underlying cluster management. This retains the core developer efficiency that made Heroku so appealing.&lt;/p&gt;
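&lt;p&gt;As a rough sketch of that abstraction layer (the input field names are illustrative, not a real API), a thin wrapper can expand a small developer-facing config into a full Kubernetes Deployment manifest:&lt;/p&gt;

```python
# Sketch: expand a simplified, developer-facing config into a Kubernetes
# Deployment manifest. Field names in the input dict are illustrative.
def render_deployment(app):
    name = app["name"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": app.get("replicas", 1),
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": app["image"],
                    }],
                },
            },
        },
    }

# The whole developer-facing surface: name, image, replica count.
manifest = render_deployment({"name": "web", "image": "myapp:1.4.2", "replicas": 3})
```

&lt;p&gt;Developers edit only the three-field config, much like a Procfile plus a scaling knob; the platform team owns everything the wrapper fills in.&lt;/p&gt;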

&lt;h3&gt;
  
  
  Step 3: The Data Migration Challenge
&lt;/h3&gt;

&lt;p&gt;Applications are often the easy part; the database is where the real complexity lies. A successful migration requires a strategy for moving data with near-zero downtime.&lt;/p&gt;

&lt;p&gt;We strongly recommend self-hosted database solutions on Kubernetes, particularly CloudNativePG for PostgreSQL. Running your own highly available, self-managed database on Kubernetes removes the premium cost of proprietary cloud-managed services while providing superior control over failover and disaster recovery. We’ve found CloudNativePG to be highly reliable and offer &lt;a href="http://www.cloudraft.io/postgresql-consulting" rel="noopener noreferrer"&gt;full consulting and support&lt;/a&gt; to ensure a smooth, near-zero-downtime data migration. Database upgrades and day-to-day management were easy on Heroku; with CloudNativePG and our best practices, you can keep your database on autopilot on your own platform too.&lt;/p&gt;
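&lt;p&gt;For a sense of how compact the setup is, a minimal CloudNativePG Cluster resource looks roughly like the following (shown here as a Python dict; the names and sizes are placeholders, so check the CloudNativePG documentation for your version before relying on it):&lt;/p&gt;

```python
# Rough shape of a minimal CloudNativePG Cluster resource; all values are
# placeholders for illustration, not production settings.
cnpg_cluster = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": "app-db"},
    "spec": {
        "instances": 3,  # one primary plus two replicas for high availability
        "storage": {"size": "50Gi"},
        "bootstrap": {"initdb": {"database": "app", "owner": "app"}},
    },
}
```

&lt;p&gt;From this single resource, the operator manages replication and automated failover for the cluster.&lt;/p&gt;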

&lt;h2&gt;
  
  
  The time to act is now
&lt;/h2&gt;

&lt;p&gt;The shift at Heroku is a clear alarm bell. Ignoring it means accepting escalating costs and a growing strategic risk. You now have a proven, mature, and cost-effective alternative in Kubernetes.&lt;/p&gt;

&lt;p&gt;Success in this migration hinges on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Selecting a Proven Playbook:&lt;/strong&gt; You need a tested, end-to-end framework that accounts for application, database, and operational complexities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Right Team:&lt;/strong&gt; You need a partner who has navigated this journey before and can deliver the platform quickly, abstracting away the unnecessary complexity while leaving you with full control.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where &lt;strong&gt;CloudRaft&lt;/strong&gt; comes in. We offer not just the accelerator, but the &lt;a href="https://dev.to/kubernetes-consulting"&gt;consulting&lt;/a&gt; and operational support to execute the migration and hand over a platform that is ready for enterprise-level growth. Don't wait until the cost pressure or strategic uncertainty becomes a crisis—secure your future with a modern, controlled, and cost-efficient Kubernetes platform today.&lt;/p&gt;

</description>
      <category>heroku</category>
    </item>
    <item>
      <title>Context Graphs for AI Agents: The Complete Implementation Guide</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</link>
      <guid>https://forem.com/cloudraft/context-graphs-for-ai-agents-the-complete-implementation-guide-4jko</guid>
      <description>&lt;h2&gt;
  
  
  Why Do Context Graphs Matter Now for AI Agents?
&lt;/h2&gt;

&lt;p&gt;In the past few months, AI has shifted from chatbots to agents: autonomous systems that don't just answer questions but make decisions, approve exceptions, route escalations, and execute workflows across enterprise systems. &lt;a href="https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/" rel="noopener noreferrer"&gt;Foundation Capital&lt;/a&gt; recently called this shift AI's "trillion-dollar opportunity," arguing that enterprise value is migrating from traditional systems of record to systems that capture decision traces, the "why" behind every action.&lt;/p&gt;

&lt;p&gt;But here's the problem: agents deployed without proper context infrastructure are failing at scale, with customers reporting "1,000+ AI instances with no way to govern them" and "all kinds of agentic tools that none talk to each other" as stated in &lt;a href="https://metadataweekly.substack.com/p/context-graphs-are-a-trillion-dollar" rel="noopener noreferrer"&gt;Metadata Weekly&lt;/a&gt;. The issue isn't the AI models themselves, it's that agents lack the structured knowledge foundation they need to reason reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Missing Infrastructure: Relationship-Based Context
&lt;/h3&gt;

&lt;p&gt;According to &lt;a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/" rel="noopener noreferrer"&gt;MIT Sloan Management Review&lt;/a&gt;, 47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024. Even when agents don't hallucinate outright, they struggle with multi-step reasoning that requires connecting distant facts across systems. An agent might know a customer filed a complaint and know about a recent product defect and know the refund policy, but fail to connect these relationships to understand why an exception should be granted.&lt;/p&gt;

&lt;p&gt;As Prukalpa Sankar, co-founder of Atlan, frames it in her &lt;a href="https://atlan.com/know/closing-the-context-gap/" rel="noopener noreferrer"&gt;article&lt;/a&gt;: "In 2025, in the dawn of the AI era, context is king." Context Graphs provide this missing infrastructure by organizing information as an interconnected network of entities and relationships, enabling &lt;a href="https://dev.to/ai-solutions"&gt;AI agents&lt;/a&gt; to traverse meaningful connections, reason across multiple facts, and deliver explainable decisions.&lt;/p&gt;

&lt;p&gt;This comprehensive guide explains what Context Graphs are, how they work, and why they're becoming essential infrastructure for enterprise AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Context Graph? Definition, Use Cases &amp;amp; Implementation Guide
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1769615825%2Fblogs%2Fcontext-graph-for-ai-agents%2Fcontext_graph_hdua0t.avif" alt="Context Graph" width="1536" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Context Graphs Work
&lt;/h3&gt;

&lt;p&gt;Context Graphs transform raw data into a semantic network of nodes (entities like people or projects), directed edges (relationships such as "worked_on" or "depends_on"), and properties (key-value details on both). This structure enables AI agents to perform graph traversals, starting from a query node and following relevant edges, for dynamic context assembly and multi-hop reasoning, unlike rigid keyword or vector searches.&lt;/p&gt;

&lt;h4&gt;
  
  
  Core Components:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes:&lt;/strong&gt; Represent real-world entities (e.g. "ProjectX"). Each holds properties like name, type, or timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges:&lt;/strong&gt; Directed connections with types (e.g. → "worked_on" →) and properties (e.g. role: "lead", duration: "6 months"). Directions indicate flow, like cause-effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Properties:&lt;/strong&gt; Metadata attached to nodes/edges (e.g., confidence score on an edge), enabling filtered traversals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Traversal Process:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query Entry:&lt;/strong&gt; Input like "API security projects" matches starting nodes via properties or embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neighbor Expansion:&lt;/strong&gt; Fetch adjacent nodes/edges, prioritizing by relevance (e.g., recency, strength).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Hop Pathfinding:&lt;/strong&gt; Traverse 2-4 hops (e.g. Project → worked_on → Engineer → similar_to → AuthSystem), using algorithms like BFS or HNSW-inspired graphs for efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Assembly:&lt;/strong&gt; Aggregate paths into a subgraph, feeding it to LLMs for grounded reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Log the path for auditing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mirrors vector DB indexing (e.g. HNSW in Pinecone) but emphasizes relational paths over pure similarity.&lt;/p&gt;
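&lt;p&gt;The node/edge/property model and the bounded traversal above can be sketched in a few lines of Python (the entities and relationships are invented for illustration):&lt;/p&gt;

```python
from collections import deque

# Toy context graph: adjacency list of (edge_type, neighbor, edge_properties).
# All entities and relationships are invented for illustration.
graph = {
    "ProjectX":   [("worked_on_by", "Alice", {"role": "lead"})],
    "Alice":      [("also_worked_on", "AuthSystem", {})],
    "AuthSystem": [("depends_on", "OAuth2", {"version": "2.0"})],
    "OAuth2":     [],
}

def traverse(start, max_hops=3):
    """Breadth-first expansion up to max_hops, recording each path taken."""
    paths = []
    queue = deque([(start, [start], 0)])
    seen = {start}
    while queue:
        node, path, hops = queue.popleft()
        if hops == max_hops:
            continue
        for edge_type, neighbor, _props in graph.get(node, []):
            new_path = path + [f"-{edge_type}-", neighbor]
            paths.append(new_path)
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, new_path, hops + 1))
    return paths

subgraph_paths = traverse("ProjectX", max_hops=3)
```

&lt;p&gt;Each returned path doubles as the audit log for step 5: it records exactly which relationships led to each piece of assembled context.&lt;/p&gt;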

&lt;h4&gt;
  
  
  Example in Action:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Traditional Vector Search (e.g., Pinecone nearest-neighbor):&lt;/strong&gt; "API security projects" → Returns docs with similar embeddings (e.g. 3 keyword matches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Graph Traversal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;// Sample Cypher query: projects connected to the "API Security" topic
MATCH (p:Project)-[:RELATED_TO]-&amp;gt;(t:Topic {name: 'API Security'})-[*1..3]-(related)
RETURN *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start:&lt;/strong&gt; Projects tagged "API Security".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 1:&lt;/strong&gt; → worked_on_by → Engineers (properties: skills="OAuth").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 2:&lt;/strong&gt; Engineers → also_worked_on → AuthSystems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop 3:&lt;/strong&gt; AuthSystems → depends_on → OAuthProtocols (properties: version="2.0").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Subgraph with projects, team, deps, contributors—plus path visualization for explainability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Characteristics of Context Graphs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship-Centric Design:&lt;/strong&gt; Context Graphs prioritize connections over isolated records. This makes it natural to understand how concepts relate, not just what they contain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Hop Reasoning:&lt;/strong&gt; The graph structure enables AI to connect distant concepts through intermediate relationships, reasoning across multiple steps just as humans do. Example: Connecting "customer complaint" → "product defect" → "supplier issue" → "quality control process" in three hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Context Assembly:&lt;/strong&gt; Rather than retrieving fixed search results, Context Graphs assemble context on the fly by traversing only the relationships relevant to your specific query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in Explainability:&lt;/strong&gt; Every AI decision can be traced back through its relationship path. You can see exactly how the system reached a conclusion, critical for enterprise and regulated environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Intelligence:&lt;/strong&gt; Context Graphs model sequences, dependencies, and cause-and-effect relationships over time, making them ideal for understanding evolving processes and events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Scalability:&lt;/strong&gt; Modern graph databases handle millions of entities while maintaining fast traversal and query performance at scale.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graph vs Knowledge Graph vs Vector Database
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Context Graph&lt;/th&gt;
&lt;th&gt;Knowledge Graph&lt;/th&gt;
&lt;th&gt;Vector Database&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Focus&lt;/td&gt;
&lt;td&gt;Contextual relationships for AI reasoning&lt;/td&gt;
&lt;td&gt;General knowledge representation&lt;/td&gt;
&lt;td&gt;Semantic similarity matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Type&lt;/td&gt;
&lt;td&gt;Multi-hop traversal&lt;/td&gt;
&lt;td&gt;Structured queries&lt;/td&gt;
&lt;td&gt;Nearest neighbor search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Dynamic AI context assembly&lt;/td&gt;
&lt;td&gt;Structured domain knowledge&lt;/td&gt;
&lt;td&gt;Semantic search, &lt;a href="https://dev.to/what-is/retrieval-augmented-generation"&gt;RAG&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explainability&lt;/td&gt;
&lt;td&gt;High (shows relationship paths)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (similarity scores only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Complexity&lt;/td&gt;
&lt;td&gt;Complex multi-step reasoning&lt;/td&gt;
&lt;td&gt;Medium complexity&lt;/td&gt;
&lt;td&gt;Simple similarity queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These technologies complement each other. Many advanced AI systems use Context Graphs for reasoning combined with &lt;a href="https://dev.to/blog/top-5-vector-databases"&gt;vector databases&lt;/a&gt; for semantic search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Context Graph Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Knowledge Management:&lt;/strong&gt; Connect projects, people, decisions, and outcomes across your organization. Instead of finding where files live, trace how work evolved, what decisions shaped results, and who has relevant expertise. This will reduce your knowledge discovery time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Customer Support:&lt;/strong&gt; Go beyond keyword matching. Connect customer history, product configurations, known issues, and documented resolutions to provide contextually accurate answers. This will reduce your ticket resolution time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scientific Research &amp;amp; Discovery:&lt;/strong&gt; Connect millions of research papers, creating networks of studies, methodologies, findings, and citations. Discover unexpected connections between seemingly unrelated fields. You can identify underexplored research areas by analyzing relationship patterns and citation gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance &amp;amp; Risk Management:&lt;/strong&gt; Map relationships between regulations, internal policies, business processes, and controls. When requirements change, trace exactly where those changes affect systems and workflows. This will reduce your compliance audit preparation time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthcare Diagnostics:&lt;/strong&gt; Connect symptoms, medical history, medications, genetic factors, and research findings. Enable diagnostic systems to reason across these relationships and identify conditions that isolated analysis might miss. This will improve diagnostic accuracy by surfacing relevant but non-obvious connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supply Chain Optimization:&lt;/strong&gt; Model your entire supply network (suppliers, components, products, logistics partners), enabling sophisticated scenario analysis and rapid disruption response. For example, when supply issues arise, you can quickly identify alternative suppliers by traversing compatibility, certification, and performance relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal Research &amp;amp; Analysis:&lt;/strong&gt; Map relationships between cases, statutes, legal principles, and precedents. Trace how legal concepts evolved across jurisdictions and time periods. This would reduce legal research time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Recommendations:&lt;/strong&gt; Go beyond "customers who bought this also bought that." Understand topical relationships, creator connections, and contextual relevance to deliver truly personalized recommendations. This would increase engagement through unexpected but relevant discoveries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial Risk Assessment:&lt;/strong&gt; Model relationships between entities, transactions, accounts, and market factors. Detect complex fraud patterns spanning multiple accounts and understand how risks cascade through connected entities. This would detect more fraud patterns than traditional rule-based systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software Development Intelligence:&lt;/strong&gt; Map relationships between functions, modules, dependencies, documentation, and issues. Understand how code changes ripple through your system before making modifications. This would reduce breaking changes through comprehensive impact analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Context Graphs for AI Agents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce AI Hallucinations:&lt;/strong&gt; Ground AI outputs in explicit, verifiable relationships rather than probabilistic pattern matching alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve Reasoning Accuracy:&lt;/strong&gt; When answers require connecting multiple facts across domains, Context Graphs significantly outperform retrieval-only approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable Explainable AI:&lt;/strong&gt; Expose the exact path the AI took through your knowledge graph, making decisions transparent and auditable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale Without Schema Rigidity:&lt;/strong&gt; Add new entity types and relationships without forcing disruptive schema migrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surface Hidden Insights:&lt;/strong&gt; Discover patterns and connections that are nearly impossible to detect in traditional table or document structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintain Context Across Interactions:&lt;/strong&gt; Preserve relationship context throughout multi-turn conversations, enabling more sophisticated AI interactions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Implement Context Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Select Your Graph Database
&lt;/h3&gt;

&lt;p&gt;Choose based on scale, query patterns, and infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Popular Options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j:&lt;/strong&gt; Most mature, enterprise-ready, excellent query language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Neptune:&lt;/strong&gt; Managed AWS service, good for existing AWS infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TigerGraph:&lt;/strong&gt; Best for massive scale and complex analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArangoDB:&lt;/strong&gt; Multi-model database with graph capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FalkorDB:&lt;/strong&gt; Ultra-fast in-memory graph database built on Redis, best for low-latency real-time applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision Factors:&lt;/strong&gt; Query complexity, data volume, team expertise, budget&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Design Your Relationship Schema
&lt;/h3&gt;

&lt;p&gt;The value of a Context Graph depends on modeling the right entities and relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt; Collaborate closely with domain experts who understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What entities matter in your domain&lt;/li&gt;
&lt;li&gt;Which relationships drive important decisions&lt;/li&gt;
&lt;li&gt;How information flows through your processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Schema (Customer Support):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entities:&lt;/strong&gt; Customer, Ticket, Product, Issue, Resolution, Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships:&lt;/strong&gt; reported_by, relates_to, resolved_with, escalated_to, similar_to&lt;/li&gt;
&lt;/ul&gt;
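&lt;p&gt;A minimal sketch of this schema as a property graph (all records are invented for illustration):&lt;/p&gt;

```python
# Sketch: the customer-support schema above as a tiny property graph.
# Node IDs, properties, and edges are invented sample data.
nodes = {
    "cust-1":   {"type": "Customer", "name": "Acme Corp"},
    "ticket-7": {"type": "Ticket", "status": "open"},
    "prod-2":   {"type": "Product", "name": "Gateway"},
    "issue-9":  {"type": "Issue", "summary": "token expiry"},
}
edges = [
    ("ticket-7", "reported_by", "cust-1", {}),
    ("ticket-7", "relates_to", "prod-2", {}),
    ("ticket-7", "similar_to", "issue-9", {"score": 0.87}),
]

def neighbors(node_id, edge_type=None):
    """Outgoing neighbors of a node, optionally filtered by relationship type."""
    return [dst for src, etype, dst, _props in edges
            if src == node_id and (edge_type is None or etype == edge_type)]
```

&lt;p&gt;A query like &lt;code&gt;neighbors("ticket-7", "similar_to")&lt;/code&gt; surfaces known issues similar to the open ticket, which is exactly the kind of hop an agent chains during traversal.&lt;/p&gt;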

&lt;h3&gt;
  
  
  Step 3: Build Entity Extraction
&lt;/h3&gt;

&lt;p&gt;Identify entities in your source data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Unstructured Text:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NLP pipelines&lt;/li&gt;
&lt;li&gt;Fine-tune LLMs for domain-specific entity recognition&lt;/li&gt;
&lt;li&gt;Implement human-in-the-loop validation for critical entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Structured Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Map existing database fields directly to graph entities&lt;/li&gt;
&lt;li&gt;Normalize entity references across systems&lt;/li&gt;
&lt;/ul&gt;
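
&lt;p&gt;For structured sources, the mapping step can be as simple as the sketch below. The field names (&lt;code&gt;crm_id&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;) are invented for the example; the point is normalizing references so the same customer from different systems maps to one node.&lt;/p&gt;

```python
# Illustrative sketch: map rows from an existing relational table to
# graph entities while normalizing entity references across systems.
# Field names ("crm_id", "email") are assumptions for the example.

def normalize_customer_id(row):
    # Prefer a stable unique identifier; fall back to a normalized email
    # so the same customer seen in different systems maps to one node.
    if row.get("crm_id"):
        return f"customer:{row['crm_id']}"
    return f"customer:{row['email'].strip().lower()}"

def rows_to_entities(rows):
    entities = {}
    for row in rows:
        eid = normalize_customer_id(row)
        # Last write wins here; a real pipeline would merge properties.
        entities[eid] = {"type": "Customer", "name": row.get("name")}
    return entities

rows = [
    {"crm_id": "42", "email": "ops@acme.com", "name": "Acme"},
    {"crm_id": None, "email": " Ops@Acme.com ", "name": "Acme (billing)"},
]
entities = rows_to_entities(rows)
```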

&lt;h3&gt;
  
  
  Step 4: Develop Relationship Extraction
&lt;/h3&gt;

&lt;p&gt;Beyond identifying entities, determine how they relate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based:&lt;/strong&gt; Define explicit patterns (if X mentions Y in context Z, create relationship R)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML-based:&lt;/strong&gt; Train models to identify relationship types from text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-based:&lt;/strong&gt; Use large language models for sophisticated relationship inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human validation:&lt;/strong&gt; Review critical relationship paths&lt;/li&gt;
&lt;/ul&gt;
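
&lt;p&gt;The rule-based approach can be sketched with ordinary regular expressions. The patterns and sample text below are invented for illustration; real pipelines would use richer patterns or combine this with the ML- and LLM-based approaches listed above.&lt;/p&gt;

```python
import re

# Rule-based sketch of the "if X mentions Y in context Z, create
# relationship R" pattern. Patterns and sample text are illustrative.

RULES = [
    # (regex with two capture groups, relationship type to create)
    (re.compile(r"(\w+) was resolved by (\w+)"), "resolved_with"),
    (re.compile(r"(\w+) escalated to (\w+)"), "escalated_to"),
]

def extract_relationships(text):
    found = []
    for pattern, rel in RULES:
        for src, dst in pattern.findall(text):
            found.append((src, rel, dst))
    return found

rels = extract_relationships("TICKET1 was resolved by KB7. TICKET1 escalated to Alice")
```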

&lt;h3&gt;
  
  
  Step 5: Enable Real-Time Updates
&lt;/h3&gt;

&lt;p&gt;Context Graphs are living systems requiring continuous updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement event-driven architecture for data changes&lt;/li&gt;
&lt;li&gt;Design incremental update patterns (don't rebuild everything)&lt;/li&gt;
&lt;li&gt;Maintain data lineage for troubleshooting&lt;/li&gt;
&lt;li&gt;Build conflict resolution for concurrent updates&lt;/li&gt;
&lt;/ul&gt;
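
&lt;p&gt;An incremental, event-driven update loop can be sketched as follows. The event shape and system names are assumptions; the idea is that each change event touches only the affected edges rather than rebuilding the graph, while lineage records where every edge came from.&lt;/p&gt;

```python
# Sketch of incremental, event-driven graph updates with lineage.
# Event fields and source-system names are assumptions for the example.

graph = {"edges": set(), "lineage": {}}

def apply_event(event):
    edge = (event["src"], event["rel"], event["dst"])
    if event["op"] == "upsert":
        graph["edges"].add(edge)
        graph["lineage"][edge] = event["source_system"]  # provenance
    elif event["op"] == "delete":
        graph["edges"].discard(edge)
        graph["lineage"].pop(edge, None)

# A small stream of change events: an edge is created, then retracted.
for ev in [
    {"op": "upsert", "src": "t1", "rel": "reported_by", "dst": "c1",
     "source_system": "ticketing"},
    {"op": "delete", "src": "t1", "rel": "reported_by", "dst": "c1",
     "source_system": "ticketing"},
]:
    apply_event(ev)
```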

&lt;h3&gt;
  
  
  Step 6: Optimize Query Performance
&lt;/h3&gt;

&lt;p&gt;Keep multi-hop queries responsive at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index critical properties used in traversals&lt;/li&gt;
&lt;li&gt;Cache frequent query patterns&lt;/li&gt;
&lt;li&gt;Limit traversal depth for expensive queries&lt;/li&gt;
&lt;li&gt;Denormalize selectively for performance-critical paths&lt;/li&gt;
&lt;li&gt;Use query profiling to identify bottlenecks&lt;/li&gt;
&lt;/ul&gt;
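
&lt;p&gt;Limiting traversal depth is the simplest of these controls to illustrate. The sketch below uses a plain adjacency dictionary rather than a real graph database, and caps a breadth-first traversal at a fixed number of hops so one expensive query cannot walk the whole graph.&lt;/p&gt;

```python
from collections import deque

# Depth-limited traversal sketch over an adjacency dict. The dict
# representation is illustrative, not a particular database's API.

def neighbors_within(adj, start, max_hops):
    seen = {start: 0}           # node -> hop distance from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue            # stop expanding at the hop limit
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    seen.pop(start)
    return seen                 # start node excluded from the result

adj = {"a": ["b"], "b": ["c"], "c": ["d"]}
```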

&lt;h3&gt;
  
  
  Step 7: Integrate Graph Analytics
&lt;/h3&gt;

&lt;p&gt;Enhance your Context Graph with advanced algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PageRank:&lt;/strong&gt; Identify influential nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Detection:&lt;/strong&gt; Find clusters of related entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path Finding:&lt;/strong&gt; Discover optimal routes through relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Embeddings:&lt;/strong&gt; Enable similarity calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link Prediction:&lt;/strong&gt; Suggest missing relationships&lt;/li&gt;
&lt;/ul&gt;
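
&lt;p&gt;To make the PageRank idea concrete, here is a tiny power-iteration sketch over an adjacency dictionary. In practice you would use the graph database's built-in algorithms; this toy version just shows why nodes that many others point to come out as "influential".&lt;/p&gt;

```python
# Tiny power-iteration PageRank over an adjacency dict, to illustrate
# "identify influential nodes". Production systems would use the graph
# database's built-in implementation instead of this sketch.

def pagerank(adj, damping=0.85, iterations=50):
    nodes = set(adj) | {n for targets in adj.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in adj.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        # Rank of dangling nodes (no outgoing edges) is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not adj.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

# Two nodes both point at "c", so "c" should rank highest.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": []})
```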

&lt;h2&gt;
  
  
  Implementation Challenges &amp;amp; Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;Practical Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graph Construction Complexity&lt;/td&gt;
&lt;td&gt;Building comprehensive graphs requires sophisticated entity and relationship extraction from unstructured data&lt;/td&gt;
&lt;td&gt;Start with a focused domain where you have high-quality structured data. Expand gradually as you build extraction capabilities.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Design Expertise&lt;/td&gt;
&lt;td&gt;Effective schemas demand deep domain understanding; poor design leads to unusable graphs&lt;/td&gt;
&lt;td&gt;Run workshops with subject matter experts. Build iteratively: start simple, refine based on actual query patterns.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance at Scale&lt;/td&gt;
&lt;td&gt;Graph traversals become expensive for complex multi-hop queries as data grows&lt;/td&gt;
&lt;td&gt;Invest in proper indexing, implement query optimization, use caching strategically, and set traversal depth limits (2-4 hops).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity Resolution&lt;/td&gt;
&lt;td&gt;Identifying that different mentions refer to the same entity is difficult but critical for accuracy&lt;/td&gt;
&lt;td&gt;Implement fuzzy matching, leverage unique identifiers where available, use ML-based entity resolution tools, maintain a golden record system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality Maintenance&lt;/td&gt;
&lt;td&gt;As graphs grow to millions of relationships, maintaining accuracy becomes challenging&lt;/td&gt;
&lt;td&gt;Implement automated validation rules, schedule periodic audits, track data lineage, enable user feedback loops for corrections.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration Complexity&lt;/td&gt;
&lt;td&gt;Incorporating Context Graphs into existing systems requires architectural changes and API design&lt;/td&gt;
&lt;td&gt;Build a graph API layer that existing systems can call. Start with read-only integration, add write capabilities once proven.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill Gap&lt;/td&gt;
&lt;td&gt;Shortage of professionals experienced in graph technologies and query languages like Cypher&lt;/td&gt;
&lt;td&gt;Train existing team members (graph databases are learnable, similar to SQL), hire contractors for initial setup, or partner with CloudRaft for implementation guidance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Management&lt;/td&gt;
&lt;td&gt;Context Graphs add infrastructure costs for databases, extraction pipelines, and real-time analytics&lt;/td&gt;
&lt;td&gt;Start with a high-value use case to demonstrate ROI. Scale infrastructure based on actual usage patterns. Monitor cost per query and optimize expensive operations.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Context Graph Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model relationships that drive decisions:&lt;/strong&gt; Don't create relationships just because you can. Focus on connections that enable valuable reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep entity types focused:&lt;/strong&gt; Avoid creating overly granular entity types. Each entity type should represent a meaningful concept in your domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make relationships meaningful:&lt;/strong&gt; Generic relationships like "related_to" provide little value. Use specific relationship types: "depends_on," "caused_by," "replaces."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance normalization and performance:&lt;/strong&gt; Highly normalized graphs are elegant but can be slow. Denormalize strategically for frequently traversed paths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version your schema:&lt;/strong&gt; Graph schemas evolve. Maintain version history and migration paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limit traversal depth:&lt;/strong&gt; Set maximum hops to prevent runaway queries. Most valuable relationships are within 2-4 hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter early:&lt;/strong&gt; Apply constraints as early as possible in your traversal to reduce the working set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use indexed properties:&lt;/strong&gt; Index properties you filter on frequently. This dramatically improves query performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache common patterns:&lt;/strong&gt; Identify frequently executed query patterns and cache results with appropriate TTLs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
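
&lt;p&gt;Caching common patterns is straightforward to sketch. The TTL cache below is illustrative (a real deployment would likely use Redis or the database's own query cache); the clock is injectable only so the expiry behaviour is easy to test.&lt;/p&gt;

```python
import time

# Minimal TTL cache sketch for frequent query patterns. Keys and
# queries are illustrative; the clock is injectable for testability.

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]            # fresh cache hit
        value = compute()            # miss or expired: recompute
        self.store[key] = (now + self.ttl, value)
        return value
```

&lt;p&gt;Pairing a cache like this with per-pattern TTLs lets you keep hot multi-hop results fast without serving stale data for long.&lt;/p&gt;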

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement validation rules:&lt;/strong&gt; Define constraints on entity properties and relationship validity to maintain quality automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track provenance:&lt;/strong&gt; Know where each entity and relationship came from. This enables troubleshooting and quality assessment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable feedback loops:&lt;/strong&gt; Allow users to report incorrect relationships. Use this feedback to improve extraction pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schedule audits:&lt;/strong&gt; Periodically review graph quality, especially for critical relationship paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Context Graphs + LLMs: A Powerful Combination
&lt;/h2&gt;

&lt;p&gt;Context Graphs and Large Language Models (LLMs) complement each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-Augmented Generation (GAG):&lt;/strong&gt; Retrieve relevant subgraphs from your Context Graph and provide them as structured context to LLMs. This reduces hallucinations and grounds responses in your actual knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-Assisted Graph Construction:&lt;/strong&gt; Use LLMs to extract entities and relationships from unstructured text, building your Context Graph more quickly than rule-based approaches alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainable LLM Reasoning:&lt;/strong&gt; When LLMs generate responses based on graph context, you can trace exactly which relationships influenced the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Retrieval:&lt;/strong&gt; Combine vector search (for semantic similarity) with graph traversal (for relationship reasoning) to get the best of both approaches.&lt;/p&gt;
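
&lt;p&gt;The hybrid pattern can be sketched in a few lines. The vectors, document names, and toy graph below are invented for illustration; a real system would use a vector database and a graph store rather than in-memory dictionaries.&lt;/p&gt;

```python
import math

# Hybrid retrieval sketch: rank candidates by vector similarity, then
# expand the top hit's graph neighborhood for relationship context.
# All vectors and names are made up for the example.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query_vec, doc_vecs, adj):
    # 1) Semantic step: best match by embedding similarity.
    best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
    # 2) Relational step: pull directly connected entities for context.
    context = set(adj.get(best, []))
    return best, context

doc_vecs = {"doc_login_bug": [1.0, 0.1], "doc_billing": [0.0, 1.0]}
adj = {"doc_login_bug": ["Product:SSO", "Resolution:patch-1.2"]}
best, context = hybrid_retrieve([0.9, 0.2], doc_vecs, adj)
```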

&lt;h2&gt;
  
  
  Measuring Context Graph Success
&lt;/h2&gt;

&lt;p&gt;Track these metrics to assess your Context Graph implementation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response time:&lt;/strong&gt; Median and 95th percentile query latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Queries per second at peak usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate:&lt;/strong&gt; Percentage of queries served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity accuracy:&lt;/strong&gt; Percentage of correctly identified entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationship precision:&lt;/strong&gt; Percentage of relationships that are actually valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage:&lt;/strong&gt; Percentage of domain knowledge captured in the graph&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time saved:&lt;/strong&gt; Reduction in research/discovery time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy improvement:&lt;/strong&gt; Better decision quality from enhanced reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reduction:&lt;/strong&gt; Decreased manual effort for knowledge work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User satisfaction:&lt;/strong&gt; NPS or satisfaction scores for graph-powered features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate:&lt;/strong&gt; Reduction in factually incorrect AI outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning accuracy:&lt;/strong&gt; Percentage of multi-hop questions answered correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; Percentage of AI decisions with traceable reasoning paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future of Context Graphs
&lt;/h2&gt;

&lt;p&gt;Context Graphs are evolving rapidly:&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Trends
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph + Vector Hybrid Systems:&lt;/strong&gt; Combining semantic vector search with graph reasoning for more sophisticated AI systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Schema Evolution:&lt;/strong&gt; ML systems that automatically suggest new entity types and relationships based on usage patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Graph Analytics:&lt;/strong&gt; Stream processing for graph updates and real-time pattern detection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Modal Graphs:&lt;/strong&gt; Incorporating images, audio, and video as first-class entities with rich relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Graphs:&lt;/strong&gt; Connecting knowledge graphs across organizational boundaries while maintaining privacy and security.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started with Context Graphs
&lt;/h2&gt;

&lt;p&gt;Ready to implement Context Graphs in your AI systems?&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Small, Think Big
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Identify a high-value use case where relationship reasoning matters&lt;/li&gt;
&lt;li&gt;Map your initial schema with domain experts (10-20 entity types is plenty to start)&lt;/li&gt;
&lt;li&gt;Build a proof of concept with a subset of your data&lt;/li&gt;
&lt;li&gt;Measure impact against your baseline approach&lt;/li&gt;
&lt;li&gt;Iterate and expand based on what you learn&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Common Starting Points
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support:&lt;/strong&gt; Connect tickets, customers, products, and resolutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal knowledge:&lt;/strong&gt; Link documents, projects, people, and decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Map regulations, policies, processes, and controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product development:&lt;/strong&gt; Connect features, dependencies, bugs, and releases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Context Graphs represent a fundamental shift in how AI systems understand and reason about information. By capturing not just data, but the rich network of relationships that gives data meaning, they unlock AI capabilities that were previously unattainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More accurate reasoning through multi-hop traversal&lt;/li&gt;
&lt;li&gt;Explainable decisions via traceable relationship paths&lt;/li&gt;
&lt;li&gt;Reduced hallucinations by grounding in verifiable connections&lt;/li&gt;
&lt;li&gt;Scalable knowledge management without rigid schema constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As AI becomes increasingly central to enterprise operations, Context Graphs will evolve from competitive advantage to foundational infrastructure. Organizations that build graph-based AI capabilities now will be well-positioned to lead in an AI-driven future.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt Context Graphs; it's when and where to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expert Help with Context Graph Implementation
&lt;/h2&gt;

&lt;p&gt;Building Context Graphs requires specialized expertise in graph databases, knowledge representation, and AI integration. CloudRaft provides complimentary AI consultations to help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assess feasibility for your specific use cases&lt;/li&gt;
&lt;li&gt;Design optimal schemas for your domain&lt;/li&gt;
&lt;li&gt;Architect scalable infrastructure that grows with your needs&lt;/li&gt;
&lt;li&gt;Integrate with existing AI systems seamlessly&lt;/li&gt;
&lt;li&gt;Train your team on graph technologies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between a Context Graph and a Knowledge Graph?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context Graphs are specialized knowledge graphs optimized for dynamic context assembly in AI systems. While knowledge graphs broadly represent domain knowledge, Context Graphs focus specifically on enabling AI reasoning through relationship traversal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use Context Graphs with vector databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolutely. Many advanced AI systems use both: vector databases for semantic similarity search and Context Graphs for relationship reasoning. This hybrid approach provides the best of both worlds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much data do I need to start?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can start small. Even a few thousand entities with well-modeled relationships can demonstrate value. Focus on quality relationships over quantity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the typical implementation timeline?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a focused proof of concept: 4-8 weeks. For production-ready implementation: 3-6 months. Timeline depends on data complexity, schema design, and integration requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need specialized graph database skills?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While helpful, they're not mandatory. Graph query languages like Cypher (Neo4j) are learnable, similar to SQL. Consider training existing team members or partnering with experts for initial setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do Context Graphs reduce AI hallucinations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By grounding AI responses in explicit, verifiable relationships rather than relying solely on probabilistic pattern matching from training data. The AI can only traverse relationships that actually exist in your graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the ROI of implementing Context Graphs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It varies by use case, but organizations typically see reductions in knowledge discovery time, improvements in AI reasoning accuracy, and less manual research effort. ROI is highest for knowledge-intensive workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Context Graphs work with my existing databases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Context Graphs complement existing databases. You can keep transactional data in relational databases and build Context Graphs for relationship reasoning, syncing data between systems.&lt;/p&gt;

</description>
      <category>contextgraph</category>
    </item>
    <item>
      <title>Real-Time Postgres to ClickHouse CDC: Supercharge Analytics with PeerDB</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 27 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</link>
      <guid>https://forem.com/cloudraft/real-time-postgres-to-clickhouse-cdc-supercharge-analytics-with-peerdb-29k1</guid>
      <description>&lt;p&gt;If you are running a heavy SaaS platform, you eventually hit a wall with PostgreSQL. It's fantastic for transactional data (OLTP), but when you try to run complex analytical queries on millions of rows, things slow down.&lt;/p&gt;

&lt;p&gt;We recently tackled this exact problem for a client handling high-volume messaging operations. Their analytics dashboards ran analytical queries directly against an AWS Aurora PostgreSQL setup, and they needed a solution that was fast, reliable, and real-time.&lt;/p&gt;

&lt;p&gt;Here is how we solved it by building a high-performance replication pipeline from Postgres to ClickHouse using PeerDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764248123%2Fblogs%2Fpeerdb%2Fanalytics_rskyis.avif" alt="Analytics" width="1918" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ClickHouse?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is the superior choice for analytics because it is a purpose-built OLAP database designed for high-performance data processing, unlike PostgreSQL, which is a row-based OLTP system better suited for transactional workloads. Its columnar storage architecture allows it to handle massive datasets with sub-second query latency, where standard Postgres deployments often hit performance walls. By switching to ClickHouse, you gain the ability to ingest millions of rows and execute complex analytical queries almost instantly, solving the performance limitations inherent in using PostgreSQL for analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CDC Landscape: Why We Chose PeerDB
&lt;/h2&gt;

&lt;p&gt;Real-time Change Data Capture (CDC) is the standard for moving data without slowing down your primary database. But how do you implement it? Here are the primary CDC options for replicating data from PostgreSQL to ClickHouse that we considered for our implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PeerDB
&lt;/h3&gt;

&lt;p&gt;PeerDB is a specialised tool designed specifically for PostgreSQL to ClickHouse replication. It is the solution we ultimately chose for this project due to its balance of performance and simplicity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It can run as a Docker container stack (PeerDB Server, UI, etc.) and connects directly to the Postgres logical replication slot.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Performance:&lt;/strong&gt; PeerDB was &lt;a href="https://docs.peerdb.io/why-peerdb" rel="noopener noreferrer"&gt;found&lt;/a&gt; to be significantly more performant than the other solutions we evaluated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialised Features:&lt;/strong&gt; It handles initial snapshots (bulk loads) and real-time streaming (CDC) seamlessly. It also supports specific optimisations, such as dividing tables into multiple "mirrors" to speed up initial loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; It avoids the complexity of managing a full Kafka cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Edition Limits:&lt;/strong&gt; The community edition lacks built-in UI authentication, so you need private network access, a VPN, or an external authentication layer in front of the UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Altinity Sink Connector for ClickHouse
&lt;/h3&gt;

&lt;p&gt;This is a lightweight, single-executable solution often used to avoid the complexity of Kafka. It is developed by Altinity, a major ClickHouse contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; It runs as a standalone binary or within a Kafka Connect environment. It connects to Postgres and replicates data to ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Simplicity:&lt;/strong&gt; It eliminates the need for a Kafka Connect cluster or ZooKeeper, running as a single executable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Replication:&lt;/strong&gt; Offers a direct path from Postgres to ClickHouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Schema:&lt;/strong&gt; Can automatically read the Postgres schema and create equivalent ClickHouse tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; We tested this option but rejected it because it did not meet our performance requirements compared to PeerDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Debezium and Kafka
&lt;/h3&gt;

&lt;p&gt;This is the industry-standard approach for general-purpose CDC, involving a chain of distinct complex components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; Postgres → Debezium (Kafka Connect) → Kafka Broker → ClickHouse Sink → ClickHouse.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; The message broker (Kafka) decouples the source from the destination, allowing multiple consumers to read the same stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Extremely robust for guaranteed message delivery and exactly-once processing (if configured correctly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Complexity:&lt;/strong&gt; Requires managing ZooKeeper, Kafka brokers, and schema registries. Avoiding this Kafka Connect framework complexity was an explicit goal of our design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; Significant infrastructure footprint compared to direct replication tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why PeerDB?
&lt;/h3&gt;

&lt;p&gt;We initially tested the Altinity connector but ultimately chose PeerDB, mainly for the following reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; In our testing, PeerDB offered superior performance for our specific workload compared to other connectors we tried.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialisation:&lt;/strong&gt; It is purpose-built for Postgres-to-ClickHouse replication, handling data type mapping and initial snapshots smoothly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We opted for a "Keep It Simple" approach to infrastructure. While Kubernetes (EKS) is great, we deployed this on Amazon EC2 to maintain full control over the infrastructure and cost. If you have a team that can handle EKS for you, then that might be a better option. Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;discuss with our team&lt;/a&gt; to find the right solutions for your workload and team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; AWS Aurora (PostgreSQL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Pipeline:&lt;/strong&gt; PeerDB running via Docker Compose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; A ClickHouse cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability Design
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fdfee67kdq%2Fimage%2Fupload%2Fv1764247641%2Fblogs%2Fpeerdb%2Fclickhouse-architecture_jtgy7m.avif" alt="High Availability Design" width="1920" height="1080"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Altinity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To ensure we never lost data, we configured a ClickHouse cluster with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 Keeper Nodes:&lt;/strong&gt; Using m6i.large instances. These replace ZooKeeper for coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 ClickHouse Server Nodes:&lt;/strong&gt; Using r6i.2xlarge instances for heavy lifting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication:&lt;/strong&gt; We used ReplicatedMergeTree to ensure data exists on multiple nodes for safety&lt;/li&gt;
&lt;/ul&gt;
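
&lt;p&gt;For reference, replicated tables on such a cluster use the &lt;code&gt;ReplicatedMergeTree&lt;/code&gt; engine, with the Keeper nodes coordinating the replicas. The table, cluster, and column names below are illustrative, not taken from the actual deployment:&lt;/p&gt;

```sql
-- Illustrative DDL: each server keeps a replica of the table,
-- coordinated through the Keeper ensemble. Names are placeholders.
CREATE TABLE events ON CLUSTER 'main'
(
    event_id   UInt64,
    created_at DateTime,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (created_at, event_id);
```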

&lt;h3&gt;
  
  
  ClickHouse Cluster
&lt;/h3&gt;

&lt;p&gt;We automated the deployment using Ansible to configure the hardware-aware settings. A cool feature of our setup is that the configuration automatically calculates memory limits and cache sizes based on the EC2 instance's RAM (e.g., leaving 25% for the OS and giving 75% to ClickHouse). We wrote about this earlier in our &lt;a href="https://www.cloudraft.io/blog/building-enterprise-grade-clickhouse-with-ansible" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing PeerDB
&lt;/h3&gt;

&lt;p&gt;We used Docker Compose to spin up the PeerDB stack. One specific nuance we encountered was configuring the storage abstraction. PeerDB uses MinIO (S3 compatible) for intermediate storage. We had to explicitly set the &lt;code&gt;PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3&lt;/code&gt; environment variable in our &lt;code&gt;docker-compose.yml&lt;/code&gt; to point to our MinIO host IP.&lt;/p&gt;
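
&lt;p&gt;As an illustration, the override looked roughly like this. The service name and the MinIO address are placeholders and may differ from the stock PeerDB compose file; only the environment variable name is the one mentioned above:&lt;/p&gt;

```yaml
# Illustrative docker-compose override (service names and the MinIO
# address are placeholders): point PeerDB's S3 abstraction at MinIO.
services:
  minio:
    image: minio/minio
    command: server /data
  flow-worker:
    environment:
      PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3: "http://10.0.1.15:9000"
```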

&lt;p&gt;With the stack running, set up the peers that connect to the source and the destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the "Mirror"
&lt;/h3&gt;

&lt;p&gt;PeerDB uses a concept called &lt;strong&gt;Mirrors&lt;/strong&gt; to handle the CDC pipeline. We set up the connection by defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Source):&lt;/strong&gt; Our Aurora Postgres instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Peer (Destination):&lt;/strong&gt; Our ClickHouse cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mirror:&lt;/strong&gt; The actual replication job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PeerDB supports different modes of streaming: log-based (CDC), cursor-based (timestamp or integer), and XMIN-based. In our implementation, we used log-based (CDC) replication.&lt;/p&gt;

&lt;p&gt;To optimise the initial data load, we didn't just dump everything at once. We divided our tables into multiple "batches" (mirrors) that ran in parallel, staggering their start times so we would not put a high load on the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Gotchas" From the Trenches
&lt;/h2&gt;

&lt;p&gt;No migration is perfect. Here are three issues we faced so you can avoid them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The "Too Many Parts" Error in ClickHouse&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ClickHouse loves big batches of data. If PeerDB syncs records one by one or in tiny groups too quickly, ClickHouse can't merge the data parts fast enough in the background. We saw errors like &lt;code&gt;Too many parts... Merges are processing significantly slower than inserts&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You may need to tune the batch size or frequency to slow down the inserts slightly, allowing ClickHouse's merge process to catch up.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Aurora Failovers Break Things&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If AWS Aurora triggers a failover, the IP/DNS resolution might shift. We found that this can break the peering connection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; You have to edit the peer configuration to point to the new primary host and resync the mirror.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Security on Community Edition&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We used the community edition of PeerDB. Be aware that it does not have built-in authentication for the UI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fix:&lt;/em&gt; Do not expose the UI to the public internet. We access it over a private IP/VPN, or add an authentication layer in front using a third-party product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;By successfully moving analytical queries off the primary Postgres instance and into ClickHouse, we achieved the sub-second query performance our client required. PeerDB provided us with a robust, real-time CDC solution without the operational headache of managing a Kafka cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways on the Postgres + ClickHouse + PeerDB Combination:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; You get the best of both worlds: PostgreSQL handles fast, reliable transactional (OLTP) workloads, while ClickHouse takes on complex analytical (OLAP) queries with unmatched speed. This separation prevents slow analytical queries from impacting your core application database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Simplicity:&lt;/strong&gt; PeerDB acts as a purpose-built, high-performance bridge. It removes the need to deploy and manage a complex, multi-component CDC stack like Debezium and Kafka, significantly reducing infrastructure complexity and operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; This architecture allows your analytics layer (ClickHouse) to scale independently from your transactional layer (Postgres), ensuring that as your data volumes grow, you maintain both OLTP stability and OLAP speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; By offloading analytical processing, you can often run a smaller, more cost-effective Postgres instance dedicated to its core function, while leveraging ClickHouse's efficiency for massive-scale querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Are you looking to improve your analytics pipeline? Please &lt;a href="https://www.cloudraft.io/contact-us" rel="noopener noreferrer"&gt;book a call&lt;/a&gt; with us to discuss your case.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>analytics</category>
      <category>clickhouse</category>
      <category>peerdb</category>
    </item>
    <item>
      <title>Why high performance storage is important for AI Cloud Build</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</link>
      <guid>https://forem.com/cloudraft/why-high-performance-storage-is-important-for-ai-cloud-build-3ok2</guid>
<description>&lt;p&gt;The AI cloud market is experiencing exceptionally rapid growth worldwide, with the latest reports projecting annual growth rates between 28% and 40% over the next five years. According to various analyst reports, the market may reach $647 billion by 2030. The surge in AI Cloud adoption, GPU-as-a-service platforms, and enterprise interest in AI “factories” has created new pressures and opportunities for product engineering and IT leaders. Regardless of which public cloud or private cluster you choose, one key differentiator sets each AI and HPC solution apart: the &lt;strong&gt;performance of storage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While leading clouds often use the same GPUs and servers, the way data flows—between compute, network, storage, and persistent layers—determines everything from training speed to scalability. Understanding storage fundamentals will help you architect or select the right solution. We have previously covered &lt;a href="https://www.cloudraft.io/blog/how-to-build-ai-cloud" rel="noopener noreferrer"&gt;how to build an AI cloud&lt;/a&gt;, and with hands-on experience in this space, we share our perspective on storage in this article.&lt;/p&gt;

&lt;p&gt;Business and technology leaders now recognize that real-world AI breakthroughs require infrastructure with high bandwidth, low latency, and extreme parallelism. As deep learning and data-intensive analytics move from labs to production, GPU clusters run ever-larger models on ever-growing datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Storage Matter in AI Workloads?
&lt;/h2&gt;

&lt;p&gt;Storage plays an important role across the entire AI lifecycle. Let’s look at the three major stages: data preparation, training &amp;amp; tuning, and inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Scalable and performant storage to support transforming data for AI use&lt;/li&gt;
&lt;li&gt;Protecting valuable raw and derived training data sets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing large structured and unstructured datasets in many formats&lt;/li&gt;
&lt;li&gt;Scaling under the pressure of map-reduce like distributed processing often used for transforming data for AI&lt;/li&gt;
&lt;li&gt;Support for file and object access protocols to ease integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training &amp;amp; Tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Providing training data to keep expensive GPUs fully utilized&lt;/li&gt;
&lt;li&gt;Saving and restoring model checkpoints to protect training investments&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep training GPU resources busy&lt;/li&gt;
&lt;li&gt;Minimizing time to save checkpoint data to limit training pauses&lt;/li&gt;
&lt;li&gt;Scaling to meet demands of data parallel training in large clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Key Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Safely storing and quickly delivering model artifacts for inference services&lt;/li&gt;
&lt;li&gt;Providing data for batch inferencing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Critical Capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reliably storing expensive to produce model artifact data&lt;/li&gt;
&lt;li&gt;Minimizing model artifact read latency for quick inference deployment&lt;/li&gt;
&lt;li&gt;Sustaining read bandwidths necessary to keep inference GPU resources busy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High-Performance Storage Is Critical for Checkpointing in AI Training
&lt;/h3&gt;

&lt;p&gt;Checkpointing is a critical process in large-scale AI training, enabling models to periodically save and restore their state as training progresses. As model and dataset sizes expand into the billions of parameters and petabytes of data, this operation becomes increasingly demanding for storage infrastructure. Efficient checkpointing helps safeguard training progress against inevitable hardware failures and disruptions, while also allowing for fine-tuning, experimentation, and rapid recovery. However, frequent checkpointing can introduce performance overhead due to pauses in computation and intensive reads/writes to persistent storage, especially when distributed clusters grow to thousands of accelerators.&lt;/p&gt;

&lt;p&gt;To address these challenges, modern AI storage architecture leverages strategies such as asynchronous checkpointing—where checkpoints are saved in the background, minimizing idle time—and hierarchical distribution, reducing bottlenecks by having leader nodes manage data transfers within clusters. The result is faster training throughput, lower risk of lost work, and more efficient use of compute resources. Optimizing for checkpoint size, frequency, and concurrent access patterns is vital to ensure high throughput and low latency, making high-performance scalable storage systems an indispensable foundation for reliable, cost-effective AI model training at scale. You can read more about it in this &lt;a href="https://aws.amazon.com/blogs/storage/architecting-scalable-checkpoint-storage-for-large-scale-ml-training-on-aws/" rel="noopener noreferrer"&gt;AWS article&lt;/a&gt;.&lt;/p&gt;
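&lt;p&gt;The asynchronous-checkpointing idea above can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: real frameworks shard the state across ranks and stream it, but the core trick is the same, a fast in-memory snapshot followed by a background write:&lt;/p&gt;

```python
import copy
import threading

def async_checkpoint(state: dict, write_fn):
    """Snapshot the model state in memory (a brief pause), then persist it
    on a background thread so the training loop is not blocked by storage I/O."""
    snapshot = copy.deepcopy(state)               # only the copy blocks training
    t = threading.Thread(target=write_fn, args=(snapshot,))
    t.start()                                     # slow storage write runs aside
    return t                                      # join() before the next checkpoint
```

&lt;p&gt;Training can mutate &lt;code&gt;state&lt;/code&gt; immediately after the call returns; the snapshot taken before the thread started is what gets written, so the checkpoint stays consistent.&lt;/p&gt;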

&lt;h2&gt;
  
  
  What Kind of Storage Is Needed for AI and HPC Workloads?
&lt;/h2&gt;

&lt;p&gt;For AI and HPC workloads, the demands extend well beyond ordinary enterprise storage. Key requirements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel File Systems:&lt;/strong&gt; Multiple servers and GPUs need to access datasets at the same time. Systems such as Lustre, WEKA, VAST Data, CephFS, and DDN Infinia enable concurrent access, avoiding bottlenecks and improving throughput for distributed workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Throughput and Low Latency:&lt;/strong&gt; Training GPT-like models or running simulations generates millions of read/write operations per second. Storage must deliver bandwidth in the tens to hundreds of GB/s and latency below 1ms, so that GPUs remain fed and productive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSIX Compliance:&lt;/strong&gt; Many AI frameworks and HPC applications expect a traditional POSIX interface for seamless operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability and Elasticity:&lt;/strong&gt; Petabyte-scale capacity is the norm. Modern solutions allow you to scale horizontally, adding performance and capacity as demand grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity and Reliability:&lt;/strong&gt; Enterprise-grade AI and HPC workloads need uninterrupted access to their data. Redundancy, fault tolerance, and robust disaster recovery features matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Typical Storage Specifications and Requirements
&lt;/h2&gt;

&lt;p&gt;For a modern AI Cloud, AI factory, or GPU Cloud infrastructure, expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth:&lt;/strong&gt; 15–512 GB/s (or higher for top-tier solutions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOPS:&lt;/strong&gt; From 20,000 (entry) up to 800,000+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Sub-1ms to 2ms for parallel file systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity:&lt;/strong&gt; 100TB to multi-petabyte scale, often with tiering to object storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocols:&lt;/strong&gt; NFSv3/v4.1, SMB, Lustre, S3 (for hybrid and archival storage), HDFS, and native REST APIs&lt;/li&gt;
&lt;/ul&gt;
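&lt;p&gt;A back-of-the-envelope calculation shows why these bandwidth figures matter for checkpointing. The 1 TB checkpoint size below is illustrative, and the two bandwidth values are taken from the ends of the range above:&lt;/p&gt;

```python
def checkpoint_stall_s(checkpoint_gb: float, write_gbps: float) -> float:
    """Seconds of synchronous stall to persist one checkpoint at a given
    sustained write bandwidth (decimal GB and GB/s)."""
    return checkpoint_gb / write_gbps

# Illustrative: a 1 TB checkpoint at 15 GB/s vs 200 GB/s sustained writes.
slow = checkpoint_stall_s(1000, 15)    # ~66.7 s of stall per checkpoint
fast = checkpoint_stall_s(1000, 200)   # 5 s of stall per checkpoint
```

&lt;p&gt;At frequent checkpoint intervals, that difference compounds into hours of idle GPU time per day, which is why sequential write bandwidth is the headline number for training storage.&lt;/p&gt;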

&lt;p&gt;On-premises or hybrid deployments may include NVMe storage, CXL-enabled expansion, and advanced cooling to support high-density GPU clusters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;AI Lifecycle Stage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Considerations&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reading Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Accommodate wide range of read BW requirements and IO access patterns across different AI models &lt;br&gt; - Deliver large amounts of read BW to single GPU servers for most demanding models&lt;/td&gt;
&lt;td&gt;- Use high performance, all-flash storage to meet needs &lt;br&gt; - Leverage RDMA capable storage protocols, when possible, for most demanding requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saving Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential write bandwidth for quickly saving checkpoints&lt;br&gt; - Handle multiple large sequential write streams to separate files, especially in same directory&lt;/td&gt;
&lt;td&gt;- Understand checkpoint implementation details and behaviors for expected AI workloads&lt;br&gt; - Determine time limits for completing checkpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Restoring Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Provide large sequential read bandwidth for quickly restoring checkpoints &lt;br&gt; - Handle multiple large sequential read streams to same checkpoint file&lt;/td&gt;
&lt;td&gt;- Understand how often checkpoint restoration will be required &lt;br&gt; - Determine acceptable time limits for restoration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Servicing GPU Clusters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Meet performance requirements for mixed storage workloads from multiple simultaneous AI jobs &lt;br&gt; - Scale capacity and performance as GPU clusters grow with business needs&lt;/td&gt;
&lt;td&gt;- Consider scale-out storage platforms that can increase performance and capacity while providing shared access to data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: snia.org - John Cardente Talk&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Options for AI Cloud and HPC Workloads
&lt;/h2&gt;

&lt;p&gt;To achieve next-generation AI and HPC results, enterprises and product teams should evaluate both commercial vendors and open source platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Source Parallel File Systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (CephFS):&lt;/strong&gt; Highly flexible, POSIX-compliant, scales from small clusters to exabytes. Used in academic and commercial AI labs for robust file and object storage. Many early stage AI factories are using solutions built on top of Ceph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lustre / DDN Lustre:&lt;/strong&gt; Optimized for large-scale HPC and AI workloads. Used in many supercomputing and enterprise environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBM Spectrum Scale (GPFS):&lt;/strong&gt; High-performing parallel file system, widely used in science and industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Commercial AI and HPC Storage Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Delivers extreme performance for AI storage, marrying parallel file system performance with the economics of NAS and archive. VAST has been very popular and has been adopted by leading AI Cloud players such as &lt;a href="https://www.vastdata.com/customers/coreweave" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt; and Lambda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Highly optimized metadata and file access for AI and multi-tenant clusters; helps overcome bottlenecks experienced in legacy systems. Similar to VAST, WEKA has customers such as Yotta, Cohere, and &lt;a href="http://Together.ai" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Industry leader for research, hybrid file-object storage, and scalable data intelligence for model training and analytics. DDN’s solutions, like Infinia and xFusionAI, focus on both performance and efficiency for GPU workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure Storage, Cloudian, IBM, Dell:&lt;/strong&gt; Also recognized for delivering enterprise-grade AI/HPC storage platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many solutions integrate natively with popular public clouds (AWS S3, Google Cloud Storage, Azure Blob)—enabling hybrid architectures and seamless data movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Examples and Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ceph (Open Source):&lt;/strong&gt; Used by research labs and private cloud teams to build petabyte-scale, resilient storage for AI and HPC clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WEKA:&lt;/strong&gt; Enterprise deployments often leverage WEKA for AI factories—a system with hundreds of GPUs running concurrent training jobs—thanks to its elastic scaling and metadata performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAST Data:&lt;/strong&gt; Designed to deliver high throughput for both small and large file operations, increasingly chosen for generative AI workloads and data-intensive analytics in fintech, healthcare, and media.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDN:&lt;/strong&gt; Supports hybrid deployment strategies; offers both parallel file system and object storage in a unified stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems such as Lustre and Spectrum Scale facilitate near-instant recovery, zero-data loss architectures, and compliance for regulated sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying the Best Storage for Your Needs
&lt;/h2&gt;

&lt;p&gt;Because every cloud environment is unique, the first step in selecting the right solution is to establish a baseline through hardware benchmarking. MLCommons' benchmarking tools can be run directly on your hardware to gather reliable performance data.&lt;/p&gt;

&lt;p&gt;The latest MLPerf Storage v2.0 &lt;a href="https://mlcommons.org/benchmarks/storage/" rel="noopener noreferrer"&gt;benchmark results&lt;/a&gt; from MLCommons highlight the increasingly critical role of storage performance in the scalability of AI training systems. With participation nearly doubling compared to the previous v1.0 round, the industry’s rapid innovation is evident—storage solutions now support around twice the number of accelerators as before. The new iteration includes checkpointing benchmarks, which address real-world scenarios faced by large AI clusters, where frequent hardware failures can disrupt training jobs. By simulating such events and evaluating storage recovery speeds, MLPerf Storage v2.0 offers valuable insights into how checkpointing helps ensure uninterrupted performance in sprawling datacenter environments.&lt;/p&gt;

&lt;p&gt;A broad spectrum of storage technologies took part in the benchmark—ranging from local storage, in-storage accelerators, to object stores—reflecting the diversity of approaches in AI infrastructure. Over 200 results were submitted by 26 organizations worldwide, many participating for the first time, which showcases the growing global momentum behind the MLPerf initiative. The benchmarking framework—open-source and rigorously peer-reviewed—provides unbiased, actionable data for system architects, datacenter managers, and software vendors. MLPerf Storage is a go-to resource for designing resilient, high-performance AI training systems in a rapidly evolving technology landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Building Your AI Cloud and HPC Strategy
&lt;/h2&gt;

&lt;p&gt;As the AI Cloud, GPU-as-a-service, and HPC landscape evolves, storage is no longer a background detail—it is the core differentiator for speed, scale, and future innovation. Vendor neutrality empowers you to architect best-of-breed systems, leveraging open-source foundations and integrating commercial solutions where they fit your needs. Every cloud or on-prem cluster will benefit from storage designed for AI and HPC, not just traditional workloads.&lt;/p&gt;

&lt;p&gt;Ready for the next step? If you want to explore options, benchmark solutions, or design an optimized AI/HPC cloud, &lt;a href="https://cal.com/cloudraft/consulting" rel="noopener noreferrer"&gt;book a meeting&lt;/a&gt; with the CloudRaft team. Our experts bring hands-on experience from enterprise projects, migration strategies, and multi-vendor deployments, helping you maximize both infrastructure and business outcomes. Read more about our &lt;a href="https://www.cloudraft.io/ai-cloud-consulting" rel="noopener noreferrer"&gt;offering&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aicloud</category>
      <category>ai</category>
      <category>infrastructure</category>
      <category>storage</category>
    </item>
    <item>
      <title>Expert Guide on Selecting Observability Products</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 13 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</link>
      <guid>https://forem.com/cloudraft/expert-guide-on-selecting-observability-products-l9d</guid>
      <description>&lt;h2&gt;
  
  
  A Guide to Selecting Observability Tools and Products
&lt;/h2&gt;

&lt;p&gt;In today's digital landscape, businesses are constantly striving to stay ahead of the curve. The ability to deliver exceptional customer experiences, maintain system reliability, and optimize performance has become a crucial differentiator. Enter observability – the linchpin of modern IT operations that empowers organizations to achieve operational excellence, drive cost-efficiency, and continuously enhance their services.&lt;/p&gt;

&lt;p&gt;The rise of cloud-native architectures has revolutionized the way applications are built and deployed. These modern systems leverage dynamic, virtualized infrastructure to provide unparalleled flexibility and automation. By enabling on-demand scaling and global accessibility, cloud-native approaches have become a catalyst for innovation and agility in the business world.&lt;/p&gt;

&lt;p&gt;However, this shift brings new challenges. Unlike traditional monolithic systems, cloud-native applications are composed of numerous microservices distributed across various teams, platforms, and geographic locations. This decentralized nature makes it increasingly complex to monitor and maintain system health effectively.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the essential characteristics of a robust observability solution and provide guidance on selecting the right tools to meet your organization's unique needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evolution in Observability Space
&lt;/h2&gt;

&lt;p&gt;The evolution of observability over the last two decades has been characterized by significant technological advancements and changing industry needs. Let's explore this journey in more detail:&lt;/p&gt;

&lt;p&gt;In the early 2000s, observability faced its first major challenge with the explosion of log data. Organizations struggled with a lack of comprehensive solutions for instrumenting, generating, collecting, and visualizing this information. This gap in the market led to the rise of Splunk, which quickly became a dominant player by offering robust log management capabilities. As the decade progressed, the rapid growth of internet-based services and distributed systems introduced new complexities. This shift necessitated more sophisticated Application Performance Management (APM) solutions, paving the way for industry leaders like DynaTrace, New Relic, and AppDynamics to emerge and address these evolving needs.&lt;/p&gt;

&lt;p&gt;The dawn of the 2010s brought about a paradigm shift with the advent of microservices architecture and cloud computing. These technologies dramatically increased the complexity of IT environments, creating a demand for observability solutions that prioritized developer experience. This wave saw the birth of innovative platforms such as DataDog, Grafana, Sentry, and &lt;a href="https://www.cloudraft.io/prometheus-consulting" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, each offering unique approaches to monitoring and visualizing system performance. As we moved into the latter half of the decade, the industry faced a new challenge: skyrocketing observability costs due to the massive ingestion of Metrics, Events, Logs, and Traces (MELT). While monitoring capabilities had greatly improved, debugging remained a largely manual and time-consuming process, especially in the face of increasingly complex Kubernetes and serverless architectures. Some products like Datadog, Grafana, SigNoz, &lt;a href="https://www.cloudraft.io/blog/cloudraft-kloudmate-partnership" rel="noopener noreferrer"&gt;KloudMate&lt;/a&gt;, Honeycomb, Kloudfuse, &lt;a href="https://www.cloudraft.io/thanos-support" rel="noopener noreferrer"&gt;Thanos&lt;/a&gt;, Coroot, and VictoriaMetrics tackled these new challenges head-on.&lt;/p&gt;

&lt;p&gt;The early to mid-2020s have ushered in a new era of observability, characterized by innovative approaches to data storage and analysis. Industry standards like OpenTelemetry have gained widespread adoption, and products are now aligning with this standard. To optimize costs, observability pipelines are being used to filter and route data to various backends, automatically handling high cardinality data that was often a pain point at scale. We've also seen the adoption of high-performance databases like &lt;a href="https://www.cloudraft.io/clickhouse-consulting" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt; for &lt;a href="https://clickhouse.com/use-cases/logging-and-metrics" rel="noopener noreferrer"&gt;monitoring purposes&lt;/a&gt;, often becoming the backend of choice for observability products. The emergence of eBPF technology has provided deep insights into system performance and inter-entity relationships. Due to the increased adoption of the Rust programming language for its high performance, some observability tools such as Vector and various agents have become lightweight and more efficient, allowing for further scalability. Products like Quickwit (&lt;a href="https://quickwit.io/blog/quickwit-binance-story" rel="noopener noreferrer"&gt;see how Binance is storing 100PB logs&lt;/a&gt;) have introduced cost-effective and scalable solutions for storing logs and metrics directly on object storage. Perhaps most significantly, we're witnessing the integration of artificial intelligence into observability tools, enabling causal analysis and faster problem resolution. This AI-driven approach is helping organizations quickly narrow down issues in their increasingly complex environments, marking a new frontier in the observability landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systems are getting Complex
&lt;/h2&gt;

&lt;p&gt;In the realm of modern, distributed systems, traditional monitoring approaches fall short. These conventional methods rely on predetermined failure scenarios, which prove inadequate when dealing with the intricate, interconnected nature of today's cloud-based architectures. The unpredictability of these complex systems demands a more sophisticated approach to observability.&lt;/p&gt;

&lt;p&gt;Enter the new generation of cloud monitoring tools. These advanced solutions are designed to navigate the labyrinth of distributed systems, drawing connections between seemingly disparate data points without the need for explicit configuration. Their power lies in their ability to uncover hidden issues and correlate information across various contexts, providing a holistic view of system health.&lt;/p&gt;

&lt;p&gt;Consider this scenario: a user reports an error in a mobile application. In a world of microservices, pinpointing the root cause can be like finding a needle in a haystack. However, with these cutting-edge monitoring tools, engineers can swiftly trace the issue back to its origin, even if it's buried deep within one of countless backend services. This capability not only accelerates root cause analysis but also significantly reduces mean time to resolution (MTTR).&lt;/p&gt;

&lt;p&gt;But the benefits don't stop at troubleshooting. These tools can play a crucial role in refining deployment strategies. By providing real-time feedback on new rollouts, they enable more sophisticated deployment techniques such as canary releases or blue-green deployments. This proactive approach allows for automatic rollbacks of problematic changes, mitigating potential issues before they impact end-users.&lt;/p&gt;

&lt;p&gt;As the cloud-native landscape continues to evolve, selecting the right monitoring stack becomes paramount. To maximize the benefits of modern observability, it's crucial to choose a solution that not only meets your current needs but also aligns with your future goals and the ever-changing demands of cloud-based architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Essential Features of Robust Observability Solutions
&lt;/h2&gt;

&lt;p&gt;In today's complex digital landscapes, selecting the right observability tools is crucial. Let's explore the key attributes that make an observability solution truly effective and aligned with observability best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Holistic Monitoring Capabilities
&lt;/h3&gt;

&lt;p&gt;A comprehensive observability platform should adeptly handle the four pillars of telemetry data, collectively known as MELT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: Quantitative indicators of system health, such as CPU utilization&lt;/li&gt;
&lt;li&gt;Events: Significant system occurrences or state changes&lt;/li&gt;
&lt;li&gt;Logs: Detailed records of system activities and operations&lt;/li&gt;
&lt;li&gt;Traces: Request pathways through the system, illuminating performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ideal solution seamlessly integrates these data types, providing a cohesive view of your system's health.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Data Analysis and Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Modern systems often exhibit unpredictable behavior patterns, rendering static alert thresholds ineffective. Advanced observability tools employ machine learning to detect anomalies without explicit configuration, while still allowing for customization. By correlating anomalies across various telemetry types, these systems can perform automated root cause analysis, significantly reducing troubleshooting time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sophisticated Alerting and Incident Management
&lt;/h3&gt;

&lt;p&gt;Real-time alerting is the backbone of effective observability. A top-tier solution should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert on both customizable thresholds and AI-detected anomalies&lt;/li&gt;
&lt;li&gt;Consolidate related alerts into actionable incidents&lt;/li&gt;
&lt;li&gt;Enrich incidents with contextual data, runbooks, and team information&lt;/li&gt;
&lt;li&gt;Intelligently route incidents to appropriate personnel&lt;/li&gt;
&lt;li&gt;Trigger automated remediation workflows when applicable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To combat alert fatigue, the system should employ intelligent alert suppression, prioritization, and escalation mechanisms.&lt;/p&gt;
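&lt;p&gt;To make the consolidation behavior concrete, here is a minimal sketch. The fingerprint scheme (e.g. service plus symptom) and the five-minute window are assumptions for illustration, not any specific product's logic:&lt;/p&gt;

```python
def consolidate(alerts, window_s=300):
    """Collapse raw alerts into incidents: alerts sharing a fingerprint
    within window_s of the incident's last alert join that incident."""
    incidents, open_by_fp = [], {}
    for ts, fp, msg in sorted(alerts):
        inc = open_by_fp.get(fp)
        if inc and ts - inc["last"] <= window_s:
            inc["last"] = ts
            inc["messages"].append(msg)   # suppressed: folded into open incident
        else:
            inc = {"fingerprint": fp, "first": ts, "last": ts, "messages": [msg]}
            incidents.append(inc)         # new actionable incident
            open_by_fp[fp] = inc
    return incidents
```

&lt;p&gt;Real platforms layer prioritization, escalation, and enrichment on top, but grouping by fingerprint and time window is the core mechanism that turns an alert storm into a handful of incidents.&lt;/p&gt;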

&lt;h3&gt;
  
  
  Data-Driven Insights
&lt;/h3&gt;

&lt;p&gt;Analytics derived from telemetry data drive continuous improvement. Key metrics to track include Mean Time to Repair (MTTR), Mean Time to Acknowledge (MTTA), and various Service Level Objectives (SLOs). These insights facilitate post-incident analysis, helping teams prevent future issues and optimize system performance.&lt;/p&gt;
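&lt;p&gt;These metrics are simple to compute once incident timestamps are captured. A minimal sketch, assuming each incident is recorded as (opened, acknowledged, resolved) epoch seconds:&lt;/p&gt;

```python
from statistics import mean

def mtta_minutes(incidents):
    """Mean Time to Acknowledge: average of (acknowledged - opened), in minutes."""
    return mean(ack - opened for opened, ack, resolved in incidents) / 60

def mttr_minutes(incidents):
    """Mean Time to Repair: average of (resolved - opened), in minutes."""
    return mean(resolved - opened for opened, ack, resolved in incidents) / 60
```

&lt;p&gt;Tracking these week over week is what turns raw telemetry into a feedback loop: a rising MTTA points at routing or on-call gaps, a rising MTTR at diagnosis or tooling gaps.&lt;/p&gt;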

&lt;h3&gt;
  
  
  Extensive Integration Ecosystem
&lt;/h3&gt;

&lt;p&gt;A versatile observability solution should seamlessly integrate with your entire tech stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular programming languages and frameworks&lt;/li&gt;
&lt;li&gt;Open-source standards (OpenTelemetry, OpenMetrics, StatsD)&lt;/li&gt;
&lt;li&gt;Container orchestration platforms (Docker, Kubernetes)&lt;/li&gt;
&lt;li&gt;Security tools for vulnerability scanning&lt;/li&gt;
&lt;li&gt;Incident management systems&lt;/li&gt;
&lt;li&gt;CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Major cloud platforms&lt;/li&gt;
&lt;li&gt;Team collaboration tools&lt;/li&gt;
&lt;li&gt;Business intelligence platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability and Cost Optimization
&lt;/h3&gt;

&lt;p&gt;As applications grow in scale and complexity, managing observability costs becomes challenging. Look for tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify underutilized resources and forecast future needs&lt;/li&gt;
&lt;li&gt;Employ intelligent data sampling and retention policies&lt;/li&gt;
&lt;li&gt;Efficiently handle high-cardinality data&lt;/li&gt;
&lt;li&gt;Utilize cutting-edge technologies like eBPF for improved performance&lt;/li&gt;
&lt;/ul&gt;
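&lt;p&gt;Intelligent sampling often starts with deterministic head sampling, where every span of a trace gets the same keep/drop decision. A minimal sketch; the hash choice is an illustrative assumption, not a particular vendor's algorithm:&lt;/p&gt;

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head sampling: hash the trace id into [0, 1) so all
    spans and all collectors make the same keep/drop decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < sample_rate
```

&lt;p&gt;Because the decision is a pure function of the trace id, sampled traces stay complete end to end, which is what makes the retained data useful for debugging rather than a random shred of spans.&lt;/p&gt;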

&lt;h3&gt;
  
  
  Intuitive User Experience
&lt;/h3&gt;

&lt;p&gt;An observability platform's UI/UX is critical for efficient debugging and insight gathering. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear visualizations of system components and their relationships&lt;/li&gt;
&lt;li&gt;Pre-configured dashboards for common scenarios&lt;/li&gt;
&lt;li&gt;Easy integration with your existing stack&lt;/li&gt;
&lt;li&gt;Comprehensive, user-friendly documentation&lt;/li&gt;
&lt;li&gt;Ability to slice and dice visualizations and fast response time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Simplicity
&lt;/h3&gt;

&lt;p&gt;Scaling observability across an organization can be daunting. Look for platforms that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support "everything-as-code" for standardization and version control&lt;/li&gt;
&lt;li&gt;Integrate smoothly with modern application platforms&lt;/li&gt;
&lt;li&gt;Offer automation-friendly interfaces&lt;/li&gt;
&lt;li&gt;Provide tools for managing observability at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost-Effective Data Management
&lt;/h3&gt;

&lt;p&gt;As data volumes grow, intelligent data lifecycle management becomes crucial. Seek solutions offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tiered storage for different data types&lt;/li&gt;
&lt;li&gt;Advanced compression and deduplication techniques&lt;/li&gt;
&lt;li&gt;Intelligent data sampling strategies&lt;/li&gt;
&lt;li&gt;Efficient handling of high-cardinality data&lt;/li&gt;
&lt;/ul&gt;
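&lt;p&gt;A tiering policy can be as simple as routing telemetry by age. A minimal sketch with illustrative thresholds (the 7-day and 90-day cutoffs are assumptions, not recommendations):&lt;/p&gt;

```python
def storage_tier(age_days: float) -> str:
    """Route telemetry by age: hot storage for recent debugging, warm
    object storage for trend queries, then expiry. Thresholds illustrative."""
    if age_days <= 7:
        return "hot"       # full resolution on fast storage
    if age_days <= 90:
        return "warm"      # compressed/downsampled on cheap object storage
    return "expired"       # aged out per retention policy
```

&lt;p&gt;The right cutoffs depend on how far back your debugging and compliance queries actually reach; measuring that query distribution first keeps the policy honest.&lt;/p&gt;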

&lt;h3&gt;
  
  
  Alignment with Industry Standards
&lt;/h3&gt;

&lt;p&gt;Choosing tools that support industry-standard protocols and frameworks (like OpenTelemetry, PromQL, and Grafana) ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier integration with existing systems&lt;/li&gt;
&lt;li&gt;Vendor-independent implementations&lt;/li&gt;
&lt;li&gt;Flexibility to change backends without code modifications&lt;/li&gt;
&lt;/ul&gt;
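&lt;p&gt;To illustrate the backend flexibility, here is a minimal OpenTelemetry Collector configuration sketch (the endpoint is a placeholder): applications emit OTLP, and swapping vendors means editing only the exporter section, with no application code changes.&lt;/p&gt;

```yaml
# Illustrative OpenTelemetry Collector config; the endpoint is a placeholder.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.com/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```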

&lt;h3&gt;
  
  
  Organizational Fit
&lt;/h3&gt;

&lt;p&gt;When selecting an observability solution, consider your organization's unique needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System complexity and scale&lt;/li&gt;
&lt;li&gt;User base characteristics&lt;/li&gt;
&lt;li&gt;Budget constraints&lt;/li&gt;
&lt;li&gt;Team skills and expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prioritize platforms that cover your full stack, tying surface-level symptoms to root causes. Ensure the chosen solution integrates seamlessly with your current tech stack, DevSecOps processes, and team workflows. The ideal observability solution balances comprehensive insight with practical constraints. Ideally, you want one tool, or a small set of tools, effective enough to justify its cost and few enough to avoid constant context switching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Selecting the ideal observability solution is a nuanced process that demands a deep understanding of your organization's unique ecosystem. It's not just about collecting data; it's about gaining actionable insights that drive meaningful improvements in your systems and processes.&lt;/p&gt;

&lt;p&gt;The journey to effective observability requires a careful balance between comprehensive coverage and practical implementation. Your chosen solution should seamlessly integrate with your existing tech stack, enhancing rather than disrupting your current workflows. It's crucial to find a tool that not only provides rich, full-stack visibility but also aligns with your team's skills, your budget constraints, and your overall operational goals.&lt;/p&gt;

&lt;p&gt;Remember, observability is a double-edged sword. When implemented effectively, it can provide unprecedented insights into your systems, enabling proactive problem-solving and continuous improvement. However, if not approached thoughtfully, it can lead to unnecessary complexity, spiraling costs, and a false sense of security. The risk of "running half blind" with suboptimal observability practices is real and can have significant implications for your operations and bottom line.&lt;/p&gt;

&lt;p&gt;In this complex landscape, partnering with experts can make all the difference. CloudRaft, with &lt;a href="https://www.cloudraft.io/observability-consulting" rel="noopener noreferrer"&gt;its deep expertise in observability&lt;/a&gt; and extensive partnerships in the field, stands ready to guide you through this journey. Our experience can help you rapidly adopt and optimize modern observability practices, ensuring you reap the full benefits of these powerful tools without falling into common pitfalls.&lt;/p&gt;

&lt;p&gt;By choosing the right observability solution and implementation approach, you're not just collecting data – you're empowering your team with the insights they need to drive innovation, enhance performance, and deliver exceptional user experiences. In today's fast-paced digital environment, that's not just an advantage – it's a necessity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anjul Sahu&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/anjul" rel="noopener noreferrer"&gt;Anjul&lt;/a&gt; is a leading expert and thought leader in observability. Over the last decade and a half, he has seen every wave of how observability and monitoring have evolved in large-scale organizations such as telcos, banks, and Internet startups. He also advises investors and product companies on current trends in observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Madhukar Mishra&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/madhukar-mishra-b55593b8/" rel="noopener noreferrer"&gt;Madhukar&lt;/a&gt; has over a decade of experience, including building up the platform of a leading e-commerce company in India into one that delivers Internet-scale products. He is interested in large-scale distributed systems and is a thought leader in developer productivity and SRE.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>cloudraft</category>
      <category>opentelemetry</category>
      <category>thanos</category>
    </item>
    <item>
      <title>Secure Coding Best Practices</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 17 Jun 2023 13:19:12 +0000</pubDate>
      <link>https://forem.com/cloudraft/secure-coding-best-practices-2c62</link>
      <guid>https://forem.com/cloudraft/secure-coding-best-practices-2c62</guid>
      <description>&lt;p&gt;Every single day, an extensive array of fresh software vulnerabilities is unearthed by diligent security researchers and analysts. A considerable portion of these vulnerabilities emerges due to the absence of secure coding practices. Exploiting such vulnerabilities can have severe consequences, as they possess the potential to severely impair the financial or physical assets of a business, erode trust, or disrupt critical services.&lt;/p&gt;

&lt;p&gt;For organisations reliant on their software for their operations, it becomes imperative for software developers to embrace secure coding practices. Secure coding entails a collection of practices that software developers adopt to fortify their code against cyberattacks and vulnerabilities. By adhering to coding standards that embody best practices, developers can incorporate safeguards that minimise the risks posed by vulnerabilities in their code.&lt;/p&gt;

&lt;p&gt;In a world brimming with cyber threats, secure coding cannot be viewed as optional if a business intends to maintain its shield of protection.&lt;/p&gt;

&lt;p&gt;In this article, we will explore some anti-patterns and best practices you can include in your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns
&lt;/h2&gt;

&lt;p&gt;Now, let's briefly discuss some common mistakes, or anti-patterns, that lead to insecure code. The following are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insufficient validation of input data or processing inputs without proper encoding or sanitisation.&lt;/li&gt;
&lt;li&gt;Constructing SQL queries by concatenating strings, making the code vulnerable to data leaks or injection attacks.&lt;/li&gt;
&lt;li&gt;Failure to implement robust authentication, such as storing credentials in plain text without proper hashing and encryption.&lt;/li&gt;
&lt;li&gt;Poor design of password recovery mechanisms and infrequent rotation of security keys.&lt;/li&gt;
&lt;li&gt;Software planning and design lacking strong authorisation schemes.&lt;/li&gt;
&lt;li&gt;Granting excessive privileges during development or troubleshooting.&lt;/li&gt;
&lt;li&gt;Exposing sensitive information in debug logging without appropriate redaction.&lt;/li&gt;
&lt;li&gt;Utilising third-party libraries from untrusted sources or neglecting security checks.&lt;/li&gt;
&lt;li&gt;Unsafe handling of memory pointers or allowing pointer access beyond system boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these common mistakes in mind, let's explore practices and tools that can guide developers towards secure coding practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure Coding Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Shift left in software development lifecycle
&lt;/h3&gt;

&lt;p&gt;Historically, the conventional practice involved assigning the software security team to conduct security testing towards the conclusion of a software development project. The team would assess the application and compile a list of issues that require resolution. At this stage, the identified fixes would be prioritised, resulting in some vulnerabilities being addressed while others remained unattended. The reasons for leaving certain vulnerabilities unresolved could range from cost constraints and limited resources to pressing business priorities.&lt;/p&gt;

&lt;p&gt;However, this traditional approach is no longer sustainable. Security considerations must now be incorporated right from the outset—the initial stages—of the software development lifecycle. Security should be taken into account during the design phase itself. Both manual and automated testing should be conducted throughout the application's implementation as part of the Continuous Integration (CI) pipeline, ensuring that developers receive prompt feedback.&lt;/p&gt;

&lt;p&gt;To aid in this endeavour, the utilisation of static code analysis becomes invaluable. This technique enables the scanning of code for security flaws and risks, even while developers are actively writing it within an integrated development environment (IDE). For instance, static application security testing (SAST) tools can analyse the code for security vulnerabilities during the development process, facilitating early identification and mitigation of potential risks.&lt;/p&gt;
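&lt;p&gt;As a sketch of what shift-left looks like in a pipeline, here is an illustrative GitHub Actions job (the tool choice and names are examples, not a product recommendation) that runs a SAST scan on every pull request so developers get feedback before merge:&lt;/p&gt;

```yaml
# Illustrative CI job: fail the pull request if the SAST scan finds issues.
name: sast
on: [pull_request]
jobs:
  semgrep:
    runs-on: ubuntu-latest
    container: semgrep/semgrep
    steps:
      - uses: actions/checkout@v4
      - run: semgrep scan --config auto --error
```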

&lt;h3&gt;
  
  
  Input validation
&lt;/h3&gt;

&lt;p&gt;Ensuring the integrity of input data as it enters a system holds great significance. It is essential to validate the syntactic and semantic accuracy of all incoming data, considering it as untrusted. Employing checks and regular expressions aids in verifying the correctness, size, and syntax of the input.&lt;/p&gt;

&lt;p&gt;Performing these validations on the server side is highly recommended. In the case of web applications, it involves scrutinising various components, including HTTP headers, cookies, GET and POST parameters, as well as file uploads.&lt;/p&gt;

&lt;p&gt;Client-side validation also proves beneficial, contributing to an enhanced user experience by reducing the need for multiple network requests resulting from invalid inputs. This approach minimises back-and-forth communication and enhances efficiency.&lt;/p&gt;
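&lt;p&gt;A minimal server-side validation sketch in Python (the username rule here is an example, not a standard): treat all input as untrusted and check its shape before using it.&lt;/p&gt;

```python
import re

# Example allow-list rule: lowercase letter first, then 2-31 more
# characters from a small safe alphabet.
USERNAME_RE = re.compile(r"^[a-z][a-z0-9_]{2,31}$")

def validate_username(raw: str) -> str:
    """Reject anything that does not match the expected syntax."""
    if not isinstance(raw, str) or not USERNAME_RE.fullmatch(raw):
        raise ValueError("invalid username")
    return raw

print(validate_username("anjul_s"))    # passes the syntactic check
# validate_username("a; DROP TABLE")   # would raise ValueError
```

&lt;p&gt;Allow-listing what is valid is generally safer than block-listing known-bad patterns, which attackers can often work around.&lt;/p&gt;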

&lt;h3&gt;
  
  
  Parameterised queries
&lt;/h3&gt;

&lt;p&gt;During the process of storing and retrieving data, developers frequently engage with datastores. However, if they overlook the utilisation of parametrised queries, it can expose an opportunity for attackers to exploit widely accessible tools and manipulate inputs to extract sensitive information. SQL injection, a highly perilous application risk, exemplifies a common form of such attacks.&lt;/p&gt;

&lt;p&gt;By incorporating placeholders for parameters within the query, the specified parameters are treated as data rather than being considered as part of the SQL command itself. To mitigate these vulnerabilities, it is recommended to employ prepared statements or object-relational mapping (ORM) techniques. These approaches offer effective measures to safeguard against SQL injection and related threats.&lt;/p&gt;
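&lt;p&gt;Here is a small sketch using Python's stdlib sqlite3 driver (table and values are made up): the placeholder binds user input as data, so it is never parsed as SQL.&lt;/p&gt;

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "a@example.com"))

# Hostile input is inert: it is compared as a literal string, not executed.
user_input = "alice'; DROP TABLE users; --"
rows = conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- no match, and the table is untouched
```

&lt;p&gt;The same pattern applies with placeholders in other drivers and with ORMs, which generate parameterised statements for you.&lt;/p&gt;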

&lt;h3&gt;
  
  
  Encoding data
&lt;/h3&gt;

&lt;p&gt;Encoding data plays a vital role in mitigating threats by transforming potentially hazardous special characters into an inert, sanitised form. Output encoding such as HTML entity encoding, for example, offers protection against cross-site scripting (XSS) and other client-side injection attacks, while parameterisation guards against SQL injection.&lt;/p&gt;

&lt;p&gt;To enhance security, it is crucial to specify appropriate character sets, such as UTF-8, and encode data into a standardised character set before further processing. Additionally, employing canonicalisation techniques proves beneficial. For instance, simplifying characters to their basic form helps address issues such as double encoding and obfuscation attacks, thereby bolstering overall security measures.&lt;/p&gt;
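&lt;p&gt;Both ideas can be sketched with the Python stdlib: entity-encode untrusted text before rendering it, and canonicalise by decoding repeatedly until the value is stable, so double-encoded payloads cannot sneak past a single decode. (The decode limit below is an arbitrary illustrative choice.)&lt;/p&gt;

```python
import html
import urllib.parse

# Output encoding: render untrusted text inertly inside HTML.
payload = '<script>alert("xss")</script>'
print(html.escape(payload))
# &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;

def canonicalize(value: str, max_rounds: int = 5) -> str:
    """Percent-decode until the value stops changing, defeating
    double-encoding tricks; give up on absurdly layered input."""
    for _ in range(max_rounds):
        decoded = urllib.parse.unquote(value)
        if decoded == value:
            return value
        value = decoded
    raise ValueError("too many encoding layers")

print(canonicalize("%253Cscript%253E"))  # <script>
```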

&lt;h3&gt;
  
  
  Implement identity and authentication controls
&lt;/h3&gt;

&lt;p&gt;To further enhance security and minimise the risk of breaches, secure coding practices emphasise the importance of verifying a user's identity at the outset and integrating robust authentication controls into the application's code.&lt;/p&gt;

&lt;p&gt;Here are some recommended measures to achieve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employ strong authentication methods, such as multi-factor authentication, to add an additional layer of security.&lt;/li&gt;
&lt;li&gt;Consider incorporating biometric authentication methods, such as fingerprint or facial recognition, especially in mobile applications.&lt;/li&gt;
&lt;li&gt;Ensure secure storage of passwords. Typically, this involves hashing each password with a strong, salted hashing function and storing only the resulting hash in the database.&lt;/li&gt;
&lt;li&gt;Implement a secure password recovery mechanism to facilitate password resets while maintaining security.&lt;/li&gt;
&lt;li&gt;Enable session timeouts and inactivity periods to automatically terminate idle sessions.&lt;/li&gt;
&lt;li&gt;For sensitive operations like modifying account information, enforce re-authentication to validate the user's identity.&lt;/li&gt;
&lt;li&gt;Conduct regular audits of authentication transactions to detect any suspicious activities and maintain a vigilant stance against potential threats.&lt;/li&gt;
&lt;/ul&gt;
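&lt;p&gt;The password-storage point can be sketched with the Python stdlib; this is a minimal illustration (a dedicated password KDF such as bcrypt or Argon2 is generally preferable in production, and the iteration count should follow current guidance):&lt;/p&gt;

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return a random per-user salt and the PBKDF2-SHA256 hash.
    Only (salt, digest) are stored; the password itself never is."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

salt, digest = hash_password("s3cret!")
print(verify_password("s3cret!", salt, digest))  # True
print(verify_password("wrong", salt, digest))    # False
```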

&lt;h3&gt;
  
  
  Implement access controls
&lt;/h3&gt;

&lt;p&gt;Incorporating a well-thought-out authorisation strategy during the initial stages of application development can greatly enhance the overall security posture. Authorisation entails determining the specific resources that an authenticated user can or cannot access.&lt;/p&gt;

&lt;p&gt;Consider the following guidelines to strengthen the authorisation framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish a sequential flow of authentication followed by authorisation. Implement a mechanism where all requests undergo access control checks.&lt;/li&gt;
&lt;li&gt;Adhere to the principle of least privilege, initially denying access to any resource that has not been explicitly configured for access control.&lt;/li&gt;
&lt;li&gt;Enforce time-based limitations on user or system component actions by implementing expiration times, thereby ensuring that actions have defined timeframes for execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these practices, developers can create a robust and effective authorisation system that bolsters the overall security of the application.&lt;/p&gt;
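&lt;p&gt;The deny-by-default principle can be sketched in a few lines (roles and resources below are made-up examples): a request is allowed only if an explicit rule grants the role access to the resource, and everything not configured fails closed.&lt;/p&gt;

```python
# Explicit grants; anything absent from this table is denied.
ACCESS_RULES = {
    ("admin", "billing"): True,
    ("analyst", "reports"): True,
}

def is_allowed(role: str, resource: str) -> bool:
    """Least privilege: the default for an unconfigured pair is deny."""
    return ACCESS_RULES.get((role, resource), False)

print(is_allowed("analyst", "reports"))  # True
print(is_allowed("analyst", "billing"))  # False -- never configured
```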

&lt;h3&gt;
  
  
  Protect sensitive data
&lt;/h3&gt;

&lt;p&gt;In order to comply with legal and regulatory obligations, it is the responsibility of businesses to safeguard customer data. This sensitive data encompasses various categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personally identifiable information (PII)&lt;/li&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Health records&lt;/li&gt;
&lt;li&gt;Web browser data&lt;/li&gt;
&lt;li&gt;Mobile data, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent data leakage, it is crucial to employ robust encryption methods for both data at rest and data in transit. Consider the following practices to enhance data protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilise a well-established, peer-reviewed cryptographic library and functions that have been vetted and approved by your security team.&lt;/li&gt;
&lt;li&gt;Avoid storing encryption keys alongside the encrypted data to prevent unauthorised access.&lt;/li&gt;
&lt;li&gt;Refrain from storing confidential or sensitive data in memory, temporary locations, or log files during processing.&lt;/li&gt;
&lt;li&gt;Implement redaction techniques in log forwarders to remove sensitive information.&lt;/li&gt;
&lt;li&gt;Implement mandatory re-authentication when accessing sensitive data within the application.&lt;/li&gt;
&lt;/ul&gt;
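&lt;p&gt;As an illustration of the redaction point, here is a small sketch that masks obvious secrets before a log line leaves the process; the patterns are examples only and are nowhere near exhaustive:&lt;/p&gt;

```python
import re

# Illustrative patterns: a bare 16-digit card number and an email address.
PATTERNS = [
    (re.compile(r"\b\d{16}\b"), "[CARD REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL REDACTED]"),
]

def redact(line: str) -> str:
    """Apply every redaction pattern to the line before it is emitted."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line

print(redact("payment by a@example.com with 4111111111111111"))
# payment by [EMAIL REDACTED] with [CARD REDACTED]
```

&lt;p&gt;In practice this logic usually lives in the log forwarder or logging library filter, so individual services cannot forget to apply it.&lt;/p&gt;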

&lt;h3&gt;
  
  
  Implement logging and intrusion detection
&lt;/h3&gt;

&lt;p&gt;Even the most meticulously designed system can be susceptible to exploitation by attackers. Therefore, it is advisable to incorporate a monitoring system that can detect and identify unusual events. It is crucial to ensure that sufficient information is logged concerning authentication, authorisation, and resource access events. This logging should include details such as timestamps, the origin of access requests, IP addresses, and information pertaining to the requested resource. It is important to store this information in a secure and protected log. Typically, these logs are transmitted in real time to a centralised system where they are analysed for any anomalies. Prior to logging, apply encoding techniques to the untrusted data to safeguard against log injection attacks.&lt;/p&gt;
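&lt;p&gt;A minimal sketch of that last point: neutralise CR/LF characters in untrusted input before logging it, otherwise an attacker can forge extra, legitimate-looking log lines.&lt;/p&gt;

```python
def safe_for_log(value: str) -> str:
    """Escape newline characters so untrusted input stays on one log line."""
    return value.replace("\r", "\\r").replace("\n", "\\n")

# Without encoding, this input would inject a fake "admin login ok" entry.
attacker = "alice\n2026-02-11 INFO admin login ok"
print(f"login failed for user={safe_for_log(attacker)}")
# The whole value stays on a single, clearly attributable line.
```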

&lt;p&gt;In the event of a security breach, it is essential to have a well-documented playbook in place to promptly terminate system access, mitigating the risk of further data leakage. By following these practices, organisations can enhance their ability to detect and respond to potential intrusions, minimising the impact of security incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage security frameworks and libraries
&lt;/h3&gt;

&lt;p&gt;Avoid unnecessary duplication of effort. Instead, leverage established security frameworks and libraries that have been proven effective. When incorporating such components into your project, ensure they are sourced from reliable and trusted third-party repositories. It is important to regularly assess these libraries for any vulnerabilities or weaknesses and proactively keep them up to date.&lt;/p&gt;

&lt;p&gt;By adopting this approach, you can benefit from the expertise and experience embedded in these established security solutions, saving valuable time and effort while maintaining a strong security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor error and exception handling
&lt;/h3&gt;

&lt;p&gt;In line with the best practices of logging, it is advisable to adopt a centralised approach for handling and monitoring errors and exceptions with tools like Sentry. Effective management of errors and exceptions is crucial as mishandling them can inadvertently expose valuable information to potential attackers, enabling them to gain insights into your application and platform design.&lt;/p&gt;

&lt;p&gt;Consider the following measures to strengthen error and exception handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid logging sensitive information within error messages to prevent inadvertent disclosure.&lt;/li&gt;
&lt;li&gt;Regularly conduct code reviews to identify and address any weaknesses or vulnerabilities in the error handling implementation.&lt;/li&gt;
&lt;li&gt;Utilise negative testing techniques, such as exploratory and penetration testing, fuzzing, and fault injection, to actively identify and rectify potential issues related to error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these practices, you can ensure that error and exception handling is performed securely and with minimal risk of exposing sensitive information to potential attackers.&lt;/p&gt;
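&lt;p&gt;A common shape for this, sketched below with made-up handler names: log the full exception internally, but return only a generic message plus a correlation id to the caller, so stack traces and platform details never reach a potential attacker.&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("payments")

def handle_request(request_id: str, payload: dict) -> dict:
    try:
        amount = int(payload["amount"])  # may raise KeyError/ValueError
        return {"status": "ok", "amount": amount}
    except Exception:
        # Full detail (traceback included) goes to internal logs only.
        log.exception("request %s failed", request_id)
        # The caller sees a generic message and an id for support lookups.
        return {"status": "error", "message": "internal error", "id": request_id}

print(handle_request("req-42", {}))  # generic error, no stack trace leaked
```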

&lt;h2&gt;
  
  
  Benefits of implementing secure coding practices
&lt;/h2&gt;

&lt;p&gt;At this point, the advantages of embracing secure coding practices should be evident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorporating automated checks and code analysis during the development process enhances developer productivity by promptly providing feedback to improve code security. This leads to quicker time-to-market and higher-quality code.&lt;/li&gt;
&lt;li&gt;Cost optimisation within the software development lifecycle is achieved by minimising bugs at the early stages.&lt;/li&gt;
&lt;li&gt;Static application security testing (SAST) tools offer developers of all skill levels guardrails, AppSec governance, and valuable insights through IDE plugins. These tools equip developers with the necessary knowledge and resources to bolster application security.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Throughout our examination of coding flaws that can result in vulnerabilities, we have also explored best practices to enhance the security stance of software. However, in the context of large-scale projects, it can be daunting to implement these practices while ensuring proper governance.&lt;/p&gt;

&lt;p&gt;In the realm of extensive projects, the following considerations can help navigate these challenges effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish clear governance frameworks that outline security requirements, procedures, and responsibilities.&lt;/li&gt;
&lt;li&gt;Develop comprehensive guidelines and standards that align with secure coding practices and provide actionable steps for implementation.&lt;/li&gt;
&lt;li&gt;Foster collaboration and communication among development teams, security experts, and stakeholders to ensure a shared understanding of security goals and the necessary measures to achieve them.&lt;/li&gt;
&lt;li&gt;Prioritise the implementation of security measures by identifying high-risk areas and focusing resources accordingly.&lt;/li&gt;
&lt;li&gt;Regularly assess and review the security posture of the software throughout the development lifecycle, enabling continuous improvement and adjustments as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By adopting these approaches, the process of implementing secure coding practices within large projects becomes more manageable and ensures that proper governance is in place to safeguard against vulnerabilities effectively.&lt;/p&gt;

&lt;p&gt;It is advisable to create and automate workflows using SAST tools and integrate them into CI to enforce these best practices. Feel free to schedule a no-obligation &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;call with us&lt;/a&gt; to discuss DevSecOps strategy; we can help you improve your current practice.&lt;/p&gt;

</description>
      <category>devsecops</category>
      <category>security</category>
      <category>consulting</category>
    </item>
    <item>
      <title>DevOps Roadmap 2022</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Mon, 21 Feb 2022 19:20:28 +0000</pubDate>
      <link>https://forem.com/anjuls/devops-roadmap-2022-38mn</link>
      <guid>https://forem.com/anjuls/devops-roadmap-2022-38mn</guid>
      <description>&lt;p&gt;In the last few weeks, I met some folks in my &lt;a href="https://anjul.dev/mentoring" rel="noopener noreferrer"&gt;mentoring sessions&lt;/a&gt;, who are new to DevOps or in the mid of their career, were interested in knowing what to learn in 2022. DevOps skills are high in demand and there is constant learning required to keep yourself in sync with market demand.&lt;/p&gt;

&lt;p&gt;This post is to share the notes that can help you. Let's see some guidance based on my experience and understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Be fundamentally strong in networking technologies
&lt;/h3&gt;

&lt;p&gt;Understand concepts such as HTTP/2, QUIC (HTTP/3), Layer 4 and Layer 7 protocols, mTLS, proxies, DNS, BGP, how load balancing works, iptables, how the Internet works, IP addressing schemes, and lastly network design. I found &lt;a href="https://jvns.ca/" rel="noopener noreferrer"&gt;Julia Evans's&lt;/a&gt; blog very useful; it is my go-to place when I need to understand something in a simple way. She has covered a wide variety of topics in her blog posts and zines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Master the operating system fundamentals particularly Linux
&lt;/h3&gt;

&lt;p&gt;As most systems (VMs, containers, etc.) run Linux, it is important to know it from top to bottom. Learn scheduling, the systemd interface, the init system, cgroups and namespaces, and performance tuning, and master the command-line utilities: awk, sed, jq, yq, curl, ssh, openssl, and so on. Learn performance troubleshooting from &lt;a href="https://www.brendangregg.com/" rel="noopener noreferrer"&gt;Brendan Gregg's blog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD
&lt;/h3&gt;

&lt;p&gt;If you are still on Jenkins, that is fine, but the world has moved to cloud-native pipelines. Conceptually not much has changed in this space, but you can look into GitHub Actions, Tekton, and similar tools. Also learn how to do releases better: understand deployment strategies such as blue-green and canary. &lt;/p&gt;

&lt;h3&gt;
  
  
  Containerization and Virtualization
&lt;/h3&gt;

&lt;p&gt;Apart from the popular Docker runtime, try containerd, podman, and others. Learn how to containerise applications, how to implement &lt;a href="https://anjul.dev/blog/top-10-things-for-container-security/" rel="noopener noreferrer"&gt;container security&lt;/a&gt;, and how to run and orchestrate VMs on Kubernetes; see the KubeVirt project. &lt;/p&gt;

&lt;h3&gt;
  
  
  Container Orchestration
&lt;/h3&gt;

&lt;p&gt;Kubernetes is now the de facto standard for running containers, and there is a lot of content on the Internet to learn it. Focus on configuration best practices, application design, security, and scheduling. Setting up a cluster is becoming trivial, but the day-2 operational work, such as monitoring, logging, CI/CD, scaling the cluster, cost optimization, and security, is what you will be expected to solve. &lt;/p&gt;

&lt;h3&gt;
  
  
  Observability at Scale
&lt;/h3&gt;

&lt;p&gt;Most engineers are aware of the Prometheus and Grafana stack or something similar. Trends suggest that many organizations are consolidating their Kubernetes clusters and observability stacks, which helps from both the performance and the cost perspective. Learn about advanced Prometheus configurations and architectures and how to scale them. Look into technologies like Thanos, Cortex, VictoriaMetrics, Datadog, and Loki; continuous profiling tools such as Parca and Pyroscope; and distributed tracing with Hypertrace and OpenTelemetry. Service meshes such as Istio are a popular ingredient in cloud-native recipes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Platform team as a Product team
&lt;/h3&gt;

&lt;p&gt;The platform team is becoming more like a centralized product team, one that focuses on its internal customers: developers and testers. The goal is to improve ways of working and bring some order to the teams. Work on the problems the development and QA teams face. You are an enabler for other teams; instead of taking all the work into a central team, coach the dev teams to take up typical DevOps responsibilities. That way you can scale without burning yourself out. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g5yi6g1o4hjqa0i2486.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g5yi6g1o4hjqa0i2486.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;In many small organisations, security has been a second-class citizen, with product features given more priority. But due to increasingly sophisticated attacks and stricter compliance requirements, companies are adopting a shift-left security strategy. End-to-end encryption, strong RBAC, IAM policies, governance and auditing, and implementation of benchmarks and standards such as NIST, CIS, and ISO 27001 are common. Container security, policy as code, cloud governance, and supply chain security are hot topics. &lt;/p&gt;

&lt;h3&gt;
  
  
  Programming
&lt;/h3&gt;

&lt;p&gt;The DevOps or SRE role now takes on the cross-cutting concerns of developers, creating tooling that improves their productivity while enforcing standards. Solid software engineering practice and skill are required to craft high-quality platform components. &lt;/p&gt;

&lt;p&gt;I cannot stress this enough: good organizations look for solid programming experience in platform engineers. It is just as important in site reliability engineering, where you need to be fluent in programming and able to read, understand, and debug code written by others and, if necessary, fix it. &lt;/p&gt;

&lt;p&gt;Python and Golang are the most popular choices. My suggestion is Golang: with its strong concurrency support, strict type checking, broad adoption across organizations, mature toolchains, and the fact that many major cloud-native projects are built with it, it makes sense to learn it over Python. &lt;/p&gt;

&lt;p&gt;A few simple things you can try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write a CLI in your programming language.&lt;/li&gt;
&lt;li&gt;Learn to write a REST API and interact with databases&lt;/li&gt;
&lt;li&gt;Parallelism and Concurrency&lt;/li&gt;
&lt;/ul&gt;
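&lt;p&gt;For the first exercise, here is what a minimal CLI can look like using only Python's standard library (the tool name and flags are made up for illustration):&lt;/p&gt;

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI's arguments; argparse generates --help for free."""
    parser = argparse.ArgumentParser(prog="greet", description="Say hello.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true", help="uppercase output")
    return parser

def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    message = f"Hello, {args.name}!"
    return message.upper() if args.shout else message

print(main(["world"]))             # Hello, world!
print(main(["world", "--shout"]))  # HELLO, WORLD!
```

&lt;p&gt;The Go equivalent with the flag or cobra packages is a natural next step if you follow the Golang suggestion above.&lt;/p&gt;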

&lt;h3&gt;
  
  
  Infrastructure as Code
&lt;/h3&gt;

&lt;p&gt;Terraform is the de facto standard for infrastructure as code. Once you understand the concepts, it is easy to adapt to other tooling, as most of it is based on a similar declarative DSL. &lt;/p&gt;
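&lt;p&gt;The core idea is declaring desired state and letting the provider reconcile it. A tiny illustrative example (bucket name and region are placeholders):&lt;/p&gt;

```hcl
# Declarative infrastructure: "terraform apply" makes reality match this.
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-team-logs"

  tags = {
    ManagedBy = "terraform"
  }
}
```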

&lt;h3&gt;
  
  
  Cloud
&lt;/h3&gt;

&lt;p&gt;Most clouds work in much the same way, so if you know one cloud well, you can easily work with other providers. Focus on how to design applications using cloud-native components in a highly available, resilient, secure, and cost-effective way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Writing
&lt;/h3&gt;

&lt;p&gt;You might be wondering why I am talking about technical writing in a DevOps roadmap. A lot of folks don't give it enough attention, but how you communicate and work with other teams is super important. The future of work is remote, and email and Slack/Teams chats are the primary channels for conveying ideas to others. &lt;/p&gt;

&lt;p&gt;On a regular basis, you might be creating documents such as runbooks, postmortems, RFCs, architectural decision records, and software design docs, to name a few. A clear, easy-to-understand document does wonders: it saves your time and the reader's, and improves overall productivity. I suggest reading &lt;a href="https://blog.pragmaticengineer.com/becoming-a-better-writer-in-tech/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Site Reliability Engineering
&lt;/h3&gt;

&lt;p&gt;The boundary between DevOps and SRE is getting thin; in some organisations, the same person performs both roles. Understand the concepts behind SLIs, SLOs, error budgets, and SRE practices. Each organization does it differently, so I wouldn't suggest copy-pasting someone else's culture into your team. Refer to &lt;a href="https://sre.google/" rel="noopener noreferrer"&gt;Google's SRE culture&lt;/a&gt;.&lt;/p&gt;
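&lt;p&gt;The error-budget arithmetic behind those concepts is simple enough to sketch; assuming a 30-day window:&lt;/p&gt;

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO.
    E.g. a 99.9% target leaves 0.1% of the window as error budget."""
    total_minutes = days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes per 30 days
```

&lt;p&gt;That budget is what the team "spends" on risky releases and incidents; when it is exhausted, reliability work takes priority over features.&lt;/p&gt;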

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Personally, here is what I am excited about this year. This is not a definitive list, as it keeps changing with time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Mesh - Istio, Cilium Sidecarless mesh, Tetrate and Solo's Gloo mesh offering.&lt;/li&gt;
&lt;li&gt;How to improve Developer Productivity? It is a mix of culture, automation and tools. &lt;/li&gt;
&lt;li&gt;SRE Platforms - &lt;a href="https://Honeycomb.io" rel="noopener noreferrer"&gt;honeycomb&lt;/a&gt;, &lt;a href="https://last9.io/" rel="noopener noreferrer"&gt;Last9&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DevPortals - again linked with the motive of improving productivity and bridging the knowledge gap.&lt;/li&gt;
&lt;li&gt;Observability - technologies such as open telemetry, hypertrace, &lt;a href="https://thanos.io" rel="noopener noreferrer"&gt;Thanos&lt;/a&gt;, VictoriaMetrics, &lt;a href="https://vector.dev" rel="noopener noreferrer"&gt;Vector&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Security - supply chain security, code signing, tightening cloud security.&lt;/li&gt;
&lt;li&gt;Golang - improving the current skills.&lt;/li&gt;
&lt;li&gt;Serverless computing and event-driven architectures&lt;/li&gt;
&lt;li&gt;Web3 - understanding the landscape related to DevOps and Infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be curious and keep learning. Continuous bite-sized learning is easy to fit in alongside a full-time job. If you still have any questions, feel free to ping me on &lt;a href="https://www.twitter.com/anjuls" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I also curate cloud-native articles, tutorials and news in my weekly newsletter. Subscribe to &lt;a href="https://anjulsahu.substack.com" rel="noopener noreferrer"&gt;Cloud Native Weekly&lt;/a&gt; to get the latest updates.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
    </item>
    <item>
      <title>Machine Learning Orchestration on Kubernetes using Kubeflow</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 24 Mar 2021 05:22:07 +0000</pubDate>
      <link>https://forem.com/infracloud/machine-learning-orchestration-on-kubernetes-using-kubeflow-22nk</link>
      <guid>https://forem.com/infracloud/machine-learning-orchestration-on-kubernetes-using-kubeflow-22nk</guid>
      <description>&lt;h2&gt;
  
  
  MLOps: From Proof Of Concepts to Industrialization
&lt;/h2&gt;

&lt;p&gt;In recent years, AI and machine learning have seen tremendous growth across industries in various innovative use cases, and they are among the most important strategic trends for business leaders. When diving into the technology, the first step is usually small-scale experimentation on very basic use cases; the next step is to scale up operations. Sophisticated ML models help companies efficiently discover patterns, uncover anomalies, make predictions and decisions, and generate insights, and they are increasingly becoming a key differentiator in the marketplace. Companies recognise the need to move from proof of concepts to engineered solutions, and to move ML models from development to production. However, there is often a lack of consistency in tooling, and the development and deployment process is inefficient. As these technologies mature, we need operational discipline and sophisticated workflows to take advantage of them and operate at scale. This is popularly known as MLOps, ML CI/CD, or ML DevOps. In this article, we explore how this can be achieved with the Kubeflow project, which makes deploying machine learning workflows on Kubernetes simple, portable, and scalable.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLOps in Cloud Native World
&lt;/h3&gt;

&lt;p&gt;There are enterprise ML platforms like Amazon SageMaker, Azure ML, Google Cloud AI, and IBM Watson Studio in public cloud environments. For on-prem and hybrid environments, the most notable open-source platform is Kubeflow. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kubeflow?
&lt;/h2&gt;

&lt;p&gt;Kubeflow is a curated collection of machine learning frameworks and tools. It is a platform for data scientists and ML engineers who want to experiment with their models and design an efficient workflow to develop, test, and deploy at scale. It is a portable, scalable, open-source platform built on top of Kubernetes that abstracts the underlying Kubernetes concepts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EBkRU8i3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ne06co7uq6ycoxrdx2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EBkRU8i3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ne06co7uq6ycoxrdx2m.png" alt="Alt Text" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubeflow Architecture
&lt;/h2&gt;

&lt;p&gt;Kubeflow utilizes various cloud native technologies like Istio, Knative, Argo, and Tekton, and leverages Kubernetes primitives such as deployments, services, and custom resources. Istio and Knative provide capabilities like blue/green deployments, traffic splitting, canary releases, and auto-scaling. Kubeflow abstracts the Kubernetes components by providing a UI, a CLI, and easy workflows that non-Kubernetes users can use. &lt;/p&gt;
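&lt;p&gt;To make the traffic-splitting capability concrete, here is a minimal sketch of an Istio &lt;code&gt;VirtualService&lt;/code&gt; that keeps 90% of requests on a stable subset while a canary receives 10%. The service and subset names are hypothetical.&lt;/p&gt;

```yaml
# Hypothetical canary split: 90% of traffic to subset v1, 10% to subset v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
    - model-server
  http:
    - route:
        - destination:
            host: model-server
            subset: v1
          weight: 90
        - destination:
            host: model-server
            subset: v2
          weight: 10
```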

&lt;p&gt;For the ML capabilities, Kubeflow integrates best-in-class frameworks and tools such as TensorFlow, MXNet, Jupyter Notebooks, PyTorch, and Seldon Core. This integration provides data preparation, training, and serving capabilities. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNORGdRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pf9anu6tqv7imcps0vw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNORGdRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pf9anu6tqv7imcps0vw2.png" alt="Alt Text" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's look at Kubeflow Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central Dashboard&lt;/strong&gt;: The user interface for managing Kubeflow pipelines and interacting with the various components. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter Notebooks&lt;/strong&gt;: Lets you collaborate with other team members and develop models. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: Helps organize workflows by tracking and managing the metadata of artifacts. In this context, metadata means information about executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fairing&lt;/strong&gt;: Lets you run training jobs remotely by embedding them in a notebook or local Python code, and deploy prediction endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Store (Feast)&lt;/strong&gt;: It helps in feature sharing and reuse, serving features at scale, providing consistency between training and serving, point-in-time correctness, maintaining data quality and validation. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML Frameworks&lt;/strong&gt;: A collection of framework integrations, including Chainer (deprecated), MPI, MXNet, PyTorch, and TensorFlow, providing operators to run training jobs on Kubernetes. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Katib&lt;/strong&gt;: Implements automated machine learning through hyperparameter tuning (hyperparameters are the variables that control the model training process) and Neural Architecture Search (NAS), improving the predictive accuracy and performance of the model; it also provides a web UI to interact with Katib. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines&lt;/strong&gt;: Provides end-to-end orchestration and easy-to-reuse solutions that ease experimentation. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools for Serving&lt;/strong&gt;: There are two model serving systems that allow multi-framework model serving: KFServing, and Seldon Core. You can read more &lt;a href="https://www.kubeflow.org/docs/components/serving/overview/"&gt;about tools for serving here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are some of the Kubeflow Use Cases?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid multi-cloud ML platform at scale&lt;/strong&gt;: As Kubeflow is based on Kubernetes, it utilizes all the features and power that Kubernetes provides. This lets you design ML platforms that are portable and use the same APIs whether they run on-prem or in public clouds. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experimentation&lt;/strong&gt;: The easy UI and abstraction help with rapid experimentation and collaboration, and speed up development by providing guided user journeys. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DevOps for ML platforms&lt;/strong&gt;: Kubeflow pipelines help create reproducible workflows, which deliver consistency, save iteration time, and help with debugging, auditability, and compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tuning model hyperparameters during training&lt;/strong&gt;: During model development, hyperparameter tuning is often hard and time-consuming, yet it is critical for model performance and accuracy. Katib can reduce testing time and improve delivery speed by automating hyperparameter tuning. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
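&lt;p&gt;As a concrete illustration of how Katib automates this, below is a minimal sketch of a Katib &lt;code&gt;Experiment&lt;/code&gt; spec. The experiment name, metric name, and parameter range are hypothetical placeholders; adjust them to your training job.&lt;/p&gt;

```yaml
# Random search over the learning rate; Katib keeps the trial with the
# best reported validation accuracy. Names and ranges are placeholders.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-demo
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
```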

&lt;h2&gt;
  
  
  Kubeflow Demo
&lt;/h2&gt;

&lt;p&gt;Let's learn Kubeflow with an example. In this demo, we will try Kubeflow on a local Kind cluster. You should have a modern machine with at least 16 GB of RAM and 8 CPUs to try it locally; otherwise, use a VM in the cloud. We will use Zalando's Fashion-MNIST dataset and &lt;a href="https://github.com/anjuls/fashion-mnist-kfp-lab/blob/master/KF_Fashion_MNIST.ipynb"&gt;this notebook by &lt;em&gt;manceps&lt;/em&gt;&lt;/a&gt; for the demo. &lt;/p&gt;

&lt;p&gt;Due to a known &lt;a href="https://github.com/kubeflow/kfctl/issues/406"&gt;issue&lt;/a&gt;, I had to enable a few feature gates and extra API server arguments to make it work. &lt;br&gt;
Please use the following Kind configuration to create the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kind cluster configuration - kind.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "TokenRequest": true
  "TokenRequestProjection": true
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    metadata:
      name: config
    apiServer:
      extraArgs:
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
        "service-account-issuer": "kubernetes.default.svc"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the Kind cluster and install Kubeflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create Kind cluster
kind create cluster --config kind.yaml


# Deploy Kubeflow on Kind. 

mkdir -p /root/kubeflow/v1.0
cd /root/kubeflow/v1.0
wget https://github.com/kubeflow/kfctl/releases/download/v1.0/kfctl_v1.0-0-g94c35cf_linux.tar.gz

tar -xvf kfctl_v1.0-0-g94c35cf_linux.tar.gz         
export PATH=$PATH:/root/kubeflow/v1.0
export KF_NAME=my-kubeflow
export BASE_DIR=/root/kubeflow/v1.0
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml" 

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -f ${CONFIG_URI}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take 15-20 minutes to bring up all the services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❯ kubectl get pods -n kubeflow
NAME                                                     READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0               1/1     Running   0          19m
admission-webhook-deployment-5cd7dc96f5-4hsqr            1/1     Running   0          18m
application-controller-stateful-set-0                    1/1     Running   0          21m
argo-ui-65df8c7c84-dcm6m                                 1/1     Running   0          18m
cache-deployer-deployment-5f4979f45-6fvg2                2/2     Running   1          3m21s
cache-server-7859fd67f5-982mg                            2/2     Running   0          102s
centraldashboard-67767584dc-f5zhh                        1/1     Running   0          18m
jupyter-web-app-deployment-8486d5ffff-4cb8n              1/1     Running   0          18m
katib-controller-7fcc95676b-brk2q                        1/1     Running   0          18m
katib-db-manager-85db457c64-bb7dp                        1/1     Running   3          18m
katib-mysql-6c7f7fb869-c4qqx                             1/1     Running   0          18m
katib-ui-65dc4cf6f5-qrjpm                                1/1     Running   0          18m
kfserving-controller-manager-0                           2/2     Running   0          18m
kubeflow-pipelines-profile-controller-797fb44db9-hdnxc   1/1     Running   0          18m
metacontroller-0                                         1/1     Running   0          19m
metadata-db-6dd978c5b-wtglv                              1/1     Running   0          18m
metadata-envoy-deployment-67bd5954c-z8qrv                1/1     Running   0          18m
metadata-grpc-deployment-577c67c96f-ts9v6                1/1     Running   6          18m
metadata-writer-756dbdd478-7cbgj                         2/2     Running   0          18m
minio-54d995c97b-85xl6                                   1/1     Running   0          18m
ml-pipeline-7c56db5db9-9mswf                             2/2     Running   0          18s
ml-pipeline-persistenceagent-d984c9585-82qvs             2/2     Running   0          18m
ml-pipeline-scheduledworkflow-5ccf4c9fcc-mjrwz           2/2     Running   0          18m
ml-pipeline-ui-7ddcd74489-jw8gj                          2/2     Running   0          18m
ml-pipeline-viewer-crd-56c68f6c85-tszc4                  2/2     Running   1          18m
ml-pipeline-visualizationserver-5b9bd8f6bf-dj2r6         2/2     Running   0          18m
mpi-operator-d5bfb8489-9jzsf                             1/1     Running   0          4m27s
mxnet-operator-7576d697d6-7wj52                          1/1     Running   0          18m
mysql-74f8f99bc8-fddww                                   2/2     Running   0          18m
notebook-controller-deployment-5bb6bdbd6d-vx8tv          1/1     Running   0          18m
profiles-deployment-56bc5d7dcb-8x7vr                     2/2     Running   0          18m
pytorch-operator-847c8d55d8-zgh2x                        1/1     Running   0          18m
seldon-controller-manager-6bf8b45656-6k8r7               1/1     Running   0          18m
spark-operatorsparkoperator-fdfbfd99-5drsc               1/1     Running   0          19m
spartakus-volunteer-558f8bfd47-h2w62                     1/1     Running   0          18m
tf-job-operator-58477797f8-86z42                         1/1     Running   0          18m
workflow-controller-64fd7cffc5-77g6z                     1/1     Running   0          18m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you can access the Kubeflow dashboard via the &lt;code&gt;http2&lt;/code&gt; NodePort of the Istio ingress gateway (&lt;code&gt;$INGRESS_PORT&lt;/code&gt;), which can be fetched using the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
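&lt;p&gt;Once you have the port, the dashboard URL can be composed from the node address and &lt;code&gt;$INGRESS_PORT&lt;/code&gt;. A minimal sketch, using a hypothetical port value for illustration (with Kind, the node is reachable as localhost only if the port is mapped or forwarded):&lt;/p&gt;

```shell
# Compose the dashboard URL. 31380 is a hypothetical NodePort value;
# yours comes from the kubectl command above.
INGRESS_PORT=31380
NODE_IP=localhost   # or the node's external IP on a cloud VM
echo "http://${NODE_IP}:${INGRESS_PORT}"
```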



&lt;h3&gt;
  
  
  Let's Try an Experiment
&lt;/h3&gt;

&lt;p&gt;We will be using Zalando's Fashion-MNIST dataset to show basic classification using Tensorflow in this experiment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About the Dataset&lt;/strong&gt;&lt;br&gt;
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the exact image size and structure of training and testing splits.&lt;br&gt;
&lt;em&gt;source&lt;/em&gt;: &lt;a href="https://github.com/zalandoresearch/fashion-mnist"&gt;https://github.com/zalandoresearch/fashion-mnist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The whole experiment is sourced from &lt;em&gt;manceps&lt;/em&gt; notebook. Create a Jupyter notebook with the name &lt;code&gt;kf-demo&lt;/code&gt; using &lt;a href="https://github.com/anjuls/fashion-mnist-kfp-lab/blob/master/KF_Fashion_MNIST.ipynb"&gt;this notebook&lt;/a&gt;.   &lt;/p&gt;

&lt;p&gt;You can run the notebook from the dashboard and create the pipeline. Please note that in Kubeflow v1.2, there is an issue causing an &lt;code&gt;RBAC: permission denied&lt;/code&gt; error while connecting to the pipeline. This will be fixed in v1.3; you can read more about the issue &lt;a href="https://github.com/kubeflow/pipelines/issues/4440"&gt;here&lt;/a&gt;. As a workaround, you need to create an Istio &lt;code&gt;ServiceRoleBinding&lt;/code&gt; and an &lt;code&gt;EnvoyFilter&lt;/code&gt; to add an identity header. &lt;a href="https://gist.github.com/anjuls/43c7642ddb8be46c9d6095503a296862"&gt;Refer to this gist&lt;/a&gt; for the &lt;strong&gt;patch&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Kubeflow will orchestrate the various components to create the pipeline and run the ML experiment. You can access the results through the dashboard. Behind the scenes, Kubernetes pods, Argo workflows, etc., are created, which you don't need to worry about. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3FOZy1GM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xcmbgavzi8xps3riq810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3FOZy1GM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xcmbgavzi8xps3riq810.png" alt="Alt Text" width="800" height="100"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pods running the kf-demo notebook and pipeline&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I also noticed that when running the Pipeline in Kind, it complained about the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MountVolume.SetUp failed for volume "docker-sock" : hostPath type check
       failed: /var/run/docker.sock is not a socket file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To resolve this, I had to change the Argo workflow ConfigMap to use &lt;code&gt;pns&lt;/code&gt; instead of &lt;code&gt;docker&lt;/code&gt; as the container runtime executor.&lt;br&gt;&lt;br&gt;
&lt;a href="/assets/img/Blog/kubeflow/configmap.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/kubeflow/configmap.png" alt="ConfigMap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the change, re-run the experiment from the dashboard; it should then pass. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--66Jk4W4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5htvoacjytv8xe2efts4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--66Jk4W4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5htvoacjytv8xe2efts4.png" alt="Alt Text" width="800" height="385"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Experiment Flow&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--70bEBvjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/utphfauefi53y6cz1u8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--70bEBvjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/utphfauefi53y6cz1u8s.png" alt="Alt Text" width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Prediction Result&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you are looking to bring agility and improved management to machine learning operations in your organization, with enterprise-grade features such as RBAC, multi-tenancy and isolation, security, auditability, and collaboration, Kubeflow is an excellent option. It is stable, mature, and curated with best-in-class tools and frameworks, and it can be deployed on any Kubernetes distribution. See the &lt;a href="https://github.com/kubeflow/kubeflow/blob/master/ROADMAP.md"&gt;Kubeflow roadmap&lt;/a&gt; for what's coming in the next version. &lt;/p&gt;

&lt;p&gt;Hope this was helpful to you. Do try Kubeflow and share your experience by connecting with me on &lt;a href="https://twitter.com/anjuls"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021/"&gt;https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021/&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www2.deloitte.com/content/dam/insights/articles/6730_TT-Landing-page/DI_2021-Tech-Trends.pdf"&gt;https://www2.deloitte.com/content/dam/insights/articles/6730_TT-Landing-page/DI_2021-Tech-Trends.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.crn.com/news/cloud/5-emerging-ai-and-machine-learning-trends-to-watch-in-2021?itc=refresh"&gt;https://www.crn.com/news/cloud/5-emerging-ai-and-machine-learning-trends-to-watch-in-2021?itc=refresh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/blog/2019/07/30/deploy-your-machine-learning-models-with-kubernetes/"&gt;https://www.cncf.io/blog/2019/07/30/deploy-your-machine-learning-models-with-kubernetes/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://events19.linuxfoundation.org/wp-content/uploads/2018/02/OpenFinTech-MLonKube10112018-atin-and-sahdev.pdf"&gt;https://events19.linuxfoundation.org/wp-content/uploads/2018/02/OpenFinTech-MLonKube10112018-atin-and-sahdev.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thenewstack.io/how-kubernetes-could-orchestrate-machine-learning-pipelines/"&gt;https://thenewstack.io/how-kubernetes-could-orchestrate-machine-learning-pipelines/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/community/tutorials/kubernetes-ml-ops"&gt;https://cloud.google.com/community/tutorials/kubernetes-ml-ops&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kubernetes</category>
      <category>machinelearning</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Autonomous Log Monitoring and Incident Detection with Zebrium</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Sat, 10 Oct 2020 10:32:58 +0000</pubDate>
      <link>https://forem.com/infracloud/autonomous-log-monitoring-and-incident-detection-with-zebrium-2l9o</link>
      <guid>https://forem.com/infracloud/autonomous-log-monitoring-and-incident-detection-with-zebrium-2l9o</guid>
      <description>&lt;h2&gt;
  
  
  &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b20kkMvW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Zebrium-blog-by-Anjul-header-image--1024x224.png" alt="Zebrium blog by Anjul header image" width="800" height="175"&gt;Why do we need Autonomous Log Monitoring?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Everything fails, all the time – Werner Vogels, Amazon&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As Amazon's CTO once said, systems may fail even when they are thoughtfully designed with the utmost care and skill. It is therefore important to detect failures through automation, to reduce the burden on DevOps engineers and SREs. Driven by a high-velocity development lifecycle, developers use extensive prebuilt libraries and products to go to market as fast as they can. The onus is on the SREs to keep the service alive and keep the MTTR (mean time to recover) to a minimum. This becomes a problem when the system is a black box to the SREs and they have to put observability on top of it. Without knowing the internals, and without complete control over the logging information and metrics, they may sometimes run blindfolded until they learn more about the system and its new issues, and until they improve their playbooks or build a solution that prevents failures from recurring. That's the human way of solving problems: learning from mistakes.&lt;/p&gt;

&lt;p&gt;It is a common scenario in a large distributed system: when there is an incident, teams spend a lot of time capturing the right logs, parsing them, and trying to correlate them to find the root cause. Some teams do better: they automate log collection, aggregate logs to a common platform, and then do the hard work of searching the ocean of log data using tools like Elastic or Splunk. That works fine when you understand the log structure and all the components and know what to look for. But as mentioned above, it is really hard to keep the data structure consistent over a long time across all components. Most current log monitoring and collection tools just collect logs to a central place, parse the unstructured data, let you search or filter, and show visualizations or trends. What if the system generates a new type of log data or pattern that you have not automated for or planned for in advance? It becomes a problem.&lt;/p&gt;

&lt;p&gt;That is the point when you really need autonomous machine learning to scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Automation is the key to detect such incidents, anomalies in the system — and proactively try to prevent as much as possible to reduce the chances of failure and improve recovery time&lt;/em&gt;. — Google SRE Handbook&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Typically, when an incident occurs, support engineers manually peek into the ocean of logs and metrics to find interesting errors and warnings, and then start correlating various observations to come up with a root cause. This is a painfully slow process in which a lot of time is wasted. This is where Zebrium's machine learning capabilities help, by automatically correlating issues observed in the logs and metrics of various components to predict the root cause.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yaZ8uyT2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Zebrium%2520machine%2520learning.png%3Fwidth%3D843%26name%3DZebrium%2520machine%2520learning.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yaZ8uyT2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Zebrium%2520machine%2520learning.png%3Fwidth%3D843%26name%3DZebrium%2520machine%2520learning.png" alt="Zebrium machine learning" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Zebrium
&lt;/h2&gt;

&lt;p&gt;The Zebrium autonomous log and metrics monitoring platform uses machine learning to catch software incidents and show IT and cybersecurity teams the root cause. It is designed to be used with any application, and it is known for its ease of use and quick set-up; customers say the system often delivers initial results within hours of being installed. Unlike traditional monitoring and log management tools that require complex configuration and tuning to detect incidents, Zebrium’s approach of using unsupervised machine learning requires no manual configuration or human training. It was named one of the &lt;a href="https://www.forbes.com/sites/louiscolumbus/2020/07/05/gartners-top-25-enterprise-software-startups-to-watch-in-2020/#99d4a9e7822c"&gt;top 25 enterprise software startups to watch in 2020&lt;/a&gt; in the Gartner report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1xBGDjbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/Zebrium%2520-%2520how%2520it%2520works-1.gif%3Fwidth%3D1200%26name%3DZebrium%2520-%2520how%2520it%2520works-1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1xBGDjbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/Zebrium%2520-%2520how%2520it%2520works-1.gif%3Fwidth%3D1200%26name%3DZebrium%2520-%2520how%2520it%2520works-1.gif" alt="Zebrium - how it works-1" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gCPeGCt9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-09-13-15-25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gCPeGCt9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-09-13-15-25.png" alt="Zebrium Dashboard" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zebrium Dashboard&lt;/p&gt;

&lt;p&gt;Zebrium aggregates logs and metrics and makes them searchable using filters through easy navigation and drill-down. It also allows us to build alert rules — but most of the time you won’t have to! It uses unsupervised machine learning to autonomously learn the implicit structure of the log messages. It then cleanly organizes the content of each event type into tables with typed columns – perfect for fast and rich queries, reliable alerts, and high-quality pattern learning and anomaly detection. But most importantly, it uses machine learning to automatically catch problems and to show you root cause without you having to manually build any rules.&lt;/p&gt;
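&lt;p&gt;To get an intuition for what "learning the implicit structure" of log messages means, here is a toy sketch (not Zebrium's actual algorithm): collapse the variable fields of each log line into placeholders so that lines of the same event type share a template, then flag templates that occur rarely.&lt;/p&gt;

```python
import re
from collections import Counter

def event_type(line: str) -> str:
    """Collapse variable fields (hex ids, IPs, numbers) into placeholders,
    keeping the fixed wording that identifies the event type."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "#HEX", line)
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "#IP", line)
    line = re.sub(r"\d+", "#NUM", line)
    return line

def rare_events(lines, threshold=1):
    """Count occurrences of each event type and flag the rare ones,
    a crude stand-in for learned anomaly detection."""
    counts = Counter(event_type(line) for line in lines)
    return [etype for etype, n in counts.items() if n <= threshold]

logs = [
    "GET /api/v1/items 200 in 12ms",
    "GET /api/v1/items 200 in 9ms",
    "GET /api/v1/items 200 in 15ms",
    "disk error on device 0x1f3a: write failed",
]
print(rare_events(logs))  # the one-off disk error stands out
```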

&lt;p&gt;You can learn more about how it works &lt;a href="https://www.zebrium.com/product/how-zebrium-works"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FFx-xW18--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-09-13-55-50.png" alt="" width="800" height="345"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s7IC-qQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-10-15-37-47.png" alt="" width="800" height="318"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QHEAz1tP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/10/Screenshot-from-2020-10-10-15-42-41.png" alt="" width="800" height="301"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;p&gt;Zebrium provides various types of log collectors that can pull logs from Kubernetes, Docker, Linux, ECS, Syslog, AWS Cloudwatch, and any type of application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/zebrium/ze-cloudwatch/raw/master/pkgs/zebrium_cloudwatch-1.0.zip"&gt;&lt;strong&gt;ze-cloudwatch&lt;/strong&gt; Lambda function&lt;/a&gt; – This is the typical pattern to pull logs from AWS Cloudwatch to the Zebrium Platform.&lt;/li&gt;
&lt;li&gt;Container/Docker Log Collector: You can refer to &lt;a href="https://github.com/zebrium/ze-docker-log-collector"&gt;this&lt;/a&gt; for more information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.zebrium.com/docs/setup/kubernetes/"&gt;&lt;strong&gt;Zebrium Kubernetes log collector&lt;/strong&gt;&lt;/a&gt; – By far the easiest way to stream data from the Kubernetes cluster. It takes less than 2 mins to set up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Metrics Collector&lt;/strong&gt; – Zebrium has &lt;a href="https://github.com/zebrium/ze-stats"&gt;created a metrics collector&lt;/a&gt; to pull Kubernetes metrics and push to the platform. It requires 4Gi memory for every 100 nodes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/zebrium/ze-fluentd-plugin"&gt;&lt;strong&gt;Zebrium FluentD collector&lt;/strong&gt;&lt;/a&gt; – An easy way to stream logs from a Linux host.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/zebrium/ze-log-forwarder"&gt;Log Forwarder&lt;/a&gt; – to send the Syslog or any raw log to the platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zebrium CLI (Ze)&lt;/strong&gt; – A flexible way to stream log data or upload log files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zebrium + ELK (ZELK) Stack — &lt;a href="https://www.zebrium.com/product/zelkstack"&gt;see here&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hU8wFm_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/zelkstack.png%3Fwidth%3D438%26name%3Dzelkstack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hU8wFm_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.zebrium.com/hs-fs/hubfs/Product/zelkstack.png%3Fwidth%3D438%26name%3Dzelkstack.png" alt="zelkstack" width="673" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zebrium provides good integration with existing Elastic Stack (ELK Stack) clusters. You can even view the Zebrium incident dashboard inside Kibana by doing the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure an additional output plugin in your Logstash instance to send log events and metrics to Zebrium.&lt;/li&gt;
&lt;li&gt;Zebrium’s Autonomous Incident Detection and Root Cause will send incident details back to Logstash via a webhook input plugin.&lt;/li&gt;
&lt;li&gt;An incident summary, with drill-down into the incident events in Elasticsearch, is available directly from the Zebrium ML-Detected Incidents canvas in Kibana.&lt;/li&gt;
&lt;li&gt;For advanced drill-down and troubleshooting workflows, simply click on the Zebrium link in the Incident canvas.&lt;/li&gt;
&lt;/ol&gt;
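&lt;p&gt;Step 1 above can be sketched with Logstash’s generic &lt;code&gt;http&lt;/code&gt; output plugin. The endpoint URL and token below are placeholders, not Zebrium’s documented settings; the actual integration may use a dedicated plugin or different options.&lt;/p&gt;

```
# Sketch: forward Logstash events to an HTTP log-collector endpoint.
# URL and token are placeholders, not real Zebrium values.
output {
  http {
    url         => "https://cloud.example-zebrium.invalid/log/ingest"
    http_method => "post"
    format      => "json"
    headers     => { "Authorization" => "Token YOUR_ZE_TOKEN" }
  }
}
```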

&lt;h3&gt;
  
  
  Third-Party Integrations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zebrium’s Autonomous Incident &amp;amp; Root Cause Detection&lt;/strong&gt;  works in two modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It can autonomously detect and create incident alerts by applying machine learning to an incoming stream of logs and metrics. The incident alerts can be consumed via custom webhook, Slack, or email.&lt;/li&gt;
&lt;li&gt;Zebrium can also consume an external signal that indicates an incident that HAS occurred, and in response, it will create an incident report consisting of correlated log and metric anomalies, including likely root cause and symptoms surrounding the incident.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A special class of integrations relates to this second mode, including integrations with OpsGenie, PagerDuty, VictorOps, and Slack. Furthermore, Zebrium integration can be extended to any custom application using &lt;a href="https://docs.zebrium.com/docs/webhooks/"&gt;webhooks&lt;/a&gt;.&lt;/p&gt;
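&lt;p&gt;As a hedged sketch of the second mode, the snippet below builds a minimal JSON payload and shows the POST one might send to a custom webhook. The URL, token, and field names are hypothetical, not Zebrium’s documented schema; consult the webhooks documentation for the real format.&lt;/p&gt;

```shell
# Hypothetical sketch: signal an externally detected incident to a custom
# webhook. Endpoint and payload fields are placeholders, not Zebrium's API.
ZE_WEBHOOK_URL="https://example.invalid/zebrium/webhook"   # placeholder
PAYLOAD=$(printf '{"event_type":"incident","title":"%s","severity":"%s"}' \
  "Checkout latency spike" "critical")
echo "$PAYLOAD"
# curl -sS -X POST -H "Content-Type: application/json" \
#   -d "$PAYLOAD" "$ZE_WEBHOOK_URL"
```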

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Logical separation and an optional physical separation of data are possible. Each organization’s data is stored in its own schema with proper access control. For those who need further security (physical separation), a dedicated VPC is used.&lt;/li&gt;
&lt;li&gt;Multifactor authentication and encryption. Data at rest is encrypted with AES-256 encryption.&lt;/li&gt;
&lt;li&gt;Handling of sensitive data – Zebrium provides a way to filter out sensitive data/fields. It also provides a way to clinically remove data if uploaded accidentally.&lt;/li&gt;
&lt;li&gt;The system runs in AWS which has PCI DSS, Fedramp, and other leading industry security certifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What did I like about Zebrium?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Quick and easy onboarding with no manual training or rules setup differentiates this product from others.&lt;/li&gt;
&lt;li&gt;Comes with native collectors to consume logs from Kubernetes clusters, Docker, Linux, and AWS Cloudwatch.&lt;/li&gt;
&lt;li&gt;SaaS-based – provides easy access through the web and webhooks. This could be a problem for the few who want an on-premises setup.&lt;/li&gt;
&lt;li&gt;Integration with Elastic (ELK) is a plus.&lt;/li&gt;
&lt;li&gt;Unsupervised machine learning doesn’t require any manual input to train on the initial data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zebrium.com/blog/zebrium-grafana-awesome"&gt;Grafana integration&lt;/a&gt; is provided to chart Zebrium collected data.&lt;/li&gt;
&lt;li&gt;Easy to understand pricing structure. A $0 plan to try the core features. &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m-Ffyb4W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s.w.org/images/core/emoji/12.0.0-1/72x72/1f642.png" alt="🙂" width="72" height="72"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What can be Improved?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Integration with AWS Cloudwatch is provided, but not with other cloud providers such as Google Cloud and Azure.&lt;/li&gt;
&lt;li&gt;Integration with incident management systems such as ServiceNow, typically deployed in enterprises, is not documented. It may be possible using webhooks, but I haven’t tried it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Machine intelligence is the key to automating and scaling operations in a large enterprise environment: it can reduce operational cost by easing the load on DevOps/SRE teams and reduce MTTR, which can radically transform the business. With the unsupervised learning algorithm used by Zebrium, it becomes easier to find correlations between incidents and failures from log data and metrics without requiring human effort. &lt;strong&gt;Zebrium&lt;/strong&gt; provides simplified onboarding, requiring no configuration changes in the application and no human training, along with an easy-to-navigate UI. It is an appealing next-generation choice in the space of autonomous log and metric management platforms.&lt;/p&gt;

&lt;p&gt;Please try their &lt;a href="https://www.zebrium.com/pricing"&gt;free version&lt;/a&gt; to play around with the autonomous machine learning algorithm on your log data and let us know about your thoughts on autonomous log monitoring.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.infracloud.io/blogs/autonomous-log-monitoring-and-incident-detection-zebrium/"&gt;Autonomous Log Monitoring and Incident Detection with Zebrium&lt;/a&gt; appeared first on &lt;a href="https://www.infracloud.io"&gt;InfraCloud Technologies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>loggingandmonitoring</category>
      <category>machinelearning</category>
      <category>incident</category>
    </item>
    <item>
      <title>Running Oracle Database on Kubernetes and worried about Backup &amp; Recovery?</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Wed, 09 Sep 2020 12:06:05 +0000</pubDate>
      <link>https://forem.com/anjuls/running-oracle-database-on-kubernetes-and-worried-about-backup-recovery-2jea</link>
      <guid>https://forem.com/anjuls/running-oracle-database-on-kubernetes-and-worried-about-backup-recovery-2jea</guid>
      <description>&lt;p&gt;I have been using Oracle database for more than a decade and one of the challenging tasks as a DBA was always keeping the configurations in the consistent state across environments and I can't forget those nights when I had to recover the database when someone dropped critical data. &lt;/p&gt;

&lt;p&gt;Times have changed. In the seven years since the introduction of Docker and Kubernetes, improved resiliency and DevOps culture have improved the situation for most stateful applications, but Oracle has always discouraged running its database as a containerized application. &lt;/p&gt;

&lt;p&gt;If you are interested, read this &lt;a href="https://www.infracloud.io/blogs/oracle-database-backup-using-kasten-k10/"&gt;post&lt;/a&gt; where we discuss how to containerize the Oracle database and run it on Kubernetes. It also shows how to stop worrying about backup and recovery by using cloud-native solutions like the Kasten K10 platform, which takes snapshot-based backups of your Kubernetes application and its state (data) and provides application-consistent backups. We have tried this on Oracle 12c through 19c, and it works without any issues.&lt;/p&gt;

&lt;p&gt;Please &lt;a href="https://www.infracloud.io/blogs/oracle-database-backup-using-kasten-k10/"&gt;try&lt;/a&gt; and let me know your experience. &lt;/p&gt;

&lt;p&gt;Anjul&lt;/p&gt;

</description>
      <category>oracle</category>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>database</category>
    </item>
    <item>
      <title>The Ten Commandments of Container Security</title>
      <dc:creator>Anjul Sahu</dc:creator>
      <pubDate>Thu, 30 Jul 2020 17:23:26 +0000</pubDate>
      <link>https://forem.com/infracloud/the-ten-commandments-of-container-security-1nnd</link>
      <guid>https://forem.com/infracloud/the-ten-commandments-of-container-security-1nnd</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lCkRas4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/07/container-security-header-image-cloud-native-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lCkRas4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.infracloud.io/wp-content/uploads/2020/07/container-security-header-image-cloud-native-1.jpg" alt="container-security-cloud-native-header-image-blog" width="800" height="175"&gt;&lt;/a&gt;A cybersecurity incident can cause severe damage to the reputation of the organization and competitive disadvantage in the market, the imposition of penalties, and unwanted legal issues by end-users. On average, the cost of each data breach is USD 3.92 million as per this &lt;a href="https://www.ibm.com/security/data-breach"&gt;IBM report&lt;/a&gt;. The biggest challenges providing security in organizations are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lack of skills and training in security tools and practices&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lack of visibility and vulnerabilities&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous monitoring of the current state of security&lt;/p&gt;

&lt;p&gt;In a recent survey by Palo Alto Networks, the &lt;a href="https://www.paloaltonetworks.com/state-of-cloud-native-security"&gt;State of Cloud Native Security report&lt;/a&gt;, it was found that 94% of organizations use one or more cloud platforms and around 45% of their compute runs on containers or CaaS. The dominance of containers is increasing, and so are the security threats. The top issues identified as threats in the report are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data exposure and malware&lt;/li&gt;
&lt;li&gt;Application vulnerabilities&lt;/li&gt;
&lt;li&gt;Weak or broken authentication&lt;/li&gt;
&lt;li&gt;Misconfigurations&lt;/li&gt;
&lt;li&gt;Incorrect or over-permissive access&lt;/li&gt;
&lt;li&gt;Insider threats&lt;/li&gt;
&lt;li&gt;Credential leakage&lt;/li&gt;
&lt;li&gt;Insecure endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will go through some of the best practices, we can implement to reduce the security risks in the containerized workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 things to do to secure the Application Containers
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Source base image from trusted repositories
&lt;/h4&gt;

&lt;p&gt;When we create a container image, we often rely on a seed image sourced from popular private or public registries. Be aware that someone can penetrate the image supply chain and drop malicious code, opening the doors to attackers. For example, in 2018, &lt;a href="https://www.wired.com/story/british-airways-hack-details/"&gt;hackers targeted the British Airways web application with malicious JavaScript code&lt;/a&gt; by attacking its software supply chain. A couple of years back, Docker &lt;a href="https://threatpost.com/malicious-docker-containers-earn-crypto-miners-90000/132816/"&gt;identified a few images on Docker Hub&lt;/a&gt; that had cryptominers installed in them.&lt;/p&gt;

&lt;p&gt;Below are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When creating the container image, use a hardened base image sourced from a well-known, trusted publisher. &lt;/li&gt;
&lt;li&gt;Pick images that are published frequently with the latest security fixes and patches. &lt;/li&gt;
&lt;li&gt;Use signed and labeled images (sign with &lt;a href="https://docs.docker.com/notary/getting_started/"&gt;Notary&lt;/a&gt; or similar tools) and verify the authenticity of the image during the pull to stop man-in-the-middle attacks.&lt;/li&gt;
&lt;/ul&gt;
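&lt;p&gt;Image signature verification from the tips above can be enabled with Docker Content Trust; with it on, &lt;code&gt;docker pull&lt;/code&gt; verifies Notary signatures and rejects unsigned tags. The pull itself is commented out below because it requires a Docker host.&lt;/p&gt;

```shell
# Docker Content Trust: once enabled in the environment, docker pull
# verifies Notary signatures and refuses unsigned tags.
export DOCKER_CONTENT_TRUST=1
echo "content trust enabled: DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST"
# docker pull alpine:3.12   # now fails unless the tag is signed
```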

&lt;h4&gt;
  
  
  2. Install verified packages
&lt;/h4&gt;

&lt;p&gt;Just as the base image needs to come from trusted sources, the packages installed on top of it also need to come from verified and trusted sources, for the same reason.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Minimize attack surface in the Image
&lt;/h4&gt;

&lt;p&gt;By attack surface I mean the number of packages and libraries installed in the image. Common sense says that the fewer objects there are, the smaller the chance of a vulnerability. Keep the image size minimal while satisfying the application’s runtime requirements. Preferably, run only a single application in one application container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove unnecessary tools and software from the image: package managers (e.g. yum, apt), network tools and clients, shells, and netcat (which can be used to &lt;a href="https://www.hackingtutorials.org/networking/hacking-netcat-part-2-bind-reverse-shells/"&gt;create a reverse shell&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;Use the multi-stage Dockerfiles to remove software build components out of production images. &lt;/li&gt;
&lt;li&gt;Do not expose unnecessary network ports, sockets or run unwanted services (e.g. SSH daemon) in the container to reduce threats.&lt;/li&gt;
&lt;li&gt;Choose alpine images or scratch images or container optimized OS as compared to full-blown OS images for the base image.&lt;/li&gt;
&lt;/ul&gt;
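&lt;p&gt;The multi-stage tip above can be sketched as follows; all names (the Go toolchain, paths, UID) are illustrative, not from a real project. The build toolchain stays in the first stage, and the final image contains only the compiled binary on a &lt;code&gt;scratch&lt;/code&gt; base.&lt;/p&gt;

```dockerfile
# Build stage: compiler and sources live only here.
FROM golang:1.15 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./...

# Final stage: empty base image, single static binary, non-root user.
FROM scratch
COPY --from=build /app /app
USER 10001
ENTRYPOINT ["/app"]
```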

&lt;h4&gt;
  
  
  4. Do not bake secrets in the image
&lt;/h4&gt;

&lt;p&gt;All secrets should be kept out of the image and the Dockerfile. Secrets include SSL certificates, passwords, tokens, API keys, etc.; they should be stored externally and securely mounted through the container orchestration engine or an external secret manager. Tools like HashiCorp Vault, cloud secret management services such as AWS Secrets Manager, Kubernetes secrets, &lt;a href="https://www.docker.com/blog/docker-secrets-management/"&gt;Docker secrets management&lt;/a&gt;, CyberArk, etc. can improve the security posture.&lt;/p&gt;
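&lt;p&gt;A minimal sketch of the Kubernetes approach, with hypothetical names (&lt;code&gt;db-credentials&lt;/code&gt;, &lt;code&gt;/etc/secrets&lt;/code&gt;): the secret is mounted as a read-only file at runtime instead of being baked into the image.&lt;/p&gt;

```yaml
# Sketch: mount a Kubernetes Secret as a read-only file volume.
# All names here are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    volumeMounts:
    - name: creds
      mountPath: /etc/secrets
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: db-credentials
```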

&lt;h4&gt;
  
  
  5. Use of Secure Private or Public Registries
&lt;/h4&gt;

&lt;p&gt;Enterprises often have their own base images with proprietary software and libraries that they don’t want to distribute publicly. Ensure the image is hosted on a secure, trusted registry to prevent unauthorized access. Use a TLS certificate from a trusted root CA and implement strong authentication to prevent man-in-the-middle (MITM) attacks.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Do not use privileged or root user to run the application in a container
&lt;/h4&gt;

&lt;p&gt;This is the most common misconfiguration in containerized workloads. With the principle of least privilege in mind, create an application user and use it to run the application process inside the container. Why not root? A process running in a container is similar to a process running on the host operating system, except that it carries additional metadata identifying it as part of a container. With the UID and GID of the root user in a container, you can access and modify files written by root on the host machine.&lt;/p&gt;

&lt;p&gt;Note – If you don’t define a USER in the Dockerfile, the container will generally run as the root user.&lt;/p&gt;
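&lt;p&gt;A minimal sketch of a non-root image (the user name and IDs are illustrative):&lt;/p&gt;

```dockerfile
# Sketch: create an unprivileged user and switch to it before the app runs.
FROM alpine:3.12
RUN addgroup -g 1001 app
RUN adduser -D -G app -u 1001 app
USER app
CMD ["id"]
```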

&lt;h4&gt;
  
  
  7. Implement image vulnerability scanning in CI/CD
&lt;/h4&gt;

&lt;p&gt;When designing CI/CD for container build and delivery, include an image scanning solution to identify vulnerabilities (CVEs), and do not deploy exploitable images without remediation. Tools like Clair, Snyk, Anchore, Aqua Security, and Twistlock can be used. Some container registries, such as AWS ECR and Quay.io, come equipped with scanning solutions; do use them.&lt;/p&gt;
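&lt;p&gt;As one possible sketch (GitLab CI syntax, hypothetical image name), a scan step can fail the pipeline when high or critical CVEs are found. Trivy’s &lt;code&gt;--exit-code&lt;/code&gt; and &lt;code&gt;--severity&lt;/code&gt; flags are real; the rest is illustrative and assumes Trivy is available in the job image.&lt;/p&gt;

```yaml
# Sketch: fail the pipeline on high/critical CVEs before deploying.
scan:
  stage: test
  script:
    - trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/app:$CI_COMMIT_SHA
```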

&lt;h4&gt;
  
  
  8. Enable kernel security profiles like AppArmor
&lt;/h4&gt;

&lt;p&gt;AppArmor is a Linux security module that protects the OS and its applications from security threats. Docker provides a default profile that restricts a program to a limited set of resources such as network access, kernel capabilities, and file permissions. It reduces the potential attack surface and provides good defense in depth.&lt;/p&gt;
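&lt;p&gt;A profile is attached at run time with &lt;code&gt;--security-opt&lt;/code&gt;; the sketch below only echoes the command, since running it needs a Docker host. &lt;code&gt;docker-default&lt;/code&gt; is the profile Docker ships; a custom profile must first be loaded on the host.&lt;/p&gt;

```shell
# Sketch: run a container under an explicit AppArmor profile.
# Echoed rather than executed, since it requires a Docker host.
PROFILE="docker-default"
echo "docker run --security-opt apparmor=$PROFILE --rm alpine:3.12 sh"
```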

&lt;h4&gt;
  
  
  9. Secure centralized and remote logging
&lt;/h4&gt;

&lt;p&gt;Containers usually log everything to STDOUT, and these logs are lost once the container is terminated, so it is important to securely stream logs to a centralized system for audit and future forensics. We also need to ensure that this logging system itself is secured and that there is no data leakage from the logs. &lt;/p&gt;

&lt;h4&gt;
  
  
  10. Deploy runtime security monitoring
&lt;/h4&gt;

&lt;p&gt;Even if you deploy vulnerability scanning solutions based on repository data and take all necessary precautions, there is still a chance of being compromised. It is important to continuously monitor and log application behavior to prevent and detect malicious activities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;There is no silver bullet solution with Cyber Security, a layered defence is the only viable defence. – ICIT Research&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By implementing the above best practices, you can make it harder for an attacker to find ways to exploit your system. Below are some tools and references that can be used to audit and secure containers. Security is a vast topic; we haven’t covered Kubernetes-specific controls in this article, but stay tuned for a follow-up article focusing on Kubernetes security best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;p&gt;To simplify the adoption of security controls, here are a few open source and commercial offerings that can be used to discover the current state of your workloads and generate advisories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/docker/docker-bench-security"&gt;docker-bench-security&lt;/a&gt; – Official tool from Docker to audit container workloads against the industry-standard &lt;a href="https://www.cisecurity.org/benchmark/docker/"&gt;CIS Benchmark&lt;/a&gt; for Docker.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/hadolint/hadolint"&gt;Hadolint linter for Dockerfiles&lt;/a&gt; – Use the linter for static analysis of Dockerfiles. It helps enforce best practices and integrates with popular code editors and CI pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/quay/clair"&gt;Clair&lt;/a&gt; – A popular static vulnerability scanner for application containers. It regularly sources metadata from various vulnerability databases. Alternatives are &lt;a href="https://anchore.com/opensource/"&gt;Anchore&lt;/a&gt;, &lt;a href="http://synk.io"&gt;Snyk&lt;/a&gt;, and &lt;a href="https://github.com/aquasecurity/trivy"&gt;Trivy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html"&gt;OWASP Cheat Sheet&lt;/a&gt; – OWASP is an open community popular among security experts, and this cheat sheet is a good starting point.&lt;/li&gt;
&lt;li&gt;OpenSCAP for containers – Security Content Automation Protocol (SCAP) is a multi-purpose framework of specifications that supports automated configuration, vulnerability and patch checking, technical control compliance activities, and security measurement. It implements NIST standards.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sysdig.com/opensource/falco/"&gt;Sysdig Falco&lt;/a&gt; – Falco can be used to implement runtime security. It uses efficient eBPF instrumentation to intercept system calls and traffic for real-time monitoring and forensics. As attackers evolve, new vulnerabilities are discovered that static scanning tools often miss, so a solution with continuous behavioral monitoring and advanced AI/ML-based engines can’t be left off the list of essentials.&lt;/li&gt;
&lt;li&gt;Commercial offerings from Aqua Security, Twistlock, Sysdig, Snyk, and Qualys for enterprise-grade security tools and solutions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.stackrox.com/post/2020/04/container-image-security-beyond-vulnerability-scanning/"&gt;https://www.stackrox.com/post/2020/04/container-image-security-beyond-vulnerability-scanning/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.aquasec.com/docker-security-best-practices"&gt;https://blog.aquasec.com/docker-security-best-practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techbeacon.com/security/10-top-open-source-tools-docker-security"&gt;https://techbeacon.com/security/10-top-open-source-tools-docker-security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-190.pdf"&gt;NIST Application Container Security&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cisecurity.org/benchmark/docker/"&gt;CIS Benchmark for Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/solutions/best-practices-for-building-containers"&gt;https://cloud.google.com/solutions/best-practices-for-building-containers&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do comment if you have an interesting security incident or a preventable hack involving containers that you want to share with the community.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.infracloud.io/blogs/top-10-things-for-container-security/"&gt;The Ten Commandments of Container Security&lt;/a&gt; appeared first on &lt;a href="https://www.infracloud.io"&gt;InfraCloud Technologies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>container</category>
      <category>security</category>
      <category>bestsecuritypractice</category>
      <category>cloudnativesecurity</category>
    </item>
  </channel>
</rss>
