<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mahra Rahimi</title>
    <description>The latest articles on Forem by Mahra Rahimi (@mahrrah).</description>
    <link>https://forem.com/mahrrah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1006150%2F8676752b-d99f-4978-bad0-1139466f05ef.jpg</url>
      <title>Forem: Mahra Rahimi</title>
      <link>https://forem.com/mahrrah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mahrrah"/>
    <language>en</language>
    <item>
      <title>How to Monitor the Length of Your Individual Azure Storage Queues</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Mon, 27 Jan 2025 13:21:47 +0000</pubDate>
      <link>https://forem.com/mahrrah/how-to-monitor-the-length-of-your-individual-azure-storage-queues-204n</link>
      <guid>https://forem.com/mahrrah/how-to-monitor-the-length-of-your-individual-azure-storage-queues-204n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;  Azure Storage Queues lack built-in metrics for individual queue lengths. However, you can use the Azure SDK to query &lt;code&gt;approximate_message_count&lt;/code&gt; and track each queue's length. Emit this data as custom metrics using OpenTelemetry. A sample project is available to automate this process with Azure Functions for reliable, scalable monitoring.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're using &lt;a href="https://learn.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction" rel="noopener noreferrer"&gt;Azure Storage Queues&lt;/a&gt; and need (or simply want) to monitor the length of each queue individually, I have some bad news. 😫&lt;/p&gt;

&lt;p&gt;Azure only provides metrics for the total message count across the entire Storage Account via its &lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-storage-storageaccounts-queueservices-metrics" rel="noopener noreferrer"&gt;built-in metrics&lt;/a&gt; feature. Unfortunately, this makes those built-in metrics less useful if you need to track message counts for individual queues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzkievv86cpm31sztg5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzkievv86cpm31sztg5x.png" alt="In-Build Queue Metrics" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example above shows the built-in metrics. There are two queues at any given time, but we cannot tell how many messages are in each individual queue. The filter functionality is disabled, and there is no dedicated metric for per-queue message count, as can be seen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eik55zu01lhqa8q7rbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eik55zu01lhqa8q7rbs.png" alt="In-Build Queue Metrics Types" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does monitoring individual queue lengths matter?
&lt;/h3&gt;

&lt;p&gt;Monitoring individual queue lengths can be important for several reasons. For instance, if you're managing multiple queues, you may want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Track a poison message queue&lt;/strong&gt; to avoid disruptions in your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor the pressure&lt;/strong&gt; on specific queues to ensure they are processing messages efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage scaling decisions&lt;/strong&gt; by watching how queues grow under different loads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're debugging or scaling, knowing the message count for each queue helps keep your system healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The good news 😊
&lt;/h3&gt;

&lt;p&gt;While Azure doesn’t provide this feature out of the box, there’s an easy workaround, which this blog will walk you through.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Your Metrics
&lt;/h2&gt;

&lt;p&gt;As mentioned, Azure does not provide individual Storage Queue lengths as a built-in metric. Given that people have been asking for this feature for the past five years, it's likely not a simple task for Microsoft to implement this as a standard metric. Therefore, finding a workaround might be your best option.&lt;/p&gt;

&lt;p&gt;Naturally, this leads to the question: &lt;em&gt;If standard metrics don’t provide this, is there another way to get it?&lt;/em&gt; 🤔&lt;/p&gt;

&lt;p&gt;A closer look at the &lt;a href="https://learn.microsoft.com/en-us/python/api/overview/azure/storage?view=azure-python" rel="noopener noreferrer"&gt;Azure Storage Account SDK&lt;/a&gt; reveals the &lt;a href="https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueproperties?view=azure-python" rel="noopener noreferrer"&gt;&lt;code&gt;queue.properties&lt;/code&gt;&lt;/a&gt; attribute &lt;a href="https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueproperties?view=azure-python#azure-storage-queue-queueproperties-approximate-message-count" rel="noopener noreferrer"&gt;&lt;code&gt;approximate_message_count&lt;/code&gt;&lt;/a&gt;, which gives you access to the information you need—just via a different method.&lt;/p&gt;

&lt;p&gt;Knowing this, wouldn’t it be great if you could use this data to track queue lengths as a metric?&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s a thought: What if you just do that? 🧠
&lt;/h3&gt;

&lt;p&gt;You can query the length of each queue, create metric gauges, and update their values on a regular basis.&lt;/p&gt;

&lt;p&gt;Let’s break it down step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Get Queue Length
&lt;/h2&gt;

&lt;p&gt;Using the Python SDK, you can easily retrieve the individual length of a queue. See the snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.queue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QueueClient&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;STORAGE_ACCOUNT_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;storage-account-url&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;QUEUE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;queue-name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;STORAGE_ACCOUNT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;STORAGE_ACCOUNT_KEY&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QueueClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_ACCOUNT_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEUE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_queue_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;message_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;approximate_message_count&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the SDK is built on top of the REST API, similar functionality is available across other SDKs. Here are references for the REST API and SDKs in other languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/rest/api/storageservices/get-queue-metadata#response-headers" rel="noopener noreferrer"&gt;REST API - &lt;code&gt;x-ms-approximate-messages-count: int-value&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/api/azure.storage.queues.models.queueproperties.approximatemessagescount?view=azure-dotnet#azure-storage-queues-models-queueproperties-approximatemessagescount" rel="noopener noreferrer"&gt;.NET - &lt;code&gt;ApproximateMessagesCount&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/java/api/com.azure.storage.queue.models.queueproperties?view=azure-java-stable#com-azure-storage-queue-models-queueproperties-getapproximatemessagescount()" rel="noopener noreferrer"&gt;Java - &lt;code&gt;getApproximateMessagesCount()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
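&lt;p&gt;Putting this together for every queue in the account, here is a minimal sketch. The method names (&lt;code&gt;list_queues&lt;/code&gt;, &lt;code&gt;get_queue_client&lt;/code&gt;, &lt;code&gt;get_queue_properties&lt;/code&gt;) follow the Python SDK's &lt;code&gt;QueueServiceClient&lt;/code&gt;, but the function is deliberately duck-typed, so any stand-in object with the same methods works too:&lt;/p&gt;

```python
def collect_queue_lengths(service_client):
    """Return {queue_name: approximate_message_count} for every queue.

    `service_client` is assumed to behave like the SDK's QueueServiceClient:
    `list_queues()` yields items with a `name` attribute, and
    `get_queue_client(name)` returns a client whose `get_queue_properties()`
    exposes `approximate_message_count`.
    """
    lengths = {}
    for queue in service_client.list_queues():
        queue_client = service_client.get_queue_client(queue.name)
        properties = queue_client.get_queue_properties()
        lengths[queue.name] = properties.approximate_message_count
    return lengths
```

&lt;p&gt;Because the counts are approximate and change as messages flow, treat the result as a monitoring signal, not an exact inventory.&lt;/p&gt;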

&lt;h2&gt;
  
  
  2. Create a Gauge and Emit Metrics
&lt;/h2&gt;

&lt;p&gt;Next, you create a gauge metric to track the queue length.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;a href="https://prometheus.io/docs/concepts/metric_types/#gauge" rel="noopener noreferrer"&gt;&lt;strong&gt;gauge&lt;/strong&gt;&lt;/a&gt; is a metric type that measures a value at a particular point in time, making it perfect for tracking queue lengths, which fluctuate constantly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this, we’ll use &lt;a href="https://opentelemetry.io/docs/what-is-opentelemetry/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt;, an open-source observability framework gaining popularity for its versatility in collecting metrics, traces, and logs.&lt;br&gt;
Below is an example of how to emit the queue length as a gauge using OpenTelemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_meter_provider&lt;/span&gt;

&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_meter_provider&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;METER_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gauge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gauge_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gauge_description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;new_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="err"&gt;⋮&lt;/span&gt; &lt;span class="c1"&gt;# Code to get approximate_message_count and set new_length to it
&lt;/span&gt;
&lt;span class="n"&gt;gauge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another advantage of OpenTelemetry is that it integrates extremely well with various observability tools such as Prometheus, Azure Application Insights, Grafana, and more.&lt;/p&gt;
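&lt;p&gt;When monitoring several queues, rather than creating one gauge per queue, you can emit a single gauge and distinguish queues via attributes. A minimal sketch; the recorder only assumes a gauge-like object with a &lt;code&gt;set(value, attributes=...)&lt;/code&gt; method, which matches the OpenTelemetry Python gauge API:&lt;/p&gt;

```python
def record_queue_lengths(gauge, lengths):
    """Emit one data point per queue on a shared gauge.

    `gauge` is assumed to expose `set(value, attributes=...)`, as
    OpenTelemetry gauges do; `lengths` maps queue names to counts.
    """
    for queue_name, count in lengths.items():
        # The attribute lets backends filter/group per queue.
        gauge.set(count, attributes={"queue.name": queue_name})
```

&lt;p&gt;Using an attribute instead of a gauge per queue keeps the metric namespace stable even as queues come and go.&lt;/p&gt;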

&lt;h2&gt;
  
  
  3. Make It Production Ready
&lt;/h2&gt;

&lt;p&gt;While the above approach is great for experimentation, you’ll likely need a more robust solution for a production environment. That’s where resilience and scalability come into play.&lt;/p&gt;

&lt;p&gt;In production, continuously monitoring queues isn’t just about pulling metrics. You need to ensure the system is reliable, scales with demand, and handles potential failures (such as network issues or large volumes of data). For example, you wouldn’t want a failed query to halt your monitoring process.&lt;/p&gt;
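&lt;p&gt;One simple building block for that resilience is isolating failures per queue, so a single failing lookup is logged rather than halting the whole poll. A hedged sketch; the client objects are duck-typed stand-ins for the SDK queue clients used earlier:&lt;/p&gt;

```python
import logging

logger = logging.getLogger(__name__)

def poll_queues_resiliently(queue_clients):
    """Return {name: count} for the queues that could be read,
    logging (rather than raising) failures for the rest.

    `queue_clients` maps queue names to objects exposing
    `get_queue_properties()` with an `approximate_message_count`.
    """
    lengths = {}
    for name, client in queue_clients.items():
        try:
            props = client.get_queue_properties()
            lengths[name] = props.approximate_message_count
        except Exception:
            # One broken queue must not stop the others from being polled.
            logger.exception("Failed to read length of queue %s", name)
    return lengths
```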

&lt;p&gt;If you're interested in seeing how this can be made production-ready, I’ve created a sample project: &lt;a href="https://github.com/MahrRah/azure-storage-queue-monitor" rel="noopener noreferrer"&gt;azure-storage-queue-monitor&lt;/a&gt;. This project wraps everything we’ve discussed into an &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp" rel="noopener noreferrer"&gt;Azure Function&lt;/a&gt; that runs on a timer trigger. It handles resilience, concurrency, and scales with your queues, ensuring you can monitor them reliably over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you have the steps to track individual queue lengths and emit them as custom metrics, you can set this up for your own environment. If you give this a try, feel free to share your experience or improvements—I'd love to hear your thoughts and help if you encounter any issues!&lt;/p&gt;

&lt;p&gt;Happy queue monitoring! 🎉&lt;/p&gt;

</description>
      <category>azurefunctions</category>
      <category>tutorial</category>
      <category>azure</category>
      <category>python</category>
    </item>
    <item>
      <title>How to use Azure VM metadata service to automate post-provisioning metadata configuration in your IaC for VMSS</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Thu, 10 Aug 2023 06:57:50 +0000</pubDate>
      <link>https://forem.com/mahrrah/how-to-use-azure-vm-metadata-service-to-automate-post-provisioning-metadata-configuration-in-your-iac-for-vmss-32g9</link>
      <guid>https://forem.com/mahrrah/how-to-use-azure-vm-metadata-service-to-automate-post-provisioning-metadata-configuration-in-your-iac-for-vmss-32g9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR: How to use &lt;code&gt;cloud-init&lt;/code&gt; for Linux VMs and &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; for Windows VMs to create a .env file on the VM containing VM metadata from &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure VM metadata service&lt;/a&gt; when using Azure VM Scale Sets&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/" rel="noopener noreferrer"&gt;Virtual Machines&lt;/a&gt; or &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/overview" rel="noopener noreferrer"&gt;Virtual Machine Scale Sets&lt;/a&gt; on Azure, it often becomes extremely useful to have certain VM metadata accessible to your applications. This type of metadata (like ID, name, private IP, etc.) is normally generated at provisioning time, and an automated way for applications to access it comes in handy.&lt;/p&gt;

&lt;p&gt;Azure provides an amazing service called the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure VM metadata service&lt;/a&gt;, which can be accessed from within a VM to retrieve all VM-specific information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; Metadata:true &lt;span class="nt"&gt;--noproxy&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt; &lt;span class="s2"&gt;"http://169.254.169.254/metadata/instance?api-version=2021-02-01"&lt;/span&gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this command is useful, integrating it into your Infrastructure as Code (IaC) can automate the process and ensure scalability.&lt;/p&gt;

&lt;p&gt;In this blog, we'll explore how to package the VM metadata service call into a script, store the metadata in a file, and incorporate this process into both Windows and Linux VMs in a VMSS setup. &lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Generalized Metadata Retrieval Script
&lt;/h2&gt;

&lt;p&gt;When looking at the VM metadata service endpoint from Azure, everything other than the IP appears to be generic. A closer reading of the Azure documentation, however, reveals that this "magic" IP is the same for &lt;strong&gt;all&lt;/strong&gt; VMs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Azure's instance metadata service is a RESTful endpoint available to all IaaS VMs created via the new Azure Resource Manager. [..] The [VM metadata service] endpoint is available at a well-known non-routable IP address (169.254.169.254) that can be accessed only from within the VM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This allows us to easily package the call up in a script and output the metadata in our needed format. For the sake of this blog, we will simply create a file that will contain the information we need.&lt;/p&gt;

&lt;p&gt;Let's proceed with the implementation details for both Windows and Linux VMs. The full code can be found &lt;a href="https://github.com/MahrRah/vmss-vm-metatdata-retrival-sample" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
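&lt;p&gt;Independent of the OS-specific scripts below, the formatting half of the job is the same everywhere: pick fields out of the parsed IMDS &lt;code&gt;compute&lt;/code&gt; object and render them as &lt;code&gt;.env&lt;/code&gt; lines. A small sketch of that step; the helper name, the &lt;code&gt;VM_&lt;/code&gt; prefix, and the default key selection are illustrative choices, not part of the metadata service itself:&lt;/p&gt;

```python
def compute_metadata_to_env(compute, keys=("name", "vmId", "location")):
    """Render selected fields of the IMDS `compute` object as .env lines,
    e.g. VM_NAME=..., VM_VMID=..., VM_LOCATION=...

    `compute` is the parsed `compute` section of the instance metadata
    document (a dict); missing keys are simply skipped.
    """
    lines = []
    for key in keys:
        if key in compute:
            lines.append(f"VM_{key.upper()}={compute[key]}")
    return "\n".join(lines)
```

&lt;p&gt;Feeding this the JSON returned by the &lt;code&gt;curl&lt;/code&gt; call shown earlier would yield the same kind of file the scripts below produce.&lt;/p&gt;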

&lt;h3&gt;
  
  
  Windows VMs: Utilizing Azure Custom Script Extension
&lt;/h3&gt;

&lt;p&gt;For Windows VMs, the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; is a powerful tool to execute post-provisioning scripts. Within the script, we can use the VM metadata service to retrieve the VM name and store it in a file under &lt;code&gt;C:\&lt;/code&gt; called &lt;code&gt;vm-metadata.env&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# vm-metadata.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$vmName&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Invoke-RestMethod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@{&lt;/span&gt;&lt;span class="s2"&gt;"Metadata"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://169.254.169.254/metadata/instance/compute/name?api-version=2021-02-01&amp;amp;format=text"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"VM_NAME=&lt;/span&gt;&lt;span class="nv"&gt;$vmName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-File&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-FilePath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;C:\vm-metadata.env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Append&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the IaC definition, the above script can be passed either via an Azure storage account or from GitHub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: vmssName
  location: location
  ...
  properties: {
    singlePlacementGroup: null
    platformFaultDomainCount: 1
    virtualMachineProfile: {
      extensionProfile: {
        extensions: [ {
            name: 'CustomScriptExtension'
            properties: {
              publisher: 'Microsoft.Compute'
              type: 'CustomScriptExtension'
              typeHandlerVersion: '1.10'
              settings: {
                commandToExecute: 'powershell -ExecutionPolicy Unrestricted -File vm-metadata.ps1'
                fileUris: [ '&amp;lt;link-to-file&amp;gt;' ]
              }
            }
          } ]
      }
    }
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linux VMs: Harnessing cloud-init
&lt;/h3&gt;

&lt;p&gt;For Linux VMs, leveraging the native &lt;a href="https://cloudinit.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;cloud-init&lt;/code&gt;&lt;/a&gt; tool simplifies the process.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: We could, however, also use the same &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; as we did for Windows here. Check out the docs for that &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-linux" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Amongst many other things, the &lt;code&gt;cloud-init&lt;/code&gt; definition allows you to specify one or more commands in the &lt;code&gt;runcmd&lt;/code&gt; section, which should run after the initial startup. Just like for the PowerShell script, the VM metadata is called and the extracted VM name is stored in the &lt;code&gt;vm-metadata.env&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#cloud-config&lt;/span&gt;
&lt;span class="na"&gt;runcmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt;  &lt;span class="s"&gt;vmName=$(curl -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance/compute/name?api-version=2021-02-01&amp;amp;format=text") &amp;amp;&amp;amp; echo "VM_NAME=${vmName}" &amp;gt;&amp;gt; vm-metadata.env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to regular VMs, the VMSS allows you to set the &lt;code&gt;customData&lt;/code&gt; property when defining your OS profile. It behaves the same way as it does for a VM deployment with &lt;code&gt;cloud-init&lt;/code&gt;, expecting the file to be passed as a base64-encoded string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;param cloudInitScript string = loadFileAsBase64('./cloud-init.yaml')

...

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: '${prefix}-vmss'
  location: location
  dependsOn: [
    vmssLB
    vmssNSG
  ]
  sku: {
    name: 'Standard_DS1_v2'
    capacity: 1
  }
  properties: {
    singlePlacementGroup: null
    platformFaultDomainCount: 1
    virtualMachineProfile: {
      osProfile: {
        computerNamePrefix: 'vmss'
        adminUsername: 'azureuser'
        adminPassword: adminPassword
        customData: cloudInitScript
      }
      ...

    }
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
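&lt;p&gt;Bicep's &lt;code&gt;loadFileAsBase64&lt;/code&gt; does the encoding for you; if you ever need to produce the same &lt;code&gt;customData&lt;/code&gt; value outside of Bicep (say, from a deployment script), it is just the standard Base64 encoding of the file's bytes. A minimal sketch:&lt;/p&gt;

```python
import base64

def to_custom_data(cloud_init_text):
    """Base64-encode a cloud-init document for the VMSS customData
    property (the equivalent of Bicep's loadFileAsBase64)."""
    return base64.b64encode(cloud_init_text.encode("utf-8")).decode("ascii")
```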



&lt;p&gt;And with that, you know how to automatically retrieve VM metadata values for your applications from a VM in your VMSS pool :)&lt;/p&gt;

</description>
      <category>azure</category>
      <category>cloudcomputing</category>
      <category>vmss</category>
      <category>azureservices</category>
    </item>
    <item>
      <title>NVIDIA GPU Monitoring on Windows VMs: Tools and Techniques</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Thu, 10 Aug 2023 06:54:39 +0000</pubDate>
      <link>https://forem.com/mahrrah/nvidia-gpu-monitoring-on-windows-vms-tools-and-techniques-3257</link>
      <guid>https://forem.com/mahrrah/nvidia-gpu-monitoring-on-windows-vms-tools-and-techniques-3257</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; How to get NVIDIA GPU utilization on Windows VMs according to GPU mode. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the era of Machine Learning, OpenAI, and ChatGPT, GPUs have gained significant attention. Driven by the rapid growth of machine learning and rendering projects in various industries, GPUs' usage has become increasingly common, even extending beyond the realms of IT to fields like manufacturing and other non-IT sectors.&lt;/p&gt;

&lt;p&gt;However, it's important to note that unlike greenfield projects, most of these companies already possess preexisting IT ecosystems and infrastructures. When building upon such an ecosystem, the likelihood of encountering unconventional technology constellations increases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;One such scenario is NVIDIA GPU metrics retrieval in WDDM mode on Windows machines. While NVIDIA offers tools for Linux-based machines (for instance &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/index.html" rel="noopener noreferrer"&gt;DCGM&lt;/a&gt;), there are fewer comprehensive tools available for Windows-based workloads. Furthermore, these tools might not adequately cover all required use cases simultaneously.&lt;/p&gt;

&lt;p&gt;In this blog, my aim is to guide you through various methods of accessing NVIDIA GPU adapter and process-level utilization on Windows VMs. Hopefully, this can be of assistance to someone out there :)&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA tools for GPU Utilization
&lt;/h2&gt;

&lt;p&gt;There are two main NVIDIA tools that offer access to GPU utilization: NVAPI and NVML.&lt;br&gt;
These tools differ in the level of granularity they offer for GPU load, and some are restricted to functioning in only one of the two GPU modes.&lt;/p&gt;

&lt;p&gt;Let's begin by examining the details you can extract from each tool, and in the following section, we will explore the distinctions between the GPU mode approaches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NVAPI&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;NVAPI&lt;/code&gt; (NVIDIA API)&lt;/a&gt; is NVIDIA's SDK that gives direct access to the NVIDIA GPU and driver on Windows-based platforms. However, it exclusively provides access to GPU adapter-level utilization and does not offer process-level information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NVML&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://developer.nvidia.com/nvidia-management-library-nvml" rel="noopener noreferrer"&gt;&lt;code&gt;NVML&lt;/code&gt; (NVIDIA Management Library)&lt;/a&gt;, on the other hand, is a C-based API designed to access various states of the GPU and is the same tool used by &lt;a href="https://developer.nvidia.com/nvidia-system-management-interface" rel="noopener noreferrer"&gt;&lt;code&gt;nvidia-smi&lt;/code&gt;&lt;/a&gt;. Unlike &lt;code&gt;NVAPI&lt;/code&gt;, &lt;code&gt;NVML&lt;/code&gt; allows access to both adapter and process level GPU utilization, making it a more comprehensive tool for monitoring and managing GPU performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GPU Modes
&lt;/h3&gt;

&lt;p&gt;When dealing with NVIDIA GPUs, it's crucial to be aware of the various modes they can be set to based on your requirements: WDDM and TCC. As mentioned above, not all tools are designed to handle both modes. Therefore, the next section will introduce the different approaches that can be used depending on the GPU mode.&lt;/p&gt;
&lt;h2&gt;
  
  
  TCC Mode Tools
&lt;/h2&gt;

&lt;p&gt;The TCC Mode serves as the computation mode of GPUs, enabled when the CUDA drivers are installed. In this mode, you can easily access adapter and process level GPU utilization using the common &lt;code&gt;nvml.dll&lt;/code&gt; provided by NVIDIA. You can write your own wrapper or leverage existing wrapper libraries and samples available.&lt;br&gt;
Here is a small list of &lt;code&gt;nvml&lt;/code&gt; wrappers in various languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jcbritobr/nvml-csharp" rel="noopener noreferrer"&gt;C# Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/henkelmax/nvmlj" rel="noopener noreferrer"&gt;Java Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pynvml/" rel="noopener noreferrer"&gt;Python Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
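&lt;p&gt;As an illustration with the Python wrapper, here is a hedged sketch using &lt;code&gt;pynvml&lt;/code&gt;. The calls mirror the NVML C API (&lt;code&gt;nvmlInit&lt;/code&gt;, &lt;code&gt;nvmlDeviceGetHandleByIndex&lt;/code&gt;, &lt;code&gt;nvmlDeviceGetUtilizationRates&lt;/code&gt;); the import is deferred into the function, since the package and an NVIDIA driver may not be present on every machine:&lt;/p&gt;

```python
def format_gpu_utilization(index, gpu_percent, mem_percent):
    """Render one adapter's utilization as a log line."""
    return f"GPU {index}: {gpu_percent}% compute, {mem_percent}% memory"

def print_gpu_utilization():
    """Query adapter-level utilization via NVML (TCC mode).

    Requires the third-party `pynvml` package and an NVIDIA driver;
    the function names mirror NVML's C API.
    """
    import pynvml  # deferred: only needed when actually querying a GPU

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # UtilizationRates exposes .gpu and .memory as percentages.
            rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(format_gpu_utilization(i, rates.gpu, rates.memory))
    finally:
        pynvml.nvmlShutdown()
```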
&lt;h2&gt;
  
  
  WDDM Mode Tools
&lt;/h2&gt;

&lt;p&gt;On the other hand, the WDDM mode is primarily used for rendering work on GPUs and requires installing the GRID drivers. When operating in WDDM mode, process level metrics can no longer be accessed via the &lt;code&gt;nvml.dll&lt;/code&gt;. Instead, these metrics are now routed through the Windows Performance Counter, requiring a different approach to retrieve them.&lt;/p&gt;

&lt;p&gt;In the next section, we will delve into a small example of how to retrieve GPU load at both the process and overall levels when operating in WDDM mode. This will allow you to access the PerformanceCounter from your code and retrieve GPU memory utilization. We'll focus on the two categories: &lt;code&gt;GPU Process Memory&lt;/code&gt; and &lt;code&gt;GPU Adapter Memory&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: There are, however, many more categories. If you need to access a list of them, the PerformanceCounterCategory provides a static method to retrieve them all: &lt;code&gt;PerformanceCounterCategory.GetCategories()&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Adapter level metrics
&lt;/h4&gt;

&lt;p&gt;As the name &lt;code&gt;GPU Adapter Memory&lt;/code&gt; suggests, this category contains a list of adapters and their load in bytes. The code snippet below demonstrates how to retrieve the load for each adapter and print it in a log line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.Diagnostics;

...

var category = new PerformanceCounterCategory("GPU Adapter Memory");
var adapters = category.GetInstanceNames();

foreach (var adapter in adapters)
{
    var counters = category.GetCounters(adapter);

    foreach (var counter in counters)
    {
        if (counter.CounterName == "Total Committed")
        {
            var value = counter.NextValue();
            Console.WriteLine($"GPU Memory load on adapter {adapter} is {value} bytes.");
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Process level metrics
&lt;/h4&gt;

&lt;p&gt;As before, the category name &lt;code&gt;GPU Process Memory&lt;/code&gt; indicates that it contains a list of processes and their GPU memory load in bytes.&lt;br&gt;
Again, the code snippet will simply print each process and its respective load as a demonstration. This code can be adapted to publish metrics for collection by other tools (e.g. &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry collector&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.Diagnostics;

...

var performanceCounterCategory = new PerformanceCounterCategory("GPU Process Memory");
var processes = performanceCounterCategory.GetInstanceNames();
foreach (var process in processes)
{
    var counters = performanceCounterCategory.GetCounters(process);
    var totalCommittedCounter = counters.FirstOrDefault(counter =&amp;gt; counter.CounterName == "Total Committed");
    if (totalCommittedCounter != null)
    {
        var value = totalCommittedCounter.NextValue();
        Console.WriteLine($"GPU Memory load of process {process} is {value} bytes.");
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This category offers a significant advantage over &lt;code&gt;GPU Adapter Memory&lt;/code&gt;, as it provides the ability to filter the 'total load' based on specific processes. This can be particularly helpful when you want to monitor the GPU memory load of specific applications or processes.&lt;/p&gt;

&lt;p&gt;For instance, let's say you have three particular processes of interest, and you want to focus on monitoring only their GPU memory load. In this scenario, utilizing the GPU Process Memory category and applying filters for your targeted processes becomes highly valuable. This enables you to extract precise insights into the GPU memory utilization of these specific applications, allowing for more accurate performance analysis and resource allocation.&lt;/p&gt;
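The filtering described above can be sketched as follows (plain Python for brevity; the process names and byte values are hypothetical, and mapping real counter instance names, which typically encode the process ID, to friendly process names is left out):

```python
def total_committed_for(samples, processes_of_interest):
    """Sum the 'Total Committed' bytes of only the processes we care about.

    `samples` maps a friendly process name to its committed GPU memory in bytes.
    """
    return sum(v for name, v in samples.items() if name in processes_of_interest)

# Hypothetical readings taken from the "GPU Process Memory" category:
samples = {"renderer": 512_000_000, "encoder": 256_000_000, "explorer": 64_000_000}
print(total_committed_for(samples, {"renderer", "encoder"}))  # 768000000
```

The same idea carries over directly to the C# snippet above: iterate the instance names, keep only the ones you target, and sum their counter values.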

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, as GPUs continue to be a cornerstone of modern computing, understanding the nuances of their management is crucial. While challenges may arise due to the different ecosystems, the tools and techniques mentioned above should give you a head start in effectively monitoring GPU resources for Windows-based workloads.&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>gpu</category>
      <category>observability</category>
      <category>windows</category>
    </item>
    <item>
      <title>Refactoring GitOps repository to support both real-time and reconciliation window changes</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Fri, 13 Jan 2023 09:36:49 +0000</pubDate>
      <link>https://forem.com/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc</link>
      <guid>https://forem.com/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Restructuring a GitOps repository to enable multiple reconciliation types, e.g. real-time and reconciliation window changes, with the approach described in the &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;previous part&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For some scenarios allowing only updates to be applied during a reconciliation window is not enough.&lt;br&gt;
There are cases when some application resources should be managed in real time, but others are still only allowed to change during a reconciliation window.&lt;br&gt;
The example we use here is a &lt;code&gt;nginx&lt;/code&gt; deployment to the cluster, which contains a &lt;code&gt;Deployment&lt;/code&gt;, &lt;code&gt;Service&lt;/code&gt;, and a &lt;code&gt;ConfigMap&lt;/code&gt; manifest.&lt;br&gt;
The &lt;code&gt;ConfigMap&lt;/code&gt;, which defines the &lt;code&gt;nginx.conf&lt;/code&gt;, should be manageable in real time. However, the &lt;code&gt;Deployment&lt;/code&gt; and the &lt;code&gt;Service&lt;/code&gt; should only be changed within a reconciliation window.&lt;/p&gt;

&lt;p&gt;Hence, the problem statement changes slightly from the last part:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We want to enable two ways of applying changes to a cluster using Flux:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Real-time changes:&lt;/strong&gt; Representing the default behavior of Flux when it comes to reconciling changes.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Reconciliation windows changes:&lt;/strong&gt; Predefined time windows in which a change can be applied to the resource by Flux.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can still use the core approach shown &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;here&lt;/a&gt; to solve our new problem. However, we need to make some adjustments to how we organize our GitOps repository, to enable real-time as well as reconciliation window changes.&lt;/p&gt;

&lt;p&gt;Even though we are only demonstrating the restructuring of this GitOps repository with two reconciliation types, this approach can easily be extended to more. Just note that for each new type of reconciliation window, a corresponding set of CronJobs is needed to manage the new windows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IMPORTANT:&lt;/strong&gt; If you haven't already read the &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;first part&lt;/a&gt;, go back and do so, as we will use its approach on how to enable the reconciliation window in this blog.&lt;/li&gt;
&lt;li&gt;Intermediate knowledge of &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, &lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;K8s&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Core Principles
&lt;/h2&gt;

&lt;p&gt;Before we start restructuring the repository, it might be useful to understand why we have to do so in the first place.&lt;/p&gt;

&lt;p&gt;As covered in the previous blog, to be able to control the reconciliation cycle differently for a group of resources, these resources need to be managed by an independent &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;Because of this, the goal of the following sections is:&lt;br&gt;
"Restructure the GitOps repository such that its resources can be managed by one of the N &lt;code&gt;Kustomization&lt;/code&gt; resources we will create,&lt;br&gt;
where N is the number of schedules for applying changes."&lt;/p&gt;

&lt;p&gt;As in this blog we are only interested in real-time and reconciliation window changes, N is equal to 2.&lt;/p&gt;
&lt;h2&gt;
  
  
  Set up
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Set up your applications or components
&lt;/h3&gt;

&lt;p&gt;Let's start with the smallest unit of grouping we have in our GitOps repository: &lt;code&gt;apps&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Looking at the example in &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample1" rel="noopener noreferrer"&gt;this sample&lt;/a&gt;, under &lt;code&gt;apps&lt;/code&gt; we have an &lt;code&gt;nginx&lt;/code&gt; folder, which contains the &lt;code&gt;Deployment&lt;/code&gt;, a &lt;code&gt;Service&lt;/code&gt;, and a &lt;code&gt;ConfigMap&lt;/code&gt; manifest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── kustomization.yaml
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned, we now want to make sure we can change the &lt;code&gt;nginx&lt;/code&gt; server configuration, defined in the &lt;code&gt;configmap.yaml&lt;/code&gt;, in real time, while infrastructure changes such as the deployment and the service should only happen between Monday 8 am and Thursday 5 pm.&lt;/p&gt;

&lt;p&gt;To enable this, the first step is to make sure we can split resources that can be changed in real time from resources that can only change state during a reconciliation window, from &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;&lt;code&gt;kustomize&lt;/code&gt;&lt;/a&gt;'s point of view.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are not familiar with how &lt;code&gt;kustomize&lt;/code&gt; is used to manage resources check out the official doc from Kubernetes on this at &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;Overview of Kustomize&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the ways we can achieve this is by splitting all the resources for each application we have defined under &lt;code&gt;apps/&lt;/code&gt; (see &lt;a href="https://fluxcd.io/flux/guides/repository-structure/#repository-structure" rel="noopener noreferrer"&gt;default GitOps folder structure for mono repos&lt;/a&gt;) into two versions. These versions' sole purpose is to package the resources to be either managed by the real-time or the reconciliation window &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;We can then split all manifest files into these two subfolders and add the respective suffixes to the subfolders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time changes: &lt;code&gt;-rt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reconciliation windows changes: &lt;code&gt;-rw&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── kustomization.yaml
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enabling real-time and reconciliation window changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── nginx-rt
    │   ├── kustomization.yaml
    │   └── configmap.yaml
    └── nginx-rw
        ├── kustomization.yaml
        ├── deployment.yaml
        └── service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the result of this split in the sample repository &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample2/apps/nginx" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
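As a sketch, the two new &lt;code&gt;kustomization.yaml&lt;/code&gt; files would each bundle only their half of the manifests (file names taken from the tree above; the exact content lives in the linked sample):

```yaml
# apps/nginx/nginx-rt/kustomization.yaml — only the real-time resources
resources:
  - configmap.yaml
---
# apps/nginx/nginx-rw/kustomization.yaml — only the window-managed resources
resources:
  - deployment.yaml
  - service.yaml
```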

&lt;h3&gt;
  
  
  2. Set up your clusters
&lt;/h3&gt;

&lt;p&gt;The next step is to restructure the clusters directory. The goal is to create two independent &lt;code&gt;Kustomization&lt;/code&gt; resources, which means we need two entry points to point each of them to.&lt;br&gt;
For that we split the previous &lt;code&gt;apps&lt;/code&gt; folder into two subfolders, &lt;code&gt;apps-rt&lt;/code&gt; and &lt;code&gt;apps-rw&lt;/code&gt;,&lt;br&gt;
where &lt;code&gt;./cluster/&amp;lt;cluster_name&amp;gt;/apps/apps-rt&lt;/code&gt; will be the entry point for the real-time &lt;code&gt;Kustomization&lt;/code&gt; resource and &lt;code&gt;./cluster/&amp;lt;cluster_name&amp;gt;/apps/apps-rw&lt;/code&gt; for the reconciliation window one.&lt;/p&gt;

&lt;p&gt;Original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│    └── nginx
└── infra
     └── reconciliation-windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enabling real-time and reconciliation window changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   │   └── nginx
│   └── apps-rt
│       └── nginx
└── infra
      └── reconciliation-windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to add the &lt;code&gt;kustomization.yaml&lt;/code&gt; files and make sure they reference the right resources.&lt;/p&gt;

&lt;p&gt;Let's first have a look at the &lt;code&gt;kustomization.yaml&lt;/code&gt; in the &lt;code&gt;clusters/cluster-1/apps/apps-rw&lt;/code&gt; and &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; setup.&lt;br&gt;
Both &lt;code&gt;apps-rw&lt;/code&gt; and &lt;code&gt;apps-rt&lt;/code&gt; will have a root &lt;code&gt;kustomization.yaml&lt;/code&gt; which points to all applications deployed onto the cluster. In our example, this is only the &lt;code&gt;nginx&lt;/code&gt; app.&lt;/p&gt;

&lt;p&gt;Folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   │   ├── kustomization.yaml
│   │   └── nginx
│   └── apps-rt
│       ├── kustomization.yaml
│       └── nginx
└── infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kustomization.yaml&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rw/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rt/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going one level deeper, both the &lt;code&gt;nginx&lt;/code&gt; folders under &lt;code&gt;clusters/cluster-1/apps/apps-rw&lt;/code&gt; and &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; have a similar setup.&lt;br&gt;
To avoid going over the same thing twice, we will only look at &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt;. To see the setup of &lt;code&gt;apps-rw&lt;/code&gt; you can check the sample &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample2/clusters/cluster-1/apps/apps-rw" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   └── apps-rt
│       ├── kustomization.yaml
│       └── nginx
│           ├── namespace.yaml
│           └── kustomization.yaml
└── infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kustomization.yaml&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rt/nginx/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./../../../../../apps/nginx/nginx-rt&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./namespace.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown above, the application resources referenced under &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; are the resources we bundled up under &lt;code&gt;apps/nginx/nginx-rt&lt;/code&gt; and should now only contain resources that can be changed in real-time.&lt;/p&gt;

&lt;p&gt;And just like that you have separated all configurations to be managed by different &lt;code&gt;Kustomization&lt;/code&gt; resources!&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Set up &lt;code&gt;Kustomization&lt;/code&gt; resources
&lt;/h3&gt;

&lt;p&gt;Our GitOps repository is ready now, but how do we set up the &lt;code&gt;Kustomization&lt;/code&gt; resources?&lt;br&gt;
Let's first create a Flux &lt;code&gt;Source&lt;/code&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create &lt;span class="nb"&gt;source &lt;/span&gt;git &lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/&amp;lt;github-handle&amp;gt;/flux-reconciliation -windows-sample"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;username&amp;gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;PAT&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--git-implementation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;libgit2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--silent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need two &lt;code&gt;Kustomization&lt;/code&gt; resources for the apps and one for the infra.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization infra &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/infra"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization apps-rt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--depends-on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;infra &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/apps/apps-rt"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization apps-rw &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--depends-on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; apps-rt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/apps/apps-rw"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now this should give you something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;user@cluster:~&lt;span class="nv"&gt;$ &lt;/span&gt;flux get kustomization
NAME    REVISION        SUSPENDED READY MESSAGE
infra   main/7cf3aaf  False     True  Applied revision: main/7cf3aaf
apps-rt main/7cf3aaf  False     True  Applied revision: main/7cf3aaf
apps-rw main/7cf3aaf  False     True  Applied revision: main/7cf3aaf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Now that the cluster is set up, we can upgrade the &lt;code&gt;nginx&lt;/code&gt; version and change the configuration &lt;code&gt;nginx.conf&lt;/code&gt; to include the &lt;code&gt;nginx_status&lt;/code&gt; endpoint and see how one is visible right away, while the other needs a reconciliation window to open.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Initial state
&lt;/h4&gt;

&lt;p&gt;Before we make any changes, let's check the current state of the &lt;code&gt;nginx&lt;/code&gt; deployment.&lt;br&gt;
Get the public &lt;code&gt;ip&lt;/code&gt; address of the machine your cluster is running on and navigate to &lt;code&gt;http://&amp;lt;ip&amp;gt;:8080/&lt;/code&gt;; we should see something like this.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are running it locally, you can replace the &lt;code&gt;ip&lt;/code&gt; with &lt;code&gt;localhost&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfxyhpng1unsjfd6u6nn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfxyhpng1unsjfd6u6nn.jpg" alt=" raw `Nginx` endraw  landing page" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can download the &lt;code&gt;nginx.conf&lt;/code&gt; file by clicking on it and see what configuration is currently mounted into the &lt;code&gt;nginx&lt;/code&gt; pod from the &lt;code&gt;ConfigMap&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Change state
&lt;/h4&gt;

&lt;p&gt;The next step is to change the state of our application.&lt;br&gt;
To do so, we can bump the image version from &lt;code&gt;1.14.2&lt;/code&gt; to the (currently) newest image &lt;code&gt;1.23.3&lt;/code&gt; inside &lt;code&gt;apps/nginx/nginx-rw/deployment.yaml&lt;/code&gt;. In the same commit, we can add the configuration shown below to the &lt;code&gt;nginx.conf&lt;/code&gt; section in the &lt;code&gt;apps/nginx/nginx-rt/configmaps.yaml&lt;/code&gt; file to include the new status endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;location&lt;/span&gt; /&lt;span class="n"&gt;nginx_status&lt;/span&gt; {
                &lt;span class="n"&gt;stub_status&lt;/span&gt;;
                &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="n"&gt;all&lt;/span&gt;;
            }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. See real-time changes
&lt;/h4&gt;

&lt;p&gt;Now if we go back to the browser, refresh the page and re-download the file &lt;code&gt;nginx.conf&lt;/code&gt;, we should see the new section we just added.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; It might take up to 2 minutes in the worst case for the &lt;code&gt;Source&lt;/code&gt; and then &lt;code&gt;Kustomization&lt;/code&gt; resource to reconcile&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  4. Wait for reconciliation window to open
&lt;/h4&gt;

&lt;p&gt;If we now wait until the next reconciliation window opens, the pod should be restarted, and we should be able to see the new version, either by checking the resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod  &amp;lt;nginx-podname&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you don't want to access the machine directly, you can go to a non-existing route in the browser, e.g. &lt;code&gt;http://&amp;lt;ip&amp;gt;:8080/settings/&lt;/code&gt;. There you should see a standard &lt;code&gt;nginx&lt;/code&gt; 404 page which contains the currently deployed version at the bottom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Let's summarize what we did when it came to restructuring the repository.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We separated all application resources into two sub-versions. One for resources which can be changed in real-time and one for resources that can only be changed when a reconciliation window is open.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We split the &lt;code&gt;clusters&lt;/code&gt; directory in such a way, so that we can create two independent &lt;code&gt;Kustomization&lt;/code&gt; resources, which reference either one or the other application sub-version.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this we could create the infra and the two apps &lt;code&gt;Kustomization&lt;/code&gt; resources and start using the solution, as demonstrated.&lt;/p&gt;

&lt;p&gt;So, at its core it boils down to separating the resource definitions in such a way that each is only managed by one of the &lt;code&gt;Kustomization&lt;/code&gt; resources created. This can be done as shown above, or slightly differently to fit your needs.&lt;/p&gt;

&lt;p&gt;But hopefully, after this second part, you should be good to go with using these reconciliation windows and know how to tweak the setup to fit your use case :)&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to enable reconciliation windows using Flux and K8s native components</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Fri, 13 Jan 2023 09:35:29 +0000</pubDate>
      <link>https://forem.com/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i</link>
      <guid>https://forem.com/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How to enable reconciliation windows for a GitOps Setup using the suspension feature of the flux &lt;code&gt;Kustomize&lt;/code&gt; resource and K8s CronJobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; to manage a K8s cluster, every new change in your repository will be immediately applied to the cluster’s state. In some use cases, the newest changes to a GitOps repository should only be applied to the cluster within a designated time window. For example, the cluster should reconcile to the newest changes of the GitOps repository only between Monday 8 am and Thursday 5 pm. Any change coming into the GitOps repository on Friday or the weekend will have to wait till Monday 8 am to be applied.&lt;/p&gt;

&lt;p&gt;What are the scenarios this could be used for in real life?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sometimes the cluster is connected to external systems, which need to be in maintenance mode before updates can be applied.&lt;/li&gt;
&lt;li&gt;You want to be able to determine a designated time window in which the next changes go into production, so that in case of issues you are able to react quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So our problem in short:&lt;br&gt;
&lt;em&gt;We want to be able to predefine time windows to deploy all new changes to a cluster that is managed by Flux.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To make things easier, let's call these time windows "reconciliation windows" and dig right into how to solve the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate knowledge of &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, &lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;K8s&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core principles
&lt;/h2&gt;

&lt;p&gt;Now how do we create such reconciliation windows using Flux and K8s native resources?&lt;br&gt;
To get there, we first need to understand how the Flux &lt;a href="https://fluxcd.io/flux/components/kustomize/" rel="noopener noreferrer"&gt;&lt;code&gt;Kustomization&lt;/code&gt;&lt;/a&gt; and Flux &lt;a href="https://fluxcd.io/flux/components/source/" rel="noopener noreferrer"&gt;&lt;code&gt;Source&lt;/code&gt;&lt;/a&gt; resources work, and how we can leverage them to solve our problem.&lt;/p&gt;

&lt;p&gt;When setting up a cluster with Flux there will always be a &lt;code&gt;Source&lt;/code&gt; resource that reconciles the changes from the GitOps repository into the cluster.&lt;br&gt;
After that, the &lt;code&gt;Kustomization&lt;/code&gt; resource will poll the newest changes from the &lt;code&gt;Source&lt;/code&gt; resource and apply them to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijxhxd2g7br5szq7l8gb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijxhxd2g7br5szq7l8gb.gif" alt="How Flux controls the cluster using the  raw `Source` endraw  and  raw `Kustomization` endraw  resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now interestingly enough both of the reconciliations of these resources can be suspended.&lt;/p&gt;

&lt;p&gt;Suspend &lt;code&gt;Source&lt;/code&gt;/&lt;code&gt;Kustomization&lt;/code&gt; resource from reconciling&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

flux &lt;span class="nb"&gt;suspend source&lt;/span&gt; &amp;lt;name&amp;gt;
flux &lt;span class="nb"&gt;suspend &lt;/span&gt;kustomization &amp;lt;name&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Resume reconciling of &lt;code&gt;Source&lt;/code&gt;/&lt;code&gt;Kustomization&lt;/code&gt; resource&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

flux resume &lt;span class="nb"&gt;source&lt;/span&gt; &amp;lt;name&amp;gt;
flux resume kustomization &amp;lt;name&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Suspending the &lt;code&gt;Kustomization&lt;/code&gt; resource means no changes are applied to the cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e39738wltv8ph51r1l9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e39738wltv8ph51r1l9.gif" alt="Suspending a  raw `Kustomization` endraw  resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since our goal is to suspend the reconciliation of the cluster state, suspending just the &lt;code&gt;Kustomization&lt;/code&gt; resource is enough. The &lt;code&gt;Source&lt;/code&gt; resource can continue syncing content at the predefined interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schedule opening and closing of reconciliation windows
&lt;/h2&gt;

&lt;p&gt;So far so good. But how do we automate this?&lt;br&gt;
Well, K8s has already native ways to support scheduling of jobs, which are &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;&lt;code&gt;CronJob&lt;/code&gt; resources&lt;/a&gt;, so why not use them?&lt;/p&gt;

&lt;p&gt;With Cron Jobs we can create an &lt;code&gt;open-reconciliation-window-job&lt;/code&gt; and a &lt;code&gt;close-reconciliation-window-job&lt;/code&gt; which will use the Flux CLI and a &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" rel="noopener noreferrer"&gt;&lt;code&gt;ServiceAccount&lt;/code&gt;&lt;/a&gt; to resume/suspend the kustomizations.&lt;br&gt;
Let's use the “No-deployment Friday” example: a reconciliation window that opens every Monday at 8:00 am and closes every Thursday at 5:00 pm. This is how the jobs would look.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The &lt;code&gt;ServiceAccount&lt;/code&gt; and the corresponding &lt;code&gt;RoleBinding&lt;/code&gt; and &lt;code&gt;Role&lt;/code&gt; are needed to give the job the right access to perform operations on the cluster resources. For more information, see the &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" rel="noopener noreferrer"&gt;K8s docs on configuring service accounts&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
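&lt;p&gt;As a sketch, the RBAC setup for these jobs could look roughly like this. The names and namespaces are illustrative and need to match the ones referenced in the CronJobs below; the Flux CLI suspends and resumes by patching the &lt;code&gt;Kustomization&lt;/code&gt; objects, hence the &lt;code&gt;patch&lt;/code&gt; verb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# rbac.yaml (illustrative names and namespaces)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-job-runner
  namespace: jobs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kustomization-suspender
  namespace: flux-system
rules:
  # Allow reading and patching Flux Kustomization resources,
  # which is what `flux suspend/resume kustomization` does
  - apiGroups: ["kustomize.toolkit.fluxcd.io"]
    resources: ["kustomizations"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kustomization-suspender
  namespace: flux-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kustomization-suspender
subjects:
  - kind: ServiceAccount
    name: sa-job-runner
    namespace: jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;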

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# open-reconciliation-window-job.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;open-reconciliation-window&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MON"&lt;/span&gt;
  &lt;span class="na"&gt;suspend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sa-job-runner&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/fluxcd/flux-cli:v0.36.0&lt;/span&gt;
              &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flux resume kustomization infra -n flux-system;&lt;/span&gt;
                  &lt;span class="s"&gt;flux resume kustomization apps -n flux-system;&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# close-reconciliation-window-job.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;close-reconciliation-window&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;17&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;THU"&lt;/span&gt;
  &lt;span class="na"&gt;suspend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sa-job-runner&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/fluxcd/flux-cli:v0.36.0&lt;/span&gt;
              &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flux suspend kustomization infra -n flux-system;&lt;/span&gt;
                  &lt;span class="s"&gt;flux suspend kustomization apps -n flux-system;&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: you can customize the window times by adjusting the cron expression set in &lt;code&gt;spec.schedule&lt;/code&gt;. There are a few online tools to help you understand how these cron expressions work, e.g. &lt;a href="https://crontab.guru/" rel="noopener noreferrer"&gt;crontab guru&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Scale by managing reconciliation windows with GitOps
&lt;/h2&gt;

&lt;p&gt;At this point, we have the capability to resume and suspend reconciliation, but we still need to create the &lt;code&gt;CronJobs&lt;/code&gt; manually for each cluster.&lt;/p&gt;

&lt;p&gt;Imagine we have a GitOps repository that manages 10+ clusters. These clusters probably won't all have their reconciliation windows at the same time. You also don't want to create these jobs manually, let alone maintain them if, for example, more &lt;code&gt;Kustomization&lt;/code&gt; resources get added to a cluster.&lt;/p&gt;

&lt;p&gt;Not to worry, there is also a solution for that ;)&lt;/p&gt;

&lt;p&gt;We are already using GitOps, so why not put the definition of the jobs into the repository as part of our infrastructure?&lt;br&gt;
And why not use Kustomize's &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing" rel="noopener noreferrer"&gt;patch functionality&lt;/a&gt; to overwrite the CronJobs' cron expressions, so the reconciliation window times can be customized for each cluster?&lt;/p&gt;
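&lt;p&gt;For example, a per-cluster overlay could patch only the schedules of the shared CronJobs. The base path and the schedule values here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# clusters/cluster-a/kustomization.yaml (illustrative paths and schedules)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/reconciliation-window
patches:
  # This cluster opens its window on Monday 6:00 am ...
  - target:
      kind: CronJob
      name: open-reconciliation-window
    patch: |-
      - op: replace
        path: /spec/schedule
        value: "0 6 * * MON"
  # ... and closes it on Thursday 6:00 pm
  - target:
      kind: CronJob
      name: close-reconciliation-window
    patch: |-
      - op: replace
        path: /spec/schedule
        value: "0 18 * * THU"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;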

&lt;p&gt;If that sounds interesting, check out the &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample1" rel="noopener noreferrer"&gt;full sample&lt;/a&gt;.&lt;br&gt;
Now, instead of having to manually create the &lt;code&gt;ClusterRole&lt;/code&gt;, &lt;code&gt;RoleBinding&lt;/code&gt;, &lt;code&gt;ServiceAccount&lt;/code&gt;, and &lt;code&gt;CronJobs&lt;/code&gt;, Flux takes care of that for us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9w776id1qpykc8vpqqk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9w776id1qpykc8vpqqk.gif" alt="Reconciliation windows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is how we can leverage Flux and K8s-native approaches to restrict the application of changes to a cluster to reconciliation windows.&lt;br&gt;
This approach has a few advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For clusters running on the edge, if connectivity goes down during a reconciliation window, simple changes will still reconcile normally, because the &lt;code&gt;Source&lt;/code&gt; resource has already pulled the newest changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Careful, this only works for image tag changes if there is a local container registry (e.g. a local ACR). Otherwise, the new images need to be pre-downloaded to the device.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The GitOps repository reflects the desired state of the cluster after a reconciliation window.&lt;/li&gt;
&lt;li&gt;No need to maintain a custom gateway or similar: all components used are open-source, and no custom logic is required.&lt;/li&gt;
&lt;li&gt;During the reconciliation windows, changes are applied just as we are used to from Flux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we are not solving with this, however, is scheduling fine-grained changes. As you might have noticed, the granularity ends at the &lt;code&gt;Kustomization&lt;/code&gt; resources the CronJobs suspend and resume: everything managed by such a resource is applied together, so individual configurations cannot be scheduled separately with this approach.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did that not solve your problem because your cluster needs real-time changes as well as changes within a reconciliation window? Not to worry, I've got you ;) Check out the &lt;a href="https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc"&gt;next part&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>flux</category>
      <category>gitops</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
