<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tim F</title>
    <description>The latest articles on Forem by Tim F (@tim-aero).</description>
    <link>https://forem.com/tim-aero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1063270%2Fdfac4428-f701-4876-99f9-dfca04be77bc.png</url>
      <title>Forem: Tim F</title>
      <link>https://forem.com/tim-aero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tim-aero"/>
    <language>en</language>
    <item>
      <title>Using Aerospike policies (correctly!)</title>
      <dc:creator>Tim F</dc:creator>
      <pubDate>Thu, 29 Feb 2024 23:28:25 +0000</pubDate>
      <link>https://forem.com/aerospike/using-aerospike-policies-correctly-j06</link>
      <guid>https://forem.com/aerospike/using-aerospike-policies-correctly-j06</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TcIUXDEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/tim-faulkes/using-aerospike-policies%2520copy_1709244867238.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TcIUXDEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/tim-faulkes/using-aerospike-policies%2520copy_1709244867238.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Developers familiar with Aerospike will know that the vast majority of the API calls take a &lt;a href="https://aerospike.com/docs/server/guide/policies.html"&gt;“policy”&lt;/a&gt; as a parameter. This policy is optional but can have a significant impact on the application. There are a number of parameters in policies that can be confusing to &lt;a href="https://aerospike.com/developer/"&gt;developers&lt;/a&gt;, and even creating the policies is frequently done incorrectly. This blog will attempt to clear up some of this confusion.&lt;/p&gt;

&lt;h2&gt;What are policies?&lt;/h2&gt;

&lt;p&gt;Each API call in Aerospike typically takes a policy as a parameter. For example, consider a simple put in Java:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"testSet"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; 
           &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Tim"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"age"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;312&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first parameter here is the policy; in this case, it is an instance of the &lt;code&gt;WritePolicy&lt;/code&gt; class. If &lt;code&gt;null&lt;/code&gt; is specified, then the default policy is used. Policies typically contain fields that fall into one of three categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Networking control&lt;/strong&gt;: How should the API call behave if something goes wrong? For example, if the server cannot be reached, should the call be retried on a different server?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application behavior&lt;/strong&gt;: Parameters that control how the API call should behave in various application-driven scenarios. For example, Aerospike does an “upsert” operation by default on a &lt;code&gt;put&lt;/code&gt;, selecting to either update or insert the record depending on whether it already exists or not. However, the application might require the record to be inserted only, failing if the record already exists in the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filtering&lt;/strong&gt;: Aerospike supports filters on operations, allowing almost all operations to become conditional. For example, an operation can be specified to retrieve a record if the &lt;code&gt;amount&lt;/code&gt; bin in that record is greater than a thousand.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three categories are all in the same policy object, with the networking control defined on the &lt;code&gt;Policy&lt;/code&gt; class, which is a superclass of other policies, such as &lt;code&gt;WritePolicy&lt;/code&gt;. Note that the &lt;code&gt;Policy&lt;/code&gt; class is used for read operations without a subclass; there is no &lt;code&gt;ReadPolicy&lt;/code&gt; class.&lt;/p&gt;

&lt;h2&gt;Networking control policies&lt;/h2&gt;

&lt;p&gt;The policy superclass defines a number of parameters related to networking. These are critical to optimizing application behavior but are often poorly understood. Let’s take a look at these and the interaction between them.&lt;/p&gt;

&lt;p&gt;These networking fields are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;connectTimeout&lt;/li&gt;
&lt;li&gt;maxRetries&lt;/li&gt;
&lt;li&gt;replica&lt;/li&gt;
&lt;li&gt;sleepBetweenRetries&lt;/li&gt;
&lt;li&gt;socketTimeout&lt;/li&gt;
&lt;li&gt;timeoutDelay&lt;/li&gt;
&lt;li&gt;totalTimeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A diagram is helpful in understanding the interaction between these fields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kLVBFb02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/tim-faulkes/using-aerospike-policies-correctly-api-call-sequence_1709244867669.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kLVBFb02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/tim-faulkes/using-aerospike-policies-correctly-api-call-sequence_1709244867669.png" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Figure 1: API call sequence&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;The diagram shows a timeline of what happens when an API call is submitted to the Aerospike Client Library (ACL) and can be read from left to right. &lt;/p&gt;

&lt;p&gt;The first thing that happens is the ACL needs a connection to the server which holds the data. If this is a simple get or put, this will be a connection to a single server, but if it’s a complex operation such as a batch or query, multiple servers might be involved. The ACL has connection pools to each server node, so acquiring this connection is typically a matter of borrowing one from the pool, which is almost instantaneous. However, if the pool has no available connections, a new connection must be established, which can take some time. This is especially true if the connection is over TLS as the handshake between the client and server on a TLS connection can be slow. &lt;/p&gt;

&lt;p&gt;This connection establishment process has its own timeout field, &lt;code&gt;connectTimeout&lt;/code&gt;. This field is in milliseconds and covers both the connection establishment and user authentication (if any). By default, it is set to zero, which is usually a reasonable default – this uses the lesser of the &lt;code&gt;socketTimeout&lt;/code&gt; and &lt;code&gt;totalTimeout&lt;/code&gt;, or 2,000 milliseconds if both are zero. If the application can tolerate extra time when a connection is established, especially when using TLS, then this can be set to a non-zero value. For example, if you want a large &lt;code&gt;connectTimeout&lt;/code&gt; and a small &lt;code&gt;socketTimeout&lt;/code&gt; to allow for protracted TLS handshakes but fast application responses, setting this value to something like 5,000 milliseconds works well.&lt;/p&gt;
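&lt;p&gt;The resolution rule above can be sketched as a small helper function. This is an illustrative model only – the real logic lives inside the client library, and the class and method names here are hypothetical:&lt;/p&gt;

```java
// Sketch of the connectTimeout resolution rule described above.
// All values are in milliseconds; zero means "not set".
public class ConnectTimeoutRule {
    public static int effectiveConnectTimeout(int connectTimeout, int socketTimeout, int totalTimeout) {
        if (connectTimeout > 0) {
            return connectTimeout;              // an explicit value wins
        }
        if (socketTimeout == 0 && totalTimeout == 0) {
            return 2000;                        // fallback when both are zero
        }
        if (socketTimeout == 0) return totalTimeout;
        if (totalTimeout == 0) return socketTimeout;
        return Math.min(socketTimeout, totalTimeout); // lesser of the two
    }
}
```

&lt;p&gt;For example, with zeroes everywhere, the effective connect timeout resolves to 2,000 milliseconds.&lt;/p&gt;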

&lt;p&gt;Once a connection has been retrieved, the API call is sent to the server. Two timeouts come into play at this point: the &lt;code&gt;socketTimeout&lt;/code&gt; and the &lt;code&gt;totalTimeout&lt;/code&gt;. Both are in milliseconds. The &lt;code&gt;socketTimeout&lt;/code&gt; controls how long the ACL waits for a response to a single attempt before retrying or failing, while the &lt;code&gt;totalTimeout&lt;/code&gt; specifies the maximum time the whole call can take, including retries. If the time remaining in the &lt;code&gt;totalTimeout&lt;/code&gt; is less than the &lt;code&gt;socketTimeout&lt;/code&gt;, then the remaining time is used as the &lt;code&gt;socketTimeout&lt;/code&gt; for that attempt. &lt;/p&gt;

&lt;p&gt;In Figure 1, the &lt;code&gt;socketTimeout&lt;/code&gt; is less than the &lt;code&gt;totalTimeout&lt;/code&gt;, so the ACL waits until one of the following occurs: &lt;code&gt;socketTimeout&lt;/code&gt; milliseconds have elapsed, the connection returns a timeout, or the call succeeds. If the call was unsuccessful in this period, the ACL checks the &lt;code&gt;maxRetries&lt;/code&gt; setting to see if any retries remain. For reads, this value defaults to two, giving it a total of three tries – the initial attempt plus the two retries. For writes, this value defaults to zero, so no retries are attempted.&lt;/p&gt;

&lt;p&gt;Figure 1 depicts a read with the default two retries, so once the original &lt;code&gt;socketTimeout&lt;/code&gt; expires, the API call does not return to the application. Instead, the ACL waits for &lt;code&gt;sleepBetweenRetries&lt;/code&gt; milliseconds and then submits the call again. &lt;/p&gt;

&lt;p&gt;This second call may or may not go to the same server. This is controlled by the &lt;code&gt;replica&lt;/code&gt; policy setting, which defaults to &lt;code&gt;SEQUENCE&lt;/code&gt;. With &lt;code&gt;SEQUENCE&lt;/code&gt;, the first attempt is made to the node containing the master record, and on failure, the retry is made to a replica. This can be very useful in a read-sensitive use case. For example, if the &lt;code&gt;socketTimeout&lt;/code&gt; is set to a short interval (say, 20 milliseconds) and retries are enabled with a replica policy of &lt;code&gt;SEQUENCE&lt;/code&gt;, then even if the master node is busy and cannot respond in time, or has fallen over but the cluster has not yet detected this, the retry will go to the replica and may still return data within the application’s SLA.&lt;/p&gt;
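&lt;p&gt;The &lt;code&gt;SEQUENCE&lt;/code&gt; behavior can be pictured as indexing into the partition's node list by attempt number. This is a deliberately simplified model – the real client also accounts for node health and the current partition map:&lt;/p&gt;

```java
// Simplified model of Replica.SEQUENCE node selection: the first attempt
// targets the master node, and each retry moves to the next replica,
// wrapping around if retries outnumber the replicas.
public class SequenceReplica {
    // nodes[0] holds the master copy of the partition; the rest are replicas.
    public static String nodeForAttempt(String[] nodes, int attempt) {
        return nodes[attempt % nodes.length];
    }
}
```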

&lt;p&gt;If this second call also times out after &lt;code&gt;socketTimeout&lt;/code&gt; milliseconds, the ACL again sleeps for &lt;code&gt;sleepBetweenRetries&lt;/code&gt; milliseconds and then makes the second retry (the third attempt overall).&lt;/p&gt;
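&lt;p&gt;Putting these pieces together, the retry timeline can be simulated to see how many attempts actually fit inside the &lt;code&gt;totalTimeout&lt;/code&gt;. The sketch below is a simplified model (hypothetical names) that assumes every attempt times out:&lt;/p&gt;

```java
// Simplified simulation of the retry timeline: each attempt waits up to
// min(socketTimeout, time remaining in totalTimeout), and the call fails
// once totalTimeout elapses, even if retries remain.
public class RetryTimeline {
    public static int attemptsBeforeTotalTimeout(int socketTimeout, int totalTimeout,
                                                 int sleepBetweenRetries, int maxRetries) {
        int elapsed = 0;
        int attempts = 0;
        while (attempts < maxRetries + 1 && elapsed < totalTimeout) {
            attempts++;
            elapsed += Math.min(socketTimeout, totalTimeout - elapsed); // per-attempt wait
            if (elapsed >= totalTimeout) {
                break;                        // total budget exhausted
            }
            elapsed += sleepBetweenRetries;   // pause before the next retry
        }
        return attempts;
    }
}
```

&lt;p&gt;With a &lt;code&gt;socketTimeout&lt;/code&gt; of 100 ms, a &lt;code&gt;totalTimeout&lt;/code&gt; of 250 ms and two retries, only the first two attempts get the full 100 ms; the third is cut short at the 250 ms mark.&lt;/p&gt;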

&lt;p&gt;Let’s take a look at where this leaves us in Figure 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v9THsB90--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/tim-faulkes/using-aerospike-policies-correctly-figure-1_1709244868414.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v9THsB90--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/tim-faulkes/using-aerospike-policies-correctly-figure-1_1709244868414.png" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ACL is about to start the second retry, but at this point, the time remaining in the &lt;code&gt;totalTimeout&lt;/code&gt; is less than the &lt;code&gt;socketTimeout&lt;/code&gt;. Hence, the API call will time out when the remaining portion of the &lt;code&gt;totalTimeout&lt;/code&gt; elapses, before a full &lt;code&gt;socketTimeout&lt;/code&gt; has passed. &lt;/p&gt;

&lt;p&gt;Note that once the &lt;code&gt;totalTimeout&lt;/code&gt; has elapsed, the client API call fails, even if further retries are available. &lt;/p&gt;

&lt;h2&gt;Timeout delay&lt;/h2&gt;

&lt;p&gt;The only timeout setting not covered in the above is the &lt;code&gt;timeoutDelay&lt;/code&gt; setting. This is not strictly related to the API behavior, but rather to the efficiency of the system. As the above shows, when the ACL submits the call to the server node and no response is received within &lt;code&gt;socketTimeout&lt;/code&gt; milliseconds, either the call is abandoned or a retry is performed. But what happens to the connection on which that original request was placed?&lt;/p&gt;

&lt;p&gt;By default, that connection is closed rather than being placed back in the connection pool. This is because it is possible that the server simply did not respond in time and will respond some time later. If the connection were reused instead of being closed, that late response could be mistaken for the response to whichever call reuses the connection next, corrupting that call’s communication. &lt;/p&gt;

&lt;p&gt;However, as discussed above, connection establishment can be expensive. Closing connections typically means that this connection must be re-established later, which can impact the application. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;timeoutDelay&lt;/code&gt; parameter specifies a time interval in milliseconds for those connections to be kept alive. If the server responds within that timeout, the socket is drained and then returned to the connection pool. If the timeout is exceeded, then the connection is closed. &lt;/p&gt;

&lt;p&gt;Note that there is a cost of using the &lt;code&gt;timeoutDelay&lt;/code&gt; as the ACL must keep track of which connections have timed out. It is useful in situations where timeouts may occur frequently (for example, aggressive read settings) and connection establishment is expensive.&lt;/p&gt;

&lt;h2&gt;Default policies&lt;/h2&gt;

&lt;p&gt;Now that the timeout settings are understood, we can discuss default policies. The ACL defines default values for the various policy attributes, and users can customize these defaults once to suit their application instead of repeatedly setting the same values throughout the application code. Default policies are set up when the ACL is created and are used whenever &lt;code&gt;null&lt;/code&gt; is passed as the policy parameter in an API call. Most often, the default policies are customized with the network settings discussed above. &lt;/p&gt;

&lt;p&gt;Here is an example which creates a default &lt;code&gt;writePolicy&lt;/code&gt; with a &lt;code&gt;socketTimeout&lt;/code&gt; of 2,000 milliseconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;WritePolicy&lt;/span&gt; &lt;span class="n"&gt;writePolicyDefault&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WritePolicy&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;writePolicyDefault&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;socketTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;ClientPolicy&lt;/span&gt; &lt;span class="n"&gt;clientPolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClientPolicy&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;clientPolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writePolicyDefault&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writePolicyDefault&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;IAerospikeClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AerospikeClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clientPolicy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;writePolicyDefault&lt;/code&gt; is set on the &lt;code&gt;ClientPolicy&lt;/code&gt; instance before the ACL is created. The ACL constructor takes a copy of the default policies, so they cannot be changed after the client has been created.&lt;/p&gt;

&lt;p&gt;After executing this code, passing &lt;code&gt;null&lt;/code&gt; to a write API call will result in the ACL retrieving and using this &lt;code&gt;writePolicyDefault&lt;/code&gt;. Hence, the following two calls are identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"joe"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWritePolicyDefault&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"joe"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Creating policies&lt;/h2&gt;

&lt;p&gt;Timeout settings are typically set on default policies and reused multiple times across the application. However, application-level settings are done as needed by the application on a per-call basis. Consider a check-and-set scenario where a record has been read from the database, changes made, and then those changes pushed back to the database, but only if the record has not been changed since it was read:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Record&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="nc"&gt;WritePolicy&lt;/span&gt; &lt;span class="n"&gt;writePolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWritePolicyDefault&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generation&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generationPolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationPolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;EXPECT_GEN_EQUAL&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"joe"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code attempts to meet the requirement, getting the &lt;code&gt;writePolicyDefault&lt;/code&gt; from the ACL and then setting the &lt;code&gt;generation&lt;/code&gt; and &lt;code&gt;generationPolicy&lt;/code&gt; correctly. However, this code is very dangerous and will create issues in the application. There is just one &lt;code&gt;writePolicyDefault&lt;/code&gt; on the ACL, and the &lt;code&gt;getWritePolicyDefault()&lt;/code&gt; call returns a reference to that object, not a copy of it. Hence, setting the generation values on that policy will affect all calls on all threads that use the default policy, almost certainly causing many of them to fail with an exception.&lt;/p&gt;

&lt;p&gt;Another common pattern to solve this problem is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;WritePolicy&lt;/span&gt; &lt;span class="n"&gt;writePolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WritePolicy&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generation&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generationPolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationPolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;EXPECT_GEN_EQUAL&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"joe"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this code is an improvement on the previous snippet, it, too, has a problem. In this case, a unique &lt;code&gt;WritePolicy&lt;/code&gt; instance has been created, preventing this call from affecting other calls. However, &lt;code&gt;new WritePolicy()&lt;/code&gt; does not pick up the settings of the default write policy, so settings customized on that default policy, such as the &lt;code&gt;socketTimeout&lt;/code&gt;, will not apply to this new policy.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;correct&lt;/strong&gt; way to create a new policy to override fields is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;WritePolicy&lt;/span&gt; &lt;span class="n"&gt;writePolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WritePolicy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWritePolicyDefault&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generation&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generationPolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationPolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;EXPECT_GEN_EQUAL&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writePolicy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"joe"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, a new &lt;code&gt;WritePolicy&lt;/code&gt; is created, but it gets a copy of all the settings off the default write policy. This object is local to this API call, so there is no contention with other calls.&lt;/p&gt;
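&lt;p&gt;The difference between the three patterns can be demonstrated with a toy policy class that mimics the copy-constructor idiom of the real &lt;code&gt;WritePolicy&lt;/code&gt;. The class and field values here are illustrative, not the actual client classes:&lt;/p&gt;

```java
// Toy policy class mimicking the copy-constructor idiom used by the
// Aerospike Java client's policy classes. Field values are illustrative.
public class ToyWritePolicy {
    public int socketTimeout = 30000; // stand-in for a library default
    public int generation = 0;

    public ToyWritePolicy() {}

    // Copy constructor: the pattern the real WritePolicy copy constructor follows.
    public ToyWritePolicy(ToyWritePolicy other) {
        this.socketTimeout = other.socketTimeout;
        this.generation = other.generation;
    }
}
```

&lt;p&gt;Mutating a shared default instance leaks the change to every caller; &lt;code&gt;new ToyWritePolicy()&lt;/code&gt; silently drops any customized defaults; only the copy constructor keeps the defaults while confining further changes to the one call.&lt;/p&gt;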

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Aerospike’s &lt;code&gt;Policy&lt;/code&gt; classes are both powerful and flexible, allowing fine-grained, per-call control over both network and application-level settings. However, it is important to understand what the settings do to correctly control the API call. Network settings are often overlooked during the application development phase and are frequently poorly understood. A thorough understanding of them is critical to developing applications that perform well not just in the happy-day scenario but also in the face of network issues or server failures.&lt;/p&gt;

&lt;p&gt;Developers who are new to Aerospike should internalize this simple pattern for creating new policies. It ensures the API call executes correctly without affecting other concurrent calls or causing unintended surprises.&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>policies</category>
    </item>
    <item>
      <title>Comparing Aerospike clusters using "queryPartitions"</title>
      <dc:creator>Tim F</dc:creator>
      <pubDate>Thu, 29 Jun 2023 20:04:09 +0000</pubDate>
      <link>https://forem.com/aerospike/comparing-aerospike-clusters-using-querypartitions-4n3e</link>
      <guid>https://forem.com/aerospike/comparing-aerospike-clusters-using-querypartitions-4n3e</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vn5cUWPl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/jason-dent-JVD3XPqjLaQ-unsplash_1687371689588.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vn5cUWPl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/jason-dent-JVD3XPqjLaQ-unsplash_1687371689588.jpg" width="457" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: Photo by Jason Dent on &lt;a href="https://unsplash.com"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aerospike is renowned as a very fast, very scalable database capable of storing billions or trillions of records, as well as being able to replicate the data to multiple remote database clusters. Hence, a common question that arises is: "How can I validate that two clusters are in sync?" This used to be a difficult problem, but new API calls in Aerospike v5.6 make this task substantially easier. In this blog, we will look at one of these new API calls and use it to develop some code showing how a cluster comparator could be written.&lt;/p&gt;

&lt;h2&gt;Premise&lt;/h2&gt;

&lt;p&gt;Let's assume we have 2 clusters called "A" and "B", each with 10 billion records, and the records are replicated bi-directionally. We want to efficiently see if the clusters contain exactly the same set of records. There are a couple of obvious ways to do this: by comparing record counts, and by using a primary index query with batch gets. Both of these approaches have flaws, however. Let's take a look at them.&lt;/p&gt;

&lt;h3&gt;Comparing record counts&lt;/h3&gt;

&lt;p&gt;The first way we could check if the clusters have exactly the same set of records in them is by comparing record counts. Aerospike provides easy ways to get the total number of records in either a namespace or a set, so we could get a count of all the records in both clusters "A" and "B" per set and compare the 2 values. This relies only on metadata which the cluster keeps updated and so is blazingly fast. However, this can only determine that the clusters are different when the counts don't match. &lt;/p&gt;

&lt;p&gt;What this approach lacks is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The ability to determine &lt;em&gt;which&lt;/em&gt; records are missing. Let's say cluster "A" has 10,000,000,000 records and cluster "B" has 10,000,000,005. Which 5 records are missing from cluster "A"? This cannot be determined by this method, although it might be possible to narrow the difference down to the set(s) containing the missing records.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even if the record counts match, there is no guarantee that they both contain the same set of records. Consider a set with records 1,2,3,4,5 with the corresponding set on the other cluster having records 4,5,6,7,8. Both clusters would report 5 records, but the sets clearly do not contain the same records.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Using a Primary Index Query with Batch Gets&lt;/h3&gt;

&lt;p&gt;A different approach without the above limitations would be to do a primary index (PI) query (previously called a scan) on all the records in cluster "A". Then, for each record returned by the PI query, do a read of the record from cluster "B". If the record is not found, this is a difference between the clusters. We could optimize this by doing a batch read on cluster "B" with a reasonable number of keys (for example 1000) to minimize calls from the client to the server. This would complicate the logic a little but it's still not bad.&lt;/p&gt;

&lt;p&gt;However, this approach suffers from the drawback that it will only detect missing records in cluster "B". If a record exists in cluster "B" and not in cluster "A", this approach will not detect it. The process would need to be run again, with "B" as the source cluster and then batch reading the records from cluster "A".&lt;/p&gt;

&lt;h2&gt;Using QueryPartitions&lt;/h2&gt;

&lt;p&gt;Since version 5.6, when Aerospike introduced PI queries that guarantee correctness even in the face of node or rack failures, another option has existed: &lt;code&gt;queryPartitions&lt;/code&gt;. (Note: in versions prior to 6.0, the API call for this is &lt;code&gt;scanPartitions&lt;/code&gt;; it still exists today but is deprecated.) Before we can understand what this call does, we need to understand what a partition in Aerospike is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitions
&lt;/h3&gt;

&lt;p&gt;Each Aerospike namespace is divided into 4096 logical partitions, which are evenly distributed between the cluster nodes. Each partition contains approximately 1/4096&lt;sup&gt;th&lt;/sup&gt; of the data in the cluster, and an entire partition is always on a single node. When a node is added to or removed from the cluster, the ownership of some partitions changes, and the cluster rebalances by moving the data associated with those partitions between cluster nodes. During this migration of a partition, the node receiving the partition cannot serve that data, as it does not yet have a full copy of the partition.&lt;/p&gt;

&lt;p&gt;Data within a partition is typically stored on Flash (SSD) storage and each partition stores the primary indexes for that partition in a map of balanced red-black trees:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T8e-Al0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-19at201PM_1687204855119.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T8e-Al0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-19at201PM_1687204855119.png" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Primary indexes and storage&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each tree is referred to as a "sprig" and the map holds pointers to the sprigs for that partition. There are 256 sprigs per partition by default, but this is changeable via the &lt;a href="https://docs.aerospike.com/reference/configuration#partition-tree-sprigs"&gt;partition-tree-sprigs&lt;/a&gt; config parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  QueryPartitions
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://javadoc.io/static/com.aerospike/aerospike-client/6.1.10/com/aerospike/client/AerospikeClient.html#queryPartitions-com.aerospike.client.policy.QueryPolicy-com.aerospike.client.query.Statement-com.aerospike.client.query.PartitionFilter-"&gt;queryPartitions&lt;/a&gt; method allows one or more of these partitions to be traversed &lt;em&gt;in digest order&lt;/em&gt;. A digest is a unique object identifier created by hashing the record's key and the red-black trees depicted above use the digest to identify the location of a node in the tree. &lt;/p&gt;
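For reference, the partition that owns a record is itself derived from this digest. A small sketch, assuming (as in the Java client's partition calculation) that the low 12 bits of the digest, read little-endian from its first bytes, select one of the 4096 partitions:

```java
public class PartitionIdSketch {
    static final int N_PARTITIONS = 4096;

    // The digest is a 20-byte hash of the set name and user key. Taking its
    // low 12 bits (little-endian) yields a partition ID in 0..4095, so
    // records spread evenly across the partitions.
    static int partitionId(byte[] digest) {
        return ((digest[0] & 0xFF) | ((digest[1] & 0xFF) << 8)) & (N_PARTITIONS - 1);
    }

    public static void main(String[] args) {
        byte[] digest = new byte[20];
        digest[0] = 0x34;
        digest[1] = 0x12;
        System.out.println(partitionId(digest)); // 0x1234 & 0xFFF = 0x234 = 564
    }
}
```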

&lt;p&gt;When a partition is scanned, a single thread on the server that owns the partition traverses each sprig in order, and within each sprig traverses all the nodes of the tree in order. This means not only that the records are always returned in the same order, but also that 2 different namespaces - even on different clusters - which contain records with the same primary keys and the same set names will always be traversed in exactly the same order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing a Comparator
&lt;/h2&gt;

&lt;p&gt;Let's look at how we can use this to implement a comparator between 2 clusters. Let's assume there is only one partition for now. We know that the records in this partition will be returned in digest order, so we can effectively treat it as a very large sorted list. (If there are ~10 billion records in the cluster, there are 4096 partitions so each partition should have about 2.4 million records).&lt;/p&gt;

&lt;p&gt;To illustrate how we will do this, let's consider 2 simple clusters each with 12 records in partition 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3IexUs1H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1021AM_1687364217559.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3IexUs1H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1021AM_1687364217559.png" width="800" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;2 simple partitions and their data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;(Note that the numbers represent the digest of the records).&lt;/p&gt;

&lt;p&gt;If we were to query this partition on cluster 1 and cluster 2 &lt;em&gt;at the same time&lt;/em&gt;, each time we retrieved a record from the queries we would have one of three situations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The record would exist in only cluster 1&lt;/li&gt;
&lt;li&gt;The record would exist in only cluster 2&lt;/li&gt;
&lt;li&gt;The record would exist in both clusters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first time we retrieve a record from each cluster we will get record 1 on each:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---QyCx1Vs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1040AM_1687365057300.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---QyCx1Vs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1040AM_1687365057300.png" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After retrieving the first record on each cluster&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The current record on cluster 1 (Current1) and the current record on cluster 2 (Current2) both have a digest of 1, so there is nothing to do. We read the next record on both clusters to advance to digest 2, which is again the same. We read digest 3 (same on both) so we read the next record on both which gives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UBJHt49U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1038AM_1687365688577.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UBJHt49U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1038AM_1687365688577.png" width="800" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Situation after reading next record after digest 3&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we compare digests here Current1 has digest 5 but Current2 has digest 4. We need to flag the record which Current2 refers to (4) as missing from Cluster 1 and then advance Current2, but &lt;strong&gt;not&lt;/strong&gt; Current1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QgUiGGW9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1033AM_1687366610437.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QgUiGGW9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1033AM_1687366610437.png" width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After flagging record 4 as missing from Cluster 1 and advancing Current2&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We're now back in the situation where the 2 values are equal (5) so they progress again as before. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T532fS9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1104AM_1687369654810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T532fS9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-06-21at1104AM_1687369654810.png" width="800" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Situation after reading next record after digest 5&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here Current1 has 6 but Current2 has 7. We need to flag the record which Current1 refers to (6) as missing from Cluster 2 then advance Current1.&lt;/p&gt;

&lt;p&gt;This process is repeated over and over until the entire partition has been scanned and all the missing records identified.&lt;/p&gt;

&lt;p&gt;The code for this would look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;RecordSet&lt;/span&gt; &lt;span class="n"&gt;recordSet1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;queryPartitions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queryPolicy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statement&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filter1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;RecordSet&lt;/span&gt; &lt;span class="n"&gt;recordSet2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;queryPartitions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queryPolicy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statement&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filter2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;side1Valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getNextRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recordSet1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;side2Valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getNextRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recordSet2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;side1Valid&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;side2Valid&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="nc"&gt;Key&lt;/span&gt; &lt;span class="n"&gt;key1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;side1Valid&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;recordSet1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getKey&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;Key&lt;/span&gt; &lt;span class="n"&gt;key2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;side2Valid&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;recordSet2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getKey&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
             &lt;span class="c1"&gt;// The digests go down as we go through the partition, so if side 2 is &amp;gt; side 1&lt;/span&gt;
            &lt;span class="c1"&gt;// it means that side 1 has missed this one and we need to advance side2&lt;/span&gt;
             &lt;span class="n"&gt;missingRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partitionId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Side&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SIDE_1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
             &lt;span class="n"&gt;side2Valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getNextRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recordSet2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
             &lt;span class="n"&gt;missingRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partitionId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Side&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SIDE_2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
             &lt;span class="n"&gt;side1Valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getNextRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recordSet1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// The keys are equal, move on.&lt;/span&gt;
            &lt;span class="n"&gt;side1Valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getNextRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recordSet1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;side2Valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getNextRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recordSet2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;recordSet1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;recordSet2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
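The snippet above assumes a `compare` helper that orders two current records the way the traversal does, and that keeps the loop draining whichever side still has records once the other is exhausted: it should return a negative value when side 1 has run out (so side 2's remaining records are flagged as missing from cluster 1) and a positive value when side 2 has. A sketch operating directly on the keys' digests, assuming digests compare as unsigned byte strings (`null` stands for an exhausted `RecordSet`):

```java
public class DigestCompareSketch {
    // Orders two digests as unsigned byte strings (an assumption - verify the
    // traversal order against your client/server version). A null digest
    // means that side's query is exhausted.
    static int compare(byte[] d1, byte[] d2) {
        if (d1 == null) {
            return (d2 == null) ? 0 : -1; // side 1 exhausted: drain side 2
        }
        if (d2 == null) {
            return 1; // side 2 exhausted: drain side 1
        }
        for (int i = 0; i < Math.min(d1.length, d2.length); i++) {
            int cmp = Integer.compare(d1[i] & 0xFF, d2[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(d1.length, d2.length);
    }

    public static void main(String[] args) {
        byte[] low = {0x01, 0x02};
        byte[] high = {(byte) 0xF0, 0x02};
        System.out.println(compare(low, high) < 0); // true
        System.out.println(compare(low, null) > 0); // true: side 2 exhausted
        System.out.println(compare(low, low) == 0); // true
    }
}
```

In the real comparator the digests would come from `Key.digest` on the keys returned by each `RecordSet`.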



&lt;h4&gt;
  
  
  Enhancements
&lt;/h4&gt;

&lt;p&gt;This process will identify missing records between the 2 clusters, but it does not validate that the contents of the records are the same on both clusters. If all we need is to identify missing records, the contents are irrelevant - we just need the digest, which is contained in the record metadata. Hence, when we set up the scans we can set the query policy to not include the bin data by using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;queryPolicy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;includeBinData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids reading the records off storage at all and sends a much smaller amount of data per record to the client.&lt;/p&gt;

&lt;p&gt;Should we wish to compare record contents, &lt;code&gt;includeBinData&lt;/code&gt; must be set to &lt;code&gt;true&lt;/code&gt; (the default). Then, when the digests are the same, we can iterate through all the bins and compare their contents. This is obviously slower - Aerospike must now read the records off the drive and transmit them to the client, which then iterates over them to discover differences. However, it is a much more thorough comparison.&lt;/p&gt;
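The bin comparison itself is straightforward. A sketch using plain maps to stand in for each record's bins (in the Java client, `Record.bins` exposes them as a `Map<String, Object>`); the helper name is illustrative:

```java
import java.util.*;

public class BinCompareSketch {
    // Returns the names of bins whose values differ between two records with
    // the same digest, including bins present on only one side.
    static Set<String> differingBins(Map<String, Object> bins1,
                                     Map<String, Object> bins2) {
        Set<String> diffs = new TreeSet<>();
        Set<String> allNames = new HashSet<>(bins1.keySet());
        allNames.addAll(bins2.keySet());
        for (String name : allNames) {
            // Objects.equals handles bins missing on one side (null) too.
            if (!Objects.equals(bins1.get(name), bins2.get(name))) {
                diffs.add(name);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = Map.of("name", "Tim", "age", 39L);
        Map<String, Object> r2 = Map.of("name", "Tim", "age", 40L, "city", "Denver");
        System.out.println(differingBins(r1, r2)); // [age, city]
    }
}
```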

&lt;h3&gt;
  
  
  Handling multiple partitions
&lt;/h3&gt;

&lt;p&gt;This process works great for one partition, but how do we cover all partitions in a namespace? We could simply iterate through all 4096 partitions, applying the same logic. To get concurrency, we could have a pool of threads, each working on one partition at a time; when a thread finishes a partition, it moves on to the next one which has not yet been processed. This is simple, efficient, and allows concurrency to be controlled by the client. Each client-side thread uses just one server-side thread per cluster, so the effect on the clusters is predictable.&lt;/p&gt;
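That thread pool can be sketched with an `AtomicInteger` handing out partition IDs. Here `comparePartition` is a stub standing in for the single-partition merge described above (it just records which partitions were processed); in real code it would run the two `queryPartitions` calls for that partition ID:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionPoolSketch {
    static final int N_PARTITIONS = 4096;
    static final Set<Integer> processed = ConcurrentHashMap.newKeySet();

    // Stand-in for the single-partition comparison loop described above.
    static void comparePartition(int id) {
        processed.add(id);
    }

    static void compareAllPartitions(int threads) {
        AtomicInteger next = new AtomicInteger(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                int id;
                // Each thread claims the next unprocessed partition until
                // none remain, so exactly `threads` server-side query threads
                // are in use per cluster at any moment.
                while ((id = next.getAndIncrement()) < N_PARTITIONS) {
                    comparePartition(id);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        compareAllPartitions(8);
        System.out.println(processed.size()); // 4096
    }
}
```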

&lt;p&gt;It would be possible to have the server scan multiple partitions at once with a single call from the client. However, this is more complex as the results would contain records over multiple partitions, so the client would need to keep state per-partition and identify the partition for each record.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;A full open source implementation of this algorithm can be found &lt;a href="https://github.com/aerospike-examples/cluster-comparator"&gt;here&lt;/a&gt;. This includes many features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optional comparison of record contents&lt;/li&gt;
&lt;li&gt;Logging the output to a CSV file&lt;/li&gt;
&lt;li&gt;The ability to "touch" records which are missing or different to allow XDR to re-transmit those records&lt;/li&gt;
&lt;li&gt;Selection of start and end partitions&lt;/li&gt;
&lt;li&gt;Selection of start and end last-update-times&lt;/li&gt;
&lt;li&gt;Metadata comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whilst it offers many more features than are covered in this blog, it is fundamentally based on the simple algorithm described here, all made possible by the nature of &lt;code&gt;queryPartitions&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>database</category>
      <category>developer</category>
    </item>
    <item>
      <title>Optimizing Server Resources using Uniform Balance</title>
      <dc:creator>Tim F</dc:creator>
      <pubDate>Wed, 19 Apr 2023 22:21:01 +0000</pubDate>
      <link>https://forem.com/aerospike/optimizing-server-resources-using-uniform-balance-4d58</link>
      <guid>https://forem.com/aerospike/optimizing-server-resources-using-uniform-balance-4d58</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vzi--0hY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/patrick-fore-JBghIzjbuLs-unsplash_1681239020273.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vzi--0hY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/patrick-fore-JBghIzjbuLs-unsplash_1681239020273.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: Photo by Patrick Fore on &lt;a href="https://unsplash.com"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aerospike is known for incredible speed and scalability. As a bonus, people using Aerospike often see a far lower Total Cost of Ownership (TCO) compared with other technologies. Optimizing the distribution of data between servers contributes to this low TCO, and Aerospike's uniform balance feature allows an almost perfectly even distribution of data across the servers, resulting in better resource utilization and easier capacity planning. This blog post examines how this feature works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clustering
&lt;/h2&gt;

&lt;p&gt;To fully understand the benefits of how uniform balance works, let's start by reviewing the normal Aerospike partitioning scheme.&lt;/p&gt;

&lt;p&gt;In Aerospike, each node in a cluster must have an ID. This can be allocated through the &lt;code&gt;node-id&lt;/code&gt; configuration parameter, or the system will allocate one, but every node must have a distinct ID. For this explanation, assume there is a 4-node cluster with &lt;code&gt;node-id&lt;/code&gt;s of A, B, C, and D. &lt;/p&gt;

&lt;p&gt;When the cluster forms, once all members agree that they can see all other nodes, a deterministic algorithm runs to form a partition map. Aerospike has 4096 partitions, and the partition map dictates how these partitions are assigned to nodes. For example, a normal partition table might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5FqMq2-g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1013AM_1681316308681.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5FqMq2-g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1013AM_1681316308681.png" width="683" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Partition Table with RF = 2 and 4 nodes&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Forming the partition table
&lt;/h3&gt;

&lt;p&gt;The algorithm used to form this table is fairly simple: the partition ID is appended to the node ID, the result is hashed, and the hashes are sorted. For example, for partition 1 in this table we might have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hash("A:1") =&amp;gt; 12
hash("B:1") =&amp;gt; 96
hash("C:1") =&amp;gt; 43
hash("D:1") =&amp;gt; 120
sort({12, "A:1"}, {96, "B:1"}, {43, "C:1"}, {120, "D:1"}) 
  =&amp;gt; {12, "A:1"}, {43, "C:1"}, {96, "B:1"}, {120, "D:1"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hence node A would be the leader for partition 1, followed by C then B then D. With 2 copies of the data (replication factor = 2), A is the leader and C is the replica.&lt;/p&gt;

&lt;p&gt;This algorithm is applied across all 4096 partitions to form the partition table. This is a deterministic algorithm - the same 4 &lt;code&gt;node-id&lt;/code&gt;s will always form the same partition table.&lt;/p&gt;
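This construction is easy to model. The sketch below uses MD5 purely as a stand-in for Aerospike's real hash function (an assumption - the server's actual hash differs), but it demonstrates the two properties that matter: the table is deterministic, and removing a node leaves the relative order of the remaining nodes unchanged:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class SuccessionListSketch {
    // Succession list for one partition: node IDs sorted by hash("node:partition").
    // The first entry is the leader; with RF = 2, the second is the replica.
    static List<String> successionList(List<String> nodeIds, int partition) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            Map<String, BigInteger> hashes = new HashMap<>();
            for (String node : nodeIds) {
                byte[] h = md.digest((node + ":" + partition)
                        .getBytes(StandardCharsets.UTF_8));
                hashes.put(node, new BigInteger(1, h));
            }
            List<String> order = new ArrayList<>(nodeIds);
            order.sort(Comparator.comparing(hashes::get));
            return order;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        List<String> full = successionList(Arrays.asList("A", "B", "C", "D"), 1);
        List<String> withoutC = successionList(Arrays.asList("A", "B", "D"), 1);

        // Removing C does not reorder the survivors: the new list is the old
        // one with C deleted, so only partitions where C was the leader or
        // replica see any data movement.
        List<String> expected = new ArrayList<>(full);
        expected.remove("C");
        System.out.println(full);
        System.out.println(withoutC.equals(expected)); // true
    }
}
```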

&lt;h3&gt;
  
  
  Partitioning design goal
&lt;/h3&gt;

&lt;p&gt;When this partitioning algorithm was created, the goal was to minimize data movement in the event of a node loss or addition. For example, consider what happens if node C fails. All the nodes in the cluster are sending heartbeats to one another and after a number of missed heartbeats (controlled by the &lt;code&gt;timeout&lt;/code&gt; configuration parameter in the &lt;code&gt;heartbeat&lt;/code&gt; section of the configuration file), the node is considered "dead" and ejected from the cluster. &lt;/p&gt;

&lt;p&gt;Immediately after this, the cluster creates a new partition map using the above algorithm, but without C in it. Given the sorting, the practical effect is to "close" the holes left by the removal of C: everything that was to the right of C in the table shifts one place to the left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ByqxWjqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1036AM_1681318332707.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ByqxWjqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1036AM_1681318332707.png" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: Partition Table after Node C has been removed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As Figure 2 shows, the only partitions where data movement occurs are those where C was either the leader or the replica, with these results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On partition 1, C was the replica and B is promoted to replica with A migrating a copy of the data to B. &lt;/li&gt;
&lt;li&gt;On partition 2, C was the leader, so B is promoted to leader (no data movement necessary as it has a full copy of the data) and D is promoted to replica. B migrates a copy of the data onto D. &lt;/li&gt;
&lt;li&gt;On partitions 0 and 4095, no data movement is necessary as C was neither the leader nor the replica.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Should C be added back into the cluster, it returns to its previous location in the partition map, again minimizing the amount of data required to be migrated. Additionally, if a new node E is introduced to the cluster, only partitions where E is either the leader or the replica will require data migrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue with Design
&lt;/h3&gt;

&lt;p&gt;This algorithm clearly minimizes the amount of data which is migrated between nodes in the case of node removal or addition. However, this is not always optimal from a data volume perspective. Consider a simplified partition map with just 4 partitions in it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F76UkgyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1147AM_1681319145857.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F76UkgyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1147AM_1681319145857.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: Unbalanced Partition Table&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hashing algorithm described above could easily produce a table like the one in Figure 3. This distribution is unbalanced: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node&lt;/th&gt;
&lt;th&gt;Leader Partitions&lt;/th&gt;
&lt;th&gt;Follower Partitions&lt;/th&gt;
&lt;th&gt;Total Partitions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This particular distribution highlights 2 problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data volume inconsistency&lt;/strong&gt; : Node A contains 1 partition, node B contains 3 partitions, so node B will have about 3x the data volume of node A. This can make capacity planning very difficult.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Processing load inconsistency&lt;/strong&gt; : Aerospike reads and writes from the leader node by default. Writes are replicated by the leader to the follower, but in most cases all reads are served by the leader. So for writes in this example, node B will be involved as a leader or replica on 3 partitions (0, 2, 3) but node A will only be involved on 1 partition (1). The result is that node B will be processing about 3 times as much write traffic as node A. For read traffic, node B will serve traffic from 2 partitions (0, 3) but node D is the leader of no partitions and hence will serve no read traffic.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a contrived example based on only 4 partitions. In practice, the larger number of partitions Aerospike uses (4096) smooths this out to some degree, but production customers still saw variances of up to 50% between the nodes in their clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Uniform balance
&lt;/h2&gt;

&lt;p&gt;This variance between nodes, both in data volume and processing requests, caused difficulty in capacity planning and node optimization. Aerospike clusters typically have homogeneous nodes with regard to processors, network, memory and storage, so having a heterogeneous distribution of partitions to nodes can result in sub-optimal configurations.&lt;/p&gt;

&lt;p&gt;The solution, introduced in Aerospike Enterprise version &lt;code&gt;4.3.0.2&lt;/code&gt;, was to tweak the algorithm. The majority of the partition table still follows the original algorithm, but the last 128 partitions are adjusted to even out the number of leader partitions each node has. Consider the contrived table again, with the last row tweaked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6IiMXkFU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1108AM_1681320861763.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6IiMXkFU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1108AM_1681320861763.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Tweaking the Algorithm&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If this table were tweaked in this fashion, every node would end up being the leader of 1 partition and the replica of 1 partition - an optimal distribution. Note that the algorithm doesn't guarantee that the replicas will be perfectly balanced; however, they typically are.&lt;/p&gt;

&lt;p&gt;The tweaking of the last 128 rows in real Aerospike clustering is again done in a deterministic manner, meaning that each node can compute the partition table independently. The same invariant holds: when a node leaves and re-enters the cluster, it will appear in the same location in the partition map for each partition. However, due to this tweaking it is possible that more data might have to be migrated than before on these last 128 partitions. In practice, this extra data migration is small, and the benefits of even partition distribution typically far outweigh this extra network cost.&lt;/p&gt;
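The server's actual tweak is internal, but the idea can be illustrated with a toy greedy pass (purely illustrative - not Aerospike's real algorithm): for each of the final partitions, hand leadership to whichever node currently leads the fewest partitions:

```java
import java.util.*;

public class UniformBalanceSketch {
    // leaders[p] is the leader node for partition p, as produced by the plain
    // hashing algorithm. Reassign the last `tweaked` entries so that leader
    // counts even out (a toy sketch, not the real server code).
    static void evenOutLeaders(String[] leaders, List<String> nodes, int tweaked) {
        Map<String, Integer> counts = new HashMap<>();
        for (String n : nodes) {
            counts.put(n, 0);
        }
        // Count the leaders fixed by the untweaked part of the table.
        for (int p = 0; p < leaders.length - tweaked; p++) {
            counts.merge(leaders[p], 1, Integer::sum);
        }
        // Greedily give each remaining partition to the least-loaded node.
        for (int p = leaders.length - tweaked; p < leaders.length; p++) {
            String least = Collections.min(nodes,
                    Comparator.comparingInt(counts::get));
            leaders[p] = least;
            counts.merge(least, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        // A skewed 8-partition table over 4 nodes; tweak the last 4 slots.
        String[] leaders = {"B", "B", "B", "C", "B", "C", "A", "A"};
        evenOutLeaders(leaders, Arrays.asList("A", "B", "C", "D"), 4);
        System.out.println(Arrays.toString(leaders)); // [B, B, B, C, A, D, A, C]
    }
}
```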

&lt;p&gt;Note that clusters with rack awareness enabled might experience higher migrations under &lt;code&gt;prefer-uniform-balance&lt;/code&gt;. For more information, see this support article about &lt;a href="https://support.aerospike.com/s/article/Is-it-expected-to-have-additional-migrations-after-node-removal-when-prefer-uniform-balance-is-true"&gt;expected migrations after node removal&lt;/a&gt; and &lt;a href="https://support.aerospike.com/s/article/How-does-Rack-Aware-and-Prefer-Uniform-Balance-interact"&gt;How does Rack Aware and Prefer Uniform Balance interact&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;prefer-uniform-balance&lt;/code&gt; configuration flag dictates whether this tweak to the partitioning algorithm is enabled or disabled. It is an Enterprise Edition-only feature; the Community Edition behaves as described above, without tweaking the last 128 rows.&lt;/p&gt;
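&lt;p&gt;For reference, the flag lives in the namespace context of the server configuration (the namespace name &lt;code&gt;test&lt;/code&gt; below is just a placeholder):&lt;/p&gt;

```
# aerospike.conf fragment (Enterprise Edition; default is true from server 4.7)
namespace test {
    prefer-uniform-balance true
}
```

&lt;p&gt;The setting is also dynamic, so it can be changed on a running node, for example via &lt;code&gt;asinfo -v "set-config:context=namespace;id=test;prefer-uniform-balance=true"&lt;/code&gt;; the new value takes effect at the next cluster rebalance.&lt;/p&gt;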

&lt;h2&gt;Actual Results&lt;/h2&gt;

&lt;p&gt;Testing a complex algorithm like this in controlled conditions is good, but the real proof comes when it is deployed by customers. One customer with a large, multi-node cluster was kind enough to send us a screenshot of the transactions per second (TPS) per node of a production cluster, showing the period when &lt;code&gt;prefer-uniform-balance&lt;/code&gt; was set to &lt;code&gt;true&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OKG652lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1153AM_1681321257119.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OKG652lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://developer-hub.s3.us-west-1.amazonaws.com/369902727/Screenshot2023-04-12at1153AM_1681321257119.png" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Multi-node cluster before and after turning on &lt;code&gt;prefer-uniform-balance&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the initial state had a wide variance in TPS between the nodes, ranging from ~15k TPS to ~30k TPS. Once &lt;code&gt;prefer-uniform-balance&lt;/code&gt; was enabled, this quiesced to a steady state with all nodes at roughly the same level of ~22k TPS, with less than 5% variance per node. (Note that node maintenance was underway when this graph was captured, resulting in some nodes having no TPS for short periods of time.)&lt;/p&gt;

&lt;p&gt;The results from customer deployments of the uniform-balance algorithm were so effective, and the downsides so minimal, that &lt;code&gt;prefer-uniform-balance&lt;/code&gt; was defaulted to &lt;code&gt;true&lt;/code&gt; in Aerospike Server version 4.7.&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>database</category>
      <category>scale</category>
    </item>
  </channel>
</rss>
