<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ara</title>
    <description>The latest articles on Forem by Ara (@sadoyan).</description>
    <link>https://forem.com/sadoyan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3477556%2F21a64d2f-2a6d-4a48-b86b-8f43d839658c.png</url>
      <title>Forem: Ara</title>
      <link>https://forem.com/sadoyan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sadoyan"/>
    <language>en</language>
    <item>
      <title>Migrating a ScyllaDB Cluster the “Brain Transplant” Way</title>
      <dc:creator>Ara</dc:creator>
      <pubDate>Sun, 17 May 2026 11:18:57 +0000</pubDate>
      <link>https://forem.com/sadoyan/migrating-a-scylladb-cluster-the-brain-transplant-way-538c</link>
      <guid>https://forem.com/sadoyan/migrating-a-scylladb-cluster-the-brain-transplant-way-538c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeunyxt8ym7ev29eoxxi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeunyxt8ym7ev29eoxxi.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ever tried migrating a ScyllaDB cluster when traditional replication tools are off the table? &lt;/p&gt;

&lt;p&gt;I went a little "mad scientist" and pulled off what I call a cluster brain transplant.&lt;/p&gt;

&lt;p&gt;The idea: copy the raw data files while the source cluster keeps running, then cut over with minimal downtime.&lt;/p&gt;

&lt;p&gt;Risky? Yes. Crazy? Definitely. But it worked — three times in a row. Here's the story of how I did it, why I had no other choice, and what I learned along the way.&lt;/p&gt;

&lt;p&gt;Sometimes, traditional migration methods just don't work. &lt;br&gt;
That was the situation I found myself in when moving a ScyllaDB cluster that was running inside managed Kubernetes. Two-way connectivity between the old and new clusters wasn't possible, which meant I had to come up with something non-traditional.&lt;/p&gt;

&lt;p&gt;After a few experiments, I pulled off what I now call the &lt;strong&gt;"brain transplant" migration&lt;/strong&gt;. It's not the official way, but it worked — and worked surprisingly well.&lt;/p&gt;

&lt;p&gt;I migrated a production database with as little downtime as possible and 100% data consistency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm51l8mcps5taoytwal17.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm51l8mcps5taoytwal17.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Normally, ScyllaDB migrations rely on tools like &lt;code&gt;sstableloader&lt;/code&gt; or replication between clusters. But when your cluster runs in managed Kubernetes, networking and connectivity restrictions can get in the way. In my case, it wasn't possible to directly link the old and new clusters in both directions.&lt;/p&gt;

&lt;p&gt;That left me with a crazy idea: what if I just copied the &lt;strong&gt;entire brain&lt;/strong&gt; of the cluster — all the data files, commitlogs, and system state — into a brand-new cluster, and then carefully brought it to life?&lt;/p&gt;


&lt;h2&gt;
  
  
  The Migration Steps
&lt;/h2&gt;

&lt;p&gt;Here's how it went down:&lt;/p&gt;
&lt;h3&gt;
  
  
  Prepare the destination cluster
&lt;/h3&gt;

&lt;p&gt;This is a one-to-one copy, so the source and destination clusters must have the same number of nodes and identical configuration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install the destination cluster.&lt;/strong&gt; Use the official ScyllaDB documentation to install a brand-new cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration.&lt;/strong&gt; The most crucial setting is the cluster name (&lt;code&gt;cluster_name&lt;/code&gt; in &lt;code&gt;scylla.yaml&lt;/code&gt;). Make sure the source and destination clusters have the same name in their config files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop the destination.&lt;/strong&gt; Shut down all nodes in the destination cluster and delete &lt;strong&gt;ALL&lt;/strong&gt; data directories.&lt;/li&gt;
&lt;/ul&gt;
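&lt;p&gt;A quick way to compare the cluster name on both sides before copying anything. This is a minimal sketch, assuming the name is set on a &lt;code&gt;cluster_name:&lt;/code&gt; line of &lt;code&gt;scylla.yaml&lt;/code&gt;; the helper name &lt;code&gt;cluster_name_of&lt;/code&gt; is hypothetical and not part of any Scylla tooling.&lt;/p&gt;

```shell
# Sketch: print the cluster_name value from a scylla.yaml-style file.
# cluster_name_of is a hypothetical helper, not a Scylla command.
cluster_name_of() {
  # Read the file given as $1, or stdin when no file is given.
  cat "${1:--}" \
    | sed -n 's/^cluster_name:[[:space:]]*//p' \
    | tr -d "'\""
}
```

&lt;p&gt;Running &lt;code&gt;cluster_name_of /etc/scylla/scylla.yaml&lt;/code&gt; on one source node and one destination node should print the same name on both.&lt;/p&gt;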
&lt;h3&gt;
  
  
  1. Full rsync copy
&lt;/h3&gt;

&lt;p&gt;I started with a one-to-one &lt;code&gt;rsync&lt;/code&gt; of all Scylla data from the source cluster to the destination cluster. This took a long time (not surprising, given the dataset size), but it was straightforward. Importantly, the source cluster stayed online and continued serving applications throughout. Here are the exact &lt;code&gt;rsync&lt;/code&gt; commands I used, assuming three nodes to migrate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On source node1:
rsync &lt;span class="nt"&gt;-val&lt;/span&gt; &lt;span class="nt"&gt;--exclude&lt;/span&gt; &lt;span class="s1"&gt;'commitlog/*'&lt;/span&gt; /var/lib/scylla/&lt;span class="k"&gt;*&lt;/span&gt; destinationsrv1:/var/lib/scylla
# On source node2:
rsync &lt;span class="nt"&gt;-val&lt;/span&gt; &lt;span class="nt"&gt;--exclude&lt;/span&gt; &lt;span class="s1"&gt;'commitlog/*'&lt;/span&gt; /var/lib/scylla/&lt;span class="k"&gt;*&lt;/span&gt; destinationsrv2:/var/lib/scylla
# On source node3:
rsync &lt;span class="nt"&gt;-val&lt;/span&gt; &lt;span class="nt"&gt;--exclude&lt;/span&gt; &lt;span class="s1"&gt;'commitlog/*'&lt;/span&gt; /var/lib/scylla/&lt;span class="k"&gt;*&lt;/span&gt; destinationsrv3:/var/lib/scylla
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Incremental rsync runs
&lt;/h3&gt;

&lt;p&gt;After the initial heavy lift, I ran multiple incremental &lt;code&gt;rsync&lt;/code&gt;s. Each one was much faster than the last, because only changed data needed to be copied. Again, the source cluster kept working during this step, so downtime was still zero.&lt;/p&gt;
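&lt;p&gt;A minimal sketch of how one incremental pass could be scripted on a source node. The function name &lt;code&gt;run_pass&lt;/code&gt;, the &lt;code&gt;DRY_RUN&lt;/code&gt; switch and the trailing-slash paths are assumptions for illustration. The &lt;code&gt;--delete&lt;/code&gt; flag is also an addition that is not in the original command; it is included on the assumption that files removed on the source between passes (for example by compaction) should be removed from the destination copy as well.&lt;/p&gt;

```shell
# Sketch: one incremental pass from a source node to its destination twin.
# run_pass, DRY_RUN and --delete are illustrative assumptions.
run_pass() {
  dest_host="$1"
  # Build the command as positional parameters so it is never re-parsed.
  set -- rsync -val --delete --exclude 'commitlog/*' \
    /var/lib/scylla/ "${dest_host}:/var/lib/scylla/"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"   # default: only print what would run
  else
    "$@"                 # DRY_RUN=0: actually run rsync
  fi
}
```

&lt;p&gt;Calling &lt;code&gt;run_pass destinationsrv1&lt;/code&gt; prints the command for inspection; setting &lt;code&gt;DRY_RUN=0&lt;/code&gt; runs it. Repeating the pass until it finishes quickly keeps the final sync at cutover short.&lt;/p&gt;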

&lt;h3&gt;
  
  
  3. The cutover
&lt;/h3&gt;

&lt;p&gt;When it was time to switch, I stopped the applications pointing to the old cluster. The source cluster was still technically alive, but no longer serving traffic. This was the official "downtime" moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Booting up the new cluster
&lt;/h3&gt;

&lt;p&gt;On the destination side, I started the seed node first, waited for it to come up, then started the remaining nodes one by one. This part took some patience. The logs were noisy with strange-looking system messages, but eventually all nodes settled down and came online.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Validating the cluster
&lt;/h3&gt;

&lt;p&gt;With all nodes running, &lt;code&gt;nodetool status&lt;/code&gt; confirmed that the new cluster was healthy. I could connect with &lt;code&gt;cqlsh&lt;/code&gt;, query some tables, and see real data.&lt;/p&gt;
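&lt;p&gt;&lt;code&gt;nodetool status&lt;/code&gt; marks every node with a two-letter code, where &lt;code&gt;UN&lt;/code&gt; means Up and Normal. The sketch below counts nodes in any other state so the check can be scripted. The helper name &lt;code&gt;count_not_un&lt;/code&gt; is hypothetical, and it assumes the state code is the first column of each node line.&lt;/p&gt;

```shell
# Sketch: count nodes that are not in the UN (Up/Normal) state.
# Reads `nodetool status` output on stdin; count_not_un is a hypothetical helper.
count_not_un() {
  # Node lines start with a status letter (U/D) plus a state letter (N/L/J/M).
  awk '$1 ~ /^[UD][NLJM]$/ { if ($1 != "UN") bad++ } END { print bad + 0 }'
}
```

&lt;p&gt;Piping the output through it, as in &lt;code&gt;nodetool status | count_not_un&lt;/code&gt;, should print 0 on a healthy cluster.&lt;/p&gt;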

&lt;h3&gt;
  
  
  6. Repairing
&lt;/h3&gt;

&lt;p&gt;To make sure everything was consistent, I ran &lt;code&gt;nodetool repair&lt;/code&gt; on each destination node, one by one. This is a normal part of cluster maintenance, and it completed without errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Final shutdown of the source
&lt;/h3&gt;

&lt;p&gt;Once I was confident the destination cluster was working correctly, I shut down the old Kubernetes-based cluster for good.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Phantom memories (The final step)
&lt;/h3&gt;

&lt;p&gt;Because this was a brain transplant, the new cluster keeps "phantom memories" of the previous nodes. &lt;code&gt;nodetool status&lt;/code&gt; shows a clean cluster with only the new nodes, and you can read and write data, but you cannot make metadata changes or add or remove nodes.&lt;/p&gt;

&lt;p&gt;The reason is that the cluster still holds information about the source cluster's nodes. Those nodes no longer exist, but Scylla keeps trying to connect to them and fetch information.&lt;/p&gt;

&lt;p&gt;The symptom is log messages like these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scylla:  [shard  0:main] raft_group_registry - (rate limiting dropped 2999 similar messages) Raft server id d9756728-be49-4cbf-8e2c-417aa8b917c1 cannot be translated to an IP address.
scylla:  [shard  0:main] raft_group_registry - (rate limiting dropped 2999 similar messages) Raft server id e671084b-c41f-4eec-a73c-4c2eaf48ac38 cannot be translated to an IP address.
scylla:  [shard  0:main] raft_group_registry - (rate limiting dropped 2999 similar messages) Raft server id 96edc6f8-4e36-4044-ab53-c0a95a3873f7 cannot be translated to an IP address.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Watch the logs for messages like the ones above and run &lt;code&gt;nodetool removenode ID&lt;/code&gt; for each reported ID on any node in the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nodetool removenode d9756728-be49-4cbf-8e2c-417aa8b917c1
nodetool removenode e671084b-c41f-4eec-a73c-4c2eaf48ac38
nodetool removenode 96edc6f8-4e36-4044-ab53-c0a95a3873f7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, nothing is printed to stdout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continue monitoring logs and remove all phantom nodes.&lt;/strong&gt;&lt;/p&gt;
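&lt;p&gt;Collecting the IDs by hand gets tedious when the log is rate limited and noisy. A minimal sketch that pulls the unique Raft server IDs out of log lines shaped like the ones shown above; the helper name &lt;code&gt;phantom_ids&lt;/code&gt; is hypothetical, and the pattern assumes the message wording stays exactly as in those samples.&lt;/p&gt;

```shell
# Sketch: list the unique Raft server IDs reported as untranslatable.
# Reads Scylla log lines on stdin; phantom_ids is a hypothetical helper.
phantom_ids() {
  grep -oE 'Raft server id [0-9a-f-]{36} cannot be translated' \
    | awk '{ print $4 }' \
    | sort -u
}
```

&lt;p&gt;Feed it the node's log, for example &lt;code&gt;journalctl -u scylla-server | phantom_ids&lt;/code&gt; (assuming Scylla logs to the systemd journal under that unit name), and pass each printed ID to &lt;code&gt;nodetool removenode&lt;/code&gt;.&lt;/p&gt;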

&lt;h2&gt;
  
  
  Manual recovery and Raft reset
&lt;/h2&gt;

&lt;p&gt;This procedure is needed because the nodes have changed their IDs but the Raft database has not been cleared. Even after removing the phantom nodes as described above, your cluster may, and most probably will, still keep the old nodes in Raft, so manual removal is required.&lt;/p&gt;

&lt;p&gt;During this procedure your cluster goes through rolling restarts and is put into &lt;strong&gt;RECOVERY&lt;/strong&gt; mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Procedure
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Perform the following query on every alive node in the cluster, using e.g. cqlsh:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;cqlsh&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scylla_local&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'recovery'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'group0_upgrade_state'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Perform a rolling restart of your alive nodes.&lt;/li&gt;
&lt;li&gt;Verify that all the nodes have entered RECOVERY mode when restarting; look for one of the following messages in their logs:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;group0_client - RECOVERY mode.
raft_group0 - setup_group0: Raft RECOVERY mode, skipping group 0 setup.
raft_group0_upgrade - RECOVERY mode. Not attempting upgrade.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Remove all your dead nodes using the node removal procedure.&lt;/li&gt;
&lt;li&gt;Remove existing Raft cluster data by performing the following queries on every alive node in the cluster, using e.g. cqlsh:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;cqlsh&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;TRUNCATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topology&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;cqlsh&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;TRUNCATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discovery&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;cqlsh&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;TRUNCATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group0_history&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;cqlsh&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scylla_local&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'raft_group0_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="6"&gt;
&lt;li&gt;Make sure that schema is synchronized in the cluster by executing &lt;code&gt;nodetool describecluster&lt;/code&gt; on each node and verifying that the schema version is the same on all nodes.&lt;/li&gt;
&lt;li&gt;We can now leave RECOVERY mode. On every alive node, perform the following query:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;cqlsh&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scylla_local&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'group0_upgrade_state'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="8"&gt;
&lt;li&gt;Perform a rolling restart of your alive nodes.&lt;/li&gt;
&lt;li&gt;The Raft upgrade procedure will start anew. Verify that it finishes successfully.&lt;/li&gt;
&lt;/ol&gt;
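&lt;p&gt;The schema check in the procedure above can be scripted rather than eyeballed. A minimal sketch, assuming &lt;code&gt;nodetool describecluster&lt;/code&gt; lists each schema version as a UUID followed by a bracketed list of node addresses; the helper name &lt;code&gt;schema_version_count&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
# Sketch: count distinct schema versions in `nodetool describecluster` output.
# Assumes each version is printed as "UUID: [ip, ip, ...]" on its own line.
# schema_version_count is a hypothetical helper reading stdin.
schema_version_count() {
  grep -cE '^[[:space:]]*[0-9a-f-]{36}: \['
}
```

&lt;p&gt;With a synchronized schema, &lt;code&gt;nodetool describecluster | schema_version_count&lt;/code&gt; should print 1. Any larger number means the nodes still disagree.&lt;/p&gt;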




&lt;h2&gt;
  
  
  Why This Worked
&lt;/h2&gt;

&lt;p&gt;At first glance, this approach sounds risky — copying live data files and commitlogs while the source cluster is still running. And yet, Scylla's design and eventual consistency model made it surprisingly resilient.&lt;/p&gt;

&lt;p&gt;By repeatedly syncing the data and commitlogs, then repairing the new cluster after startup, I ended up with a clean and working copy. It's a bit like pausing a brain, moving it into a new body, and jump-starting it again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It's not the official method.&lt;/strong&gt; This was a pragmatic hack, not a documented procedure. If you can use &lt;code&gt;sstableloader&lt;/code&gt; or proper replication, do that instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental rsync is a lifesaver.&lt;/strong&gt; Each run got faster and gave me confidence that the final cutover would be smooth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expect noisy logs.&lt;/strong&gt; Don't panic if the new nodes shout a lot when starting. Let them stabilize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repair is mandatory.&lt;/strong&gt; Running &lt;code&gt;nodetool repair&lt;/code&gt; at the end ensures consistency across the new cluster.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Would I recommend this approach for everyone? Probably not. But in constrained environments, sometimes you need to think outside the box. &lt;/p&gt;

&lt;p&gt;For me, the "brain transplant" worked — three times in fact, with consistent results.&lt;/p&gt;

&lt;p&gt;It's one of those migration war stories worth sharing. &lt;br&gt;
If you're ever stuck without traditional migration paths, maybe this story gives you a bit of inspiration (and courage) to try something unconventional.&lt;/p&gt;




&lt;p&gt;✅ TL;DR: I migrated a ScyllaDB cluster by rsync'ing its data and commitlogs into a new cluster, booting it up, repairing it, and cutting over apps — a pragmatic "brain transplant" that worked when standard tools weren't an option.&lt;/p&gt;

</description>
      <category>scylladb</category>
      <category>bigdata</category>
      <category>database</category>
    </item>
  </channel>
</rss>
