Forem: Sandeep Kanabar

Shard your OpenSearch indices like a pro!

Sandeep Kanabar — Sun, 31 Mar 2024 18:53:28 +0000

Ever struggled with the growing size of your OpenSearch cluster indices and wondered how you could efficiently manage them?

You can certainly leverage Index State Management (ISM) policies, but knowing the intricacies of sharding goes a long way in helping you scale your cluster efficiently, keep its performance optimal and, most importantly, keep a check on cost ($$).

One might wonder why the need to learn sharding strategies in a managed service. Isn't the service supposed to handle it for you? Well, yes and NO. One analogy that I can give is of cars. There are cars with automatic gears and cars with manual gears. A lot of people these days prefer automatic cars. They run great on highways but what if you need to navigate the car through crowded traffic streets with curvy turns? The automatic car "would" work but it would be far from efficient, performant and scalable. The same goes for managed services.

So how do you shard your indices efficiently? Let's begin with the basics.

First, determine whether your data can be organised into time-based indices. If so, opting for day-wise, weekly, monthly or even yearly indices may make sense. Note that you can always mix and match these granularities. Say your current data flows into day-wise indices; you could then have an ISM policy such that data older than 90 days is re-indexed into monthly indices. Or you can have monthly indices for the current year and re-index the monthly indices of past years into yearly indices.

You can also leverage the open-source Curator tool to manage all this using YAML configuration files, as an alternative to ISM policies.

So the question that arises is: why would you begin with day-wise indices if you want to merge (aka reindex) them into monthly indices later? Why begin with monthly indices if you want to reindex them into yearly ones?

The answer to this lies in "performance" and "efficiency" and also balancing indexing (write performance) and querying (read performance).

Your current indices are receiving live data, so you want to maximise indexing (write) performance. To do that, you can have more shards. More shards means more parallel writes, which improves write throughput.

But here's the catch - the more shards there are, the more time it takes to search across all of them! Meaning query performance will suffer with more shards. Thus, there's a trade-off between indexing and query performance, and the trick lies in striking a balance.

So how do we strike that balance? One way is to keep the current day's index with a higher number of shards, while past indices, which have no data flowing in, can be "reindexed" to reduce the number of shards and then force-merged down to a single (1) segment. When you re-index, the index name will change, but aliasing simplifies this: simply flip your alias to point to the new index name. This strategy helps to boost search/query performance while keeping indexing performance great as well. Win-win!
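
To make the sequence concrete, here is a minimal sketch that only builds the REST calls involved (create a smaller index, reindex, force-merge, flip the alias) and prints them; it does not talk to a cluster. The index names, the alias name and the helper function are hypothetical, chosen just for illustration.

```python
import json

def compaction_requests(source, dest, alias, dest_shards):
    """Return (method, path, body) tuples for compacting a past index.

    All names passed in are illustrative; nothing is sent to a cluster.
    """
    return [
        # 1. Create the destination index with fewer primary shards.
        ("PUT", f"/{dest}", {"settings": {"index": {"number_of_shards": dest_shards}}}),
        # 2. Copy the documents from the old index into the new one.
        ("POST", "/_reindex", {"source": {"index": source}, "dest": {"index": dest}}),
        # 3. Merge every shard of the new index down to a single segment.
        ("POST", f"/{dest}/_forcemerge?max_num_segments=1", None),
        # 4. Flip the alias from the old index to the new one in one call.
        ("POST", "/_aliases", {"actions": [
            {"remove": {"index": source, "alias": alias}},
            {"add": {"index": dest, "alias": alias}},
        ]}),
    ]

requests_plan = compaction_requests(
    source="logs-2024.03.30",
    dest="logs-2024.03.30-compact",
    alias="logs-search",
    dest_shards=1,
)
for method, path, body in requests_plan:
    print(method, path, json.dumps(body) if body is not None else "")
```

Both alias actions go in a single _aliases call, so searches against the alias never see a moment where neither index is attached.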

So how many shards should you begin with?

Work out how much data flows in per day. Say 30 GB. Now, how many nodes do you have? Say 3 nodes. In that case, having 3 primary shards looks optimal at first, as each node will have 1 shard and you can configure 1 replica. However, in this case, the primary shard size of 10 GB is way too small. Ideally, shards should be around 30-50 GB.

Say the daily data flow is 16 GB. In that case, just 1 primary shard would do. Remember that too many small shards are very detrimental to performance, as there's context switching involved with the underlying Lucene indices.

Let's say the indices are monthly indices and the average data flow per day is 30 GB. Thus per month it would average 30 GB * 30 days = 900 GB. Assuming 3 nodes, configuring this index with 12 shards would mean each primary shard size is 900/12 = 75 GB. The ideal size is 30-50 GB and this exceeds it by a huge margin. So let's look at having, say, 21 shards. In this case, each primary shard would be 900/21 = ~43 GB. That's acceptable. With 24 shards, 900/24 = 37.5 GB primary shard size is also a good option. Remember that shards that are too large take a long time to move between nodes and slow down cluster recovery.
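
The arithmetic above is easy to script. The sketch below is only a rough helper, not a sizing rule: the function names are made up, and restricting candidates to multiples of the node count is my own assumption so that shards spread evenly across nodes.

```python
def primary_shard_size_gb(daily_gb, days, primary_shards):
    """Average size of one primary shard for an index covering `days` days."""
    return daily_gb * days / primary_shards

def shard_counts_in_range(total_gb, node_count, low_gb=30, high_gb=50, max_shards=60):
    """Shard counts (multiples of node_count, an assumption) whose
    per-shard size falls within the 30-50 GB target range."""
    return [
        n for n in range(node_count, max_shards + 1, node_count)
        if low_gb <= total_gb / n <= high_gb
    ]

monthly_gb = 30 * 30  # 30 GB per day for 30 days = 900 GB
for shards in (12, 21, 24):
    print(shards, "shards ->", round(primary_shard_size_gb(30, 30, shards), 1), "GB each")

candidates = shard_counts_in_range(monthly_gb, node_count=3)
print(candidates)  # [18, 21, 24, 27, 30]
```

Both 21 and 24 from the example above land inside the candidate list, while 12 does not.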

This also helps you to plan your cluster capacity and scale it accordingly. Archiving old data into cold storage (S3 / Azure Blob Storage / GCS bucket) is a good option. Another option is to snapshot old indices and then just delete them. The snapshots are stored in GCS / S3 / Azure Blob Storage, which is cheap, and can be easily restored on demand.

Let's say you have an index for a customer master and the dataset is very small, like a master inventory of less than 10 MB. In that case, a single shard and a single replica would suffice. You can also opt for 2 replicas, but it all depends. If you take regular snapshots, then you can manage fine with 1 replica. More replicas means more storage and more $$.

Time-based indices have an advantage when it comes to snapshots. Say you have monthly indices and today is 31-March-2024 and you have snapshotted data pertaining to March-2022 into snap-my_monthly_index_2022_03. In that case, you can delete the index corresponding to March 2022 from your cluster if no searches are being performed against it and save storage costs. And in case it's later needed, you can easily restore the index quickly from snapshots.

Hopefully, this beginner guide helps you in your sharding journey. Good luck.

Benefits of time-based indices on reindex, alias, ILM

Sandeep Kanabar — Wed, 01 Feb 2023 11:58:31 +0000

In addition to the benefits listed here, using time-based indices helps in the following:

Avoids having to reindex entire data
Efficient Deletion and application of ILM
Easy to include / exclude indices based on alias

1. Avoids having to reindex entire data

If the data influx increases, we could easily set "number_of_shards": 3 in the index template and this would take effect for tomorrow's day-wise index. The number of shards can thus be changed without the need to reindex any data.
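
As a rough illustration, the snippet below builds the body of a composable index template with 3 primary shards and prints the request. The template name, the index pattern and the helper function are hypothetical.

```python
import json

def daily_index_template(pattern, primary_shards, replicas=1):
    """Body for PUT /_index_template/<name>; applies only to indices
    created after the template is updated."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        },
    }

# "my_index-*" and "my_index_daily" are made-up names for this example.
body = daily_index_template("my_index-*", primary_shards=3)
print("PUT /_index_template/my_index_daily")
print(json.dumps(body, indent=2))
```

Existing day-wise indices keep the shard count they were created with; only tomorrow's index picks up the new value.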

2. Efficient Deletion and application of ILM

Let's say we need to retain data for up to 90 days. Any day-wise index older than 90 days can then be purged / deleted in its entirety. This is far more efficient than purging older records from within indices, which leaves them poorly optimised for search (deleted documents are only marked as deleted until segments are merged).
Also, index lifecycle management becomes simplified with time-based indices.
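
A lifecycle policy would normally take care of this, but the selection logic is simple enough to sketch by hand. The example assumes index names of the form my_index-YYYY.MM.DD; the prefix and the helper name are illustrative only.

```python
from datetime import date, datetime, timedelta

def indices_past_retention(index_names, prefix, today, retention_days=90):
    """Return day-wise indices whose date is more than retention_days old."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of our day-wise indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)

names = [
    "my_index-2022.10.01",
    "my_index-2022.11.03",
    "my_index-2023.01.31",
    "other-2020.01.01",
]
expired = indices_past_retention(names, "my_index-", today=date(2023, 2, 1))
for name in expired:
    print("DELETE /" + name)
```

Dropping a whole index this way is a single cheap operation, unlike a delete-by-query over the same documents.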

3. Easy to include / exclude indices based on alias

Let's assume the cluster needs to retain 90 days data but needs to search only on the last 60 days data. Alias to the rescue. In this case, define an alias in index template that gets mapped to newly created day-wise indices. As soon as a past index becomes older than 60 days, the alias is removed from that index. This ensures that at any given point of time, the alias will point to a maximum of 60 day-wise indices.
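
The alias clean-up can be sketched as follows. This only builds the body for the _aliases API; the alias name, index prefix and function are assumptions, and the input list is expected to contain only indices that currently carry the alias.

```python
from datetime import date, datetime, timedelta

def alias_removal_actions(aliased_indices, prefix, alias, today, search_days=60):
    """Build an _aliases body that detaches the alias from indices that
    have fallen out of the search window (today counts as day 1)."""
    cutoff = today - timedelta(days=search_days)
    actions = []
    for name in sorted(aliased_indices):
        day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        if day <= cutoff:
            actions.append({"remove": {"index": name, "alias": alias}})
    return {"actions": actions}

aliased = ["my_index-2022.12.03", "my_index-2022.12.04", "my_index-2023.02.01"]
alias_body = alias_removal_actions(
    aliased, prefix="my_index-", alias="my_index-search", today=date(2023, 2, 1)
)
print("POST /_aliases", alias_body)
```

The indices themselves stay in the cluster until the 90-day retention removes them; they just stop being searched through the alias.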

Sizing shards using time-based indices

Sandeep Kanabar — Wed, 01 Feb 2023 11:41:36 +0000

This post lists a few advantages of making use of time-based indices (as well as data streams) in Elasticsearch.

Increasing / Decreasing the number of shards becomes easy
Helps to plan cluster capacity and growth size
Easily determine optimum number of shards

1. Increasing / Decreasing the number of shards becomes easy

Say an index template for day-wise indices is configured with 1 shard in the index settings. In case the indexing rate is slow or the shard size becomes too large (> 40-50 GB), the index template can easily be modified to increase number_of_shards to 3 or 5 or n, and this takes effect from the next day. Similarly, if a day-wise index pattern is configured with more shards than required (oversharded), reducing the number of shards is pretty easy, as it's just a matter of changing the template, which would take effect the next day (earlier indices keep their shard count unless they are re-indexed).

2. Helps to plan cluster capacity and growth size

Let's say 100 events per second flow into an Elasticsearch cluster and each event averages about 1 KB in size. Thus, per day, there would be:
86400 seconds * 100 events/second = 8,640,000 events.

Since each event averages about 1 KB, the total size of 8,640,000 events = 8,640,000 * 1 KB = 8,640,000 KB. Converting to GB: 8,640,000 / (1024 * 1024) = ~8.24 GB.

Thus, rounding up, the day-wise index size would be ~9 GB per day without any replicas. Considering 1 replica, the size per day would be ~18 GB and the size for 30 days would be ~540 GB. This helps with capacity planning and estimating cluster growth rate.
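
The same estimate as a few lines of Python, in case you want to plug in your own numbers. The function names are illustrative, and the raw event size is taken as the on-disk size, which is a simplification.

```python
SECONDS_PER_DAY = 86_400
KB_PER_GB = 1024 * 1024

def daily_primary_gb(events_per_second, avg_event_kb):
    """Raw data volume per day in GB, before replicas."""
    return events_per_second * SECONDS_PER_DAY * avg_event_kb / KB_PER_GB

def cluster_storage_gb(daily_gb, replicas, days):
    """Storage needed for `days` days including replica copies."""
    return daily_gb * (1 + replicas) * days

raw = daily_primary_gb(events_per_second=100, avg_event_kb=1)
print(round(raw, 2))  # 8.24

planned_daily_gb = 9  # rounded up, as in the text
total = cluster_storage_gb(planned_daily_gb, replicas=1, days=30)
print(total)  # 540
```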

3. Easily determine optimum number of shards

With a data set of about 9 GB per day, for a day-wise index, we could start by setting "number_of_shards": 1 in the index template, since each primary shard would be about 9 GB, which is pretty reasonable for a single shard. Shards for time-based indices can be in the range of 10-50 GB as mentioned here. With a bit of trial and error based on the daily ingestion rate, we can arrive at an optimum shard size that helps in stabilizing the cluster and boosting performance.

Guido Lena Cota - Practice Mock Tests for (ECE) exam

Sandeep Kanabar — Wed, 25 Jan 2023 12:39:25 +0000

While I was preparing for the Elastic Certified Engineer exam, I googled about mock practice tests and chanced upon this excellent blog series of 4 tests. (Scroll to the bottom of the blog for links.)

Attempting the exercises in these links did wonders to my confidence as I found lapses in my preparation strategy, got a chance to correct them and also got exposure to the questions and how to answer them.

As for the format, while it can be found from Rich Raposa’s webinar, the webinar tells you what to expect but attempting the mock exercises actually gives you a feel of the exam which makes a difference.

From the exercises, the first part is about deploying and operating a cluster. The first exercise – configure the nodes to avoid split brain - is no longer on the agenda so that can be skipped. Same way exercise 2 can be skipped as well. Exercise 3 is about optimization and could help you with Cluster Management section e.g. diagnose shard issues and repair a cluster’s health. Exercise 4 is about snapshots, cross-cluster search which again comes under Cluster Management so both exercise 3 and 4 are helpful.

The second part is pretty relevant as it covers the topics of data management – index, templates (index and dynamic) and also data processing – reindex, ingest pipeline, painless scripting. One thing to note is that the exam can combine multiple objectives into one question. E.g. a question on re-index could also include ingest pipelines. Do attempt all the exercises in this post.

The third part covers mapping and text analysis, which would be under the Data Processing category. This is very important, and it would help to attempt all the exercises here. Note that this does NOT have exercises for runtime fields.

The final part covers exercises for Search and Aggregations and forms a very important part of exam preparation. One thing to note is: Pipeline and Matrix aggregations don’t look to be in the exam agenda. Please skip whatever doesn’t conform to the agenda. In case of any doubts about whether a particular topic is on exam agenda or not, just drop an email to certification[at]elastic[dot]co. The Elastic Certification team is pretty responsive and helpful.

Final Note: These exercises are excellent to boost your preparation strategy and level up your confidence but do NOT cover all the topics so you'll likely need mock exercises from other sources for the missing topics.

Practice Mock Tests for the Elastic Certified Engineer (ECE) exam

Sandeep Kanabar — Sun, 22 Jan 2023 17:35:22 +0000

So you've gone through the syllabus and prepared all the necessary topics for the ECE exam. You are days away from attempting the exam. How do you get that booster dose of confidence to tackle the exam? How do you avoid those nervous moments on seeing the first question in the exam and feeling your heart sink? There are times when we prepare the syllabus in entirety and yet we are short of confidence in facing the exam due to lack of practice of facing the questions.

That's where Mock tests and practice exercises can help. They help us to validate and solidify our preparation strategy. They sort of give us the much needed fillip to bolster our confidence and face the exam. Imagine preparing a question and seeing a similar one appearing in the exam. Instead of nervousness, we smile and are excited to tackle the exam.

Once my preparation was done, roughly 2-3 days before the exam, I began preparing mock tests that were available for free. I didn't have much time in hand and was just looking to up my confidence. Also, preparing is one thing but seeing the questions and forming your thoughts and answering is altogether a different thing.

For my exam, I gained a lot of confidence by practising the mock tests and exercises available for free at guido-lenacota and acloudguru. However, you can also choose to purchase the Elastic subscriptions. The standard and professional versions come with a practice exam attempt, which is very beneficial. And even Udemy has come up with a practice test which is pretty affordable.

Hope this post helps you in your certification journey.

Disclaimer: This is not to suggest that mock tests and practice exercises are mandatory. Even without them, you can ace the exam and pass it with flying colours. It's just that for some of us, we feel better prepared on attempting mock tests.

Elastic Certified Engineer Certification - is it worth it?

Sandeep Kanabar — Sun, 08 Jan 2023 12:04:30 +0000

Is it worth achieving Elastic Certified Engineer Certification?

Oftentimes we wonder whether the Elastic Certified Engineer exam is worth the effort.

Will it add value to our career?
Will it boost our knowledge and understanding of the core concepts?
Will it help us with practical implementation of ES in our day-to-day job?

I had the very same questions when I signed up for the Elastic Certified Engineer exam and it wasn't until I actually got to solid preparation that most of my above doubts were laid to rest. The remaining lingering doubts got resolved when I had a cursory glance at the excellent questions that were part of my exam.

This blog post is a culmination of my thoughts on above questions. Please note that I'm sharing this entirely in my personal capacity and passion.

Thoughts at Preparation Stage

After I signed up for the exam, given that we all have a busy work life, I kept wondering why did I even sign up for this? I kept telling myself why did I unnecessarily over-burden myself. As the deadline kept approaching, these voices grew louder and a part of me just wanted to skip giving this exam. "After all, no one would know if you don't give the exam. Just chuck it" - my mind voice told me. I wasn't convinced. I was determined to give the exam since I was very passionate about ES. "You've quite a few years of experience with ES. You don't need to practice much. You'll easily sail through", the voice croaked again. For a better part I listened to that voice until it was just about 3 weeks to go for the exam. I then decided to sit down and go through the agenda first to get a gist of the topics and my heart sank. There was SO MUCH I DIDN'T KNOW.

In an instant, all that overconfidence vanished. All those thoughts of I-know-ES-well melted. Humility returned. Began reading very seriously every single day for a few hours. Pored over the amazing documentation and found there was so much to learn. So many new things. So much of un-learning and re-learning.

I was pretty familiar with ES 6.x, managing multiple 6.x stacks in production, but the exam back then was on version 7.13. (Right now it's on version 8.1. Elastic doesn't change exam versions too frequently and that's a good thing.) There were so many improvements from 6.x to 7.x that it felt overwhelming. Went through all the breaking changes docs to familiarise myself with what's new and then came back to the exam agenda. While I was quite familiar with ingest processors, I realised that a large number of ingest processors had been added in 7.x that would make my work a breeze. What would previously take me a block of code with if and else would now be just a few lines.

I had fond memories of implementing cross-cluster replication using Kafka MirrorMaker way back in 2017. It was refreshing to see Cross-Cluster Replication (CCR) in action making the replication so easy and super intuitive. I was used to one way of writing queries and reading the documentation during preparation made me realise I could write it much better and more efficiently. There were so many AHA moments during preparing and I made multiple mental notes of correcting a couple of things and replacing those with their more efficient counterparts.

The exam required knowledge of runtime fields and this was something very new for me. Runtime fields are incredibly powerful if used the right way. As I continued preparing, I couldn't help but thank my stars that I signed up for the exam as it made me go through all these topics and get a good grasp on a number of things I wasn't aware of.

I always made use of time-based indices and it was a pleasant surprise to see the wonderful engineers at Elastic introduce data streams which was a much-needed feature.

Thoughts during Exam time

The first thing I did when my exam started was to have a cursory glance at the questions to gauge my preparation and confidence level. One look at the questions and I couldn't help marvelling. The questions were very practical, in fact quite a few matched what I was actually doing in my day-to-day job. I had attempted a particular re-indexing strategy at work and a question on re-index was pretty similar to what I had done. I smiled. The questions on queries, aggregation, ccr were so relevant and resembling real world scenarios. I also realised there was neither guesswork nor cramming involved here. You needed to really understand the ins and outs of concepts to clear this exam and at that point I felt really glad that I took up this exam which enriched my knowledge immensely. A huge shoutout and respect to the creators of this exam.

Conclusion

Hopefully, this blog post makes it easier for you in case you are wondering if it's worth attempting the Elastic Certified Engineer Exam. YES, IT IS. It would be the most productive use of your time and effort. Go for it.

Acing Elastic Certified Engineer Exam

Sandeep Kanabar — Sun, 08 Jan 2023 03:03:13 +0000

This blog post shares some basic tips and techniques that can be helpful in acing the Elastic Certified Engineer Exam.

First things first

Check the Exam Version (8.1 at the time of writing this post)
Note the duration - 3 hours
Check Exam Topics
Read the complete FAQ

All of the above can be found at Elastic Certified Engineer Exam page.

Please do not skip reading the FAQ.

Next Steps

Without missing a beat, go through Rich Raposa's webinar. This is extremely important as it not only gives an overview of what to expect in the exam, but also shows what the exam environment looks like. In essence, it gives some familiarity with the exam environment, which is immensely helpful. The webinar is dated July 2021 but it's still pretty relevant.
Read this excellent post by Surbhi Mahajan. It has links to a plethora of resources including mock tests.
Re-read the Exam Topics. This is extremely important, as it's easy to get waylaid during exam preparation when curiosity gets the better of us and we start reading and trying out topics that are not on the syllabus / agenda. If you have any questions related to the topics, send an email to certification@elastic.co. The certification folks at Elastic are pretty empathetic and very responsive.

Preparation Strategy

The exam topics have 5 main sections. The questions will be distributed across them, so it makes sense to prepare from all the sections.
From each section, read a few topics and gain some confidence. Confidence is the key here. If you find yourself losing confidence, take up a topic that's more familiar, tackle it, regain confidence and then take up challenging / unfamiliar topics.
The exam has around 10 questions with a time duration of 3 hours (180 minutes). So approx 17 minutes per question with a buffer of 10 mins. During the exam, don't spend too much time on one question as it affects confidence. First attempt the questions that seem familiar and then use the remaining time to attempt other questions. There's partial grading so do attempt all the questions.
Be well-versed with Elastic Documentation. Know how to navigate and find the relevant info. The documentation is fully available to use during the exam so no need to memorise the APIs. Just remember where to look.

Hopefully this helps you in your preparation for the ECE Exam. Good luck!

Optimum Sharding strategy in OpenSearch

Sandeep Kanabar — Tue, 01 Nov 2022 06:56:27 +0000

This article explores a few tips on optimum sharding strategy in OpenSearch.

Use time-based indices wherever possible. There are a number of advantages of using time-based indices as mentioned in this article.
If unsure, begin with 1 shard. Time-based indices offer the flexibility to modify the number of shards at any time.
E.g.

if the event count per second is 100 and each event is 1 KB, then per day
number of events = 100 per sec * 86,400 secs in a day = 8,640,000
approx size of each event = 1 KB
size of all events per day = 1 KB * 8,640,000 = 8,640,000 KB = ~9 GB per day

A shard can comfortably hold around 30-50 GB of data. In the above scenario, with a daily dataset size of 9 GB, a single shard should suffice in case of day-wise indices.

Consider another scenario -

If the event count per second is 200 and each event is 2 KB, then per day
number of events = 200 per sec * 86,400 secs in a day = 17,280,000
approx size of each event = 2 KB
size of all events per day = 2 KB * 17,280,000 = 34,560,000 KB = ~34 GB per day

Here also, a single shard might suffice, but it would slow down indexing. Opting for 3 primary shards would mean each shard would be ~12 GB.

For the second scenario above, shards of ~12 GB might look too small, but past indices, being read-only, could be force-merged to 1 segment. Alternatively, the number of shards could be reduced for past indices by re-indexing them, e.g. reindex day-wise indices into monthly indices and then force-merge those. This could turn 30 day-wise indices, each having 1 shard (30 shards in total), into a single monthly index with, say, 9 or 12 shards depending on the size of the shards.
The best way is to experiment and find out what works best. Day-wise indices offer scope to experiment, as the template can easily be modified to vary the number of shards for newly created indices.
Keep shards EVEN-sized, even across different types of indices. E.g. if a twitter index has 5 shards of 10 GB each, then design the posts index such that its shard size is also roughly 10-20 GB. The reason is that if a twitter index shard is 10 GB and a posts index shard is, say, 50 GB, disk usage across nodes might end up uneven.

Feel free to add your questions / thoughts in the comments below.

Advantages of using time-based indices in OpenSearch

Sandeep Kanabar — Sat, 06 Nov 2021 22:37:17 +0000

This post lists a few advantages of using time-based indices in an OpenSearch cluster.

Increasing / Decreasing the number of shards becomes easy
Helps to plan cluster capacity and growth size
Easily determine optimum number of shards
Avoids having to reindex entire data
Efficient Deletion and application of ISM
Easy to include / exclude indices based on alias
Snapshot and Restore becomes a breeze with day-wise indices
Apply best_compression to day-wise indices
Force-merge past indices

1. Increasing / Decreasing the number of shards becomes easy

Say an index template for day-wise indices is configured with 1 shard in the index settings. In case the indexing rate is slow or the shard size becomes too large (> 50 GB), the index template can easily be modified to increase number_of_shards to 3 or 5, and this takes effect from the next day. Similarly, if a day-wise index pattern is configured with more shards than required (oversharded), reducing the number becomes easy.

2. Helps to plan cluster capacity and growth size

Let's say 100 events per second flow into an OpenSearch cluster and each event averages about 1 KB in size. Thus, per day, there would be:
86400 seconds * 100 events/second = 8,640,000 events.

Since each event averages about 1 KB, the total size of 8,640,000 events = 8,640,000 * 1 KB = 8,640,000 KB. Converting to GB: 8,640,000 / (1024 * 1024) = ~8.24 GB.

3. Easily determine optimum number of shards

4. Avoids having to reindex entire data

5. Efficient Deletion and application of ISM

Let's say we need to retain data for up to 90 days. Any day-wise index older than 90 days can then be purged / deleted in its entirety. This is far more efficient than purging records from within indices.
Also, applying index state management becomes simpler with time-based indices.

6. Easy to include / exclude indices based on alias

7. Snapshot and Restore becomes a breeze with day-wise indices

Say you have an index named my_index-2021.11.04 created on Nov 04, 2021. On Nov 05, 2021 at, say, 00:45 hours, when data is no longer being written to my_index-2021.11.04, a snapshot named snap-my_index-2021.11.04 could be triggered for that index. This snapshot would contain just my_index-2021.11.04. In case the index is deleted and needs to be restored, it can easily be restored from the snapshot snap-my_index-2021.11.04.
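
Here is a small sketch of the two snapshot API calls this involves, built as plain data and printed rather than sent. The repository name my_repo and the helper function are hypothetical; the snapshot repository itself must already be registered.

```python
import json

def snapshot_requests(repository, index_name):
    """Return the create and restore requests for a single-index snapshot."""
    snapshot = f"snap-{index_name}"
    body = {"indices": index_name, "include_global_state": False}
    create = ("PUT", f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true", body)
    restore = ("POST", f"/_snapshot/{repository}/{snapshot}/_restore", dict(body))
    return create, restore

create_req, restore_req = snapshot_requests("my_repo", "my_index-2021.11.04")
for method, path, body in (create_req, restore_req):
    print(method, path, json.dumps(body))
```

Limiting "indices" to the one day-wise index keeps each snapshot small and makes it obvious which snapshot to restore for a given day.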

8. Apply best_compression to day-wise indices

The index template can be modified to set "codec": "best_compression" in index settings i.e.

    "settings": {
      "codec": "best_compression"
    }

Depending on the use case, this could save anywhere from 10% to 30% of disk space, or even more. Your mileage will vary.

"codec": "best_compression" CANNOT be applied dynamically to existing open indices. The index needs to be closed first, then the setting applied, and then the index needs to be opened again.

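For an existing index, the close / update / open sequence could look like the sketch below. It only lists the requests in order; the helper name is made up and the index name is reused from the snapshot example.

```python
import json

def best_compression_requests(index_name):
    """Requests, in order, to switch an existing index to best_compression."""
    return [
        # The codec is a static setting, so the index must be closed first.
        ("POST", f"/{index_name}/_close", None),
        ("PUT", f"/{index_name}/_settings", {"index": {"codec": "best_compression"}}),
        # Reopen the index so it can serve searches again.
        ("POST", f"/{index_name}/_open", None),
    ]

codec_steps = best_compression_requests("my_index-2021.11.04")
for method, path, body in codec_steps:
    print(method, path, json.dumps(body) if body is not None else "")
```

The index is unavailable for reads and writes while it is closed, so this is best done on past indices during a quiet period.
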
9. Force-merge past indices

Since data gets written only to the current day's index, all past indices are effectively read-only as long as past data is never updated. Such indices can therefore be force-merged by setting max_num_segments to 1. This boosts search speed tremendously.